# Execute this first
#
# * trigger notebook styling
# * check if notebook had been modified since its distribution
#
# Note: executing any cells before this modifies the notebook.
#
%run src/init_notebooks.py
hide_toggle()
check_notebook()
Calculating PMFs with AWH in GROMACS¶
Author: Viveca Lindahl (reviewed/updated by Alessandra Villa and Christian Blau)
Goal: Learn how to calculate potential of mean forces with accelerated weight histogram (AWH) method in GROMACS
Prerequisites: Know how to run an MD simulation, basis knowledge of using a Linux shell, basis knowledge of thermodynamics and awareness of accelerated weight histogram method (see reference below).
Time: 80 minutes
Software: GROMACS 2022 (2021, 2020, 2019, 2018), python modules: numpy, matplotlib, re, nglviewer (https://nglviewer.org/)
Optional Software: visualization software [VMD](https://www.ks.uiuc.edu/Research/vmd), Xmgrace plotting tool
Here we learn how to calculate the potential of mean force (PMF) along a reaction coordinate (RC) using the accelerated weight histogram method (AWH) in GROMACS. We will go through both how to set up the input files needed, as well as how to extract and analyze the output after having run the simulation.
AWH applies a time-dependent bias potential along the chosen RC, which is tuned during the simulation such that it “flattens” the barriers of the PMF to improve sampling along the RC. With better sampling, the PMF can be calculated more accurately than using unbiased MD.
References¶
For more details about the AWH method and how it can be applied we refer to * The GROMACS reference manual section about AWH, available on the GROMACS documentation website * BioExcel Webinar: Accelerating sampling in GROMACS with the AWH method * Lindahl, V., Lidmar, J. and Hess, B., 2014. Accelerated weight histogram method for exploring free energy landscapes. The Journal of chemical physics, 141(4), p.044110. doi:10.1063/1.4890371 * Lindahl, V., Gourdon, P., Andersson, M. and Hess, B., 2018. Permeability and ammonia selectivity in aquaporin TIP2;1: linking structure to function. Scientific reports, 8(1), p.2995. doi:10.1038/s41598-018-21357-2 * Lindahl, V., Villa, A. and Hess, B., 2017. Sequence dependency of canonical base pair opening in the DNA double helix. PLoS computational biology, 13(4), p.e1005463. doi:10.1371/journal.pcbi.1005463
The case study: DNA base pair opening¶
We will calculate a PMF for opening a DNA base pair. The DNA double helix is a very stable structure. Opening a base pair requires breaking hydrogen bonds between the bases and crossing a high free energy barrier. That’s why we need to enhance the sampling by applying a bias! As our reaction coordinate we use the distance between the two atoms forming the central hydrogen-bond the two bases in a pair.
Let’s have a look at the system and the reaction coordinate. We have hidden all the water to better see the DNA itself, and we have put punchy colors on the atoms defining the RC of our target base pair. Now run this and the DNA should pop up:
import nglview as ng
view = ng.show_file("visualization/dna-centered.pdb", default=False)
view.center(selection="DA or DT")
view.add_representation("cartoon")
view.add_representation("base", selection="not DA and not DT")
view.add_representation("ball+stick",selection="DA")
view.add_representation("ball+stick",selection="DT")
view.add_representation("ball+stick",selection="DA and .N1", color="green")
view.add_representation("ball+stick",selection="DT and .N3", color="green")
view
# click and drag to rotate, zoom with your mouseweel
# for more infor on this viewer have a look at https://github.com/nglviewer/nglview
Rotate the structure and look for the two (nitrogen) atoms in green. The distance between these will serve as our reaction coordinate for that base pair.
In alternative, you can use VMD to visualize the reaction coordinate on
your local machine. The -e
flag below tells VMD to excute the
commands that are in the following .tcl script. These commands change
how the system is visually represented. For instance, we have hidden all
the waters to better see the DNA itself, and we have put punchy colors
on the atoms defining the RC of our target base pair. To run VMD from
within this notebook, remove the comment character (#) in the following
cell and VMD should pop up:
#!vmd visualization/dna-centered.gro -e visualization/representation.tcl
Rotate the structure and look for the two (nitrogen) atoms in green. The distance between these will serve as our RC for that base pair. Quit VMD to continue.
The MD parameter file¶
We’ll now assume we have already built and equilibrated the system, so we are almost ready to go. To use AWH we basically just need to add some extra parameters in the .mdp file. Go to and check out the directory that has all the run files of our first AWH example:
%cd awh-simple
!ls -l
Find the differences between the mdp file using AWH and an mdp file for a vanilla MD simulation:
!diff no-awh.mdp awh.mdp
Here ‘<’ refers to the content of the first argument to diff
(the
non-AWH .mdp file) and ‘>’ to the second (the AWH .mdp file). E.g., we
increased the number of steps (nsteps
) for AWH. The more relevant
parameters are the ones prefixed pull
and awh
. The awh-.mdp
options are documented at
https://manual.gromacs.org/current/user-guide/mdp-options.html#mdp-awh.
The index file¶
The .mdp file now depends on some definitions of atom groups,
base_N1orN3
and partner_N1orN3
. We need to make an index (.ndx)
file defining these and provide as input when generating a .tpr file
with gmx grompp
. Here our groups are as simple as they get: each
group contains a single nitrogen atom (with names ‘N1’ or ‘N3’) . But
don’t get tempted to edit an index file manually! The traditional tool
to use is gmx make_ndx
and a more general and powerful tool is
gmx select
. We focus on AWH here and have provided the index file.
Double-check that the pull groups (e.g. pull-group1-name
) from the
.mdp file are actually defined in the index file:
!grep -A 1 N1orN3 index.ndx # '-A 1' to show also 1 line after the match
One atom per group looks right. In a real study, a better check could be to visualize the atom indices (e.g. with VMD).
Starting the simulation¶
Here we won’t run the actual simulation; a directory with data has
already been provided. But for practice, we will generate a tpr, as
usual with grompp
. The flag -quiet
tells grompp
to be less
verbose.
!gmx grompp -quiet -f awh.mdp -n index.ndx -p topol.top -o topol.tpr
Note that the values of the pull coordinate values in the provided .gro
file are printed by grompp
. Does it look reasonable?
Check the log file¶
Now let’s look at the data we’ve provided. Some information related to
the AWH initial convergence can be found in the mdrun
log file.
!grep 'covering' data/md.log
!grep 'out of the' data/md.log
Here, a “covering” refers to an end-to-end transition across the
sampling interval [awh1-dim1-start
, awh1-dim1-end
]. Note that
the first transitions are faster than the later ones. Why is that?
The AWH and initial stage and convergence¶
The overall update size of the free energy estimate, here denoted by \(|\Delta F|\), is a free parameter in AWH. It is very important because converging the free energy relies on \(|\Delta F|\to 0\) for large times \(t\). Also, the time-dependent bias function is a direct function of the current free energy estimate. Thus, initially \(|\Delta F|\) has to be large enough that the bias fluctuations efficiently promote escaping free energy minima.
In the AWH initial stage, \(|\Delta F|\), is kept constant during each transition across the sampling interval. After the interval has been covered, the update size is scaled down by a constant factor. This leads to a stepwise decrease of the update size in the initial stage, which approaches an exponential decay \(|\Delta F|\sim 1/e^{\alpha t}\). After exiting this stage, the update size instead decreases steadily with time, \(|\Delta F| \sim 1/t\). The latter has proven to be the right long-term time evolution for the free energy to converge, but if applied from the start it generally decreases the update size too quickly. The exit time is determined dynamically by AWH. The exit of the initial stage should occur before the initial exponential-like decay of \(|\Delta F|\) becomes more rapid than \(\sim1/t\).
The initial stage is a type of “burn-in” procedure designed to improve the robustness of the method. The idea is to first get a quick and rough free energy estimate in the initial stage, followed by refinement in the final stage.
The reaction coordinate trajectory¶
The trajectory of the pull coordinate is found in pullx.xvg
.
%less data/pullx.xvg
For plotting and data analysis we provide a Python function to read the
.xvg
file format into this notebook.
import numpy as np
import matplotlib.pyplot as plt
# Function that reads data from an .xvg file
def read_xvg(fname):
data=[]
with open(fname) as f:
for line in f:
# Lines with metadata or comments start with #, @
if not line.startswith(("@","#")):
data.append(np.array([float(s) for s in line.split()]))
data = np.vstack(data)
return data
We can for instance plot the covering times.
# Extract substrings using regular expressions
import re
# Read pullx.xvg file and plot a selected part of the trajectory
rundir='data/'
data = read_xvg(rundir + "/pullx.xvg")
t, x = data[:,0], data[:,1]
fig, ax=plt.subplots(figsize=(16,10))
plt.plot(t, x, lw=1)
# Set y-axis limits based on the chosen sampling interval
awh_start, awh_end = 0.25, 0.65
plt.ylim([awh_start, awh_end])
ymin, ymax = plt.gca().get_ylim()
# Find covering times plot them
covering_logs = !grep covering {rundir}/md.log
covering_times = [float(re.findall('t = (.+?) ps', log)[0]) for log in covering_logs]
for tcover, log in zip(covering_times, covering_logs):
plt.plot([tcover]*2,[ymin,ymax], color='k', lw=2, ls='-')
# Also mark the exit time if there was one
exit_logs = !grep 'out of the' {rundir}/md.log
if len(exit_logs) > 0:
texit = float(re.findall('t = (.+?)\.', exit_logs[0])[0])
plt.plot([texit],0.5*(ymin + ymax), color='k', marker='x', ms=10)
# Tweak more
tend=tcover*2
plt.xlim([0,min(tend,max(t))])
plt.xlabel("time (ps)")
plt.ylabel("distance (nm)");
plt.title('RC trajectory');
Here, the black vertical lines indicate covering times and the
\(\times\) marks the exit time, extracted from md.log
. Verify
that these coincide roughly with the transition times seen in the
trajectory. What we expect to see here is that the covering times are
shorter initially, when the free energy estimate and bias fluctuates
more, and then increase and reach a constant average value as the bias
stops changing. You should always check this trajectory and make sure
that there are at least a few transitions after the initial stage. If
not, you need to simulate longer or you have a poorly chosen RC that is
not guiding the transitions efficiently.
The gmx awh
tool and the awh.xvg
files¶
Data directly related to AWH is extracted using gmx awh
. The data is
stored in the energy file, ener.edr
. The default output file name is
of the format awh_t<time>.xvg
; there will be one file for each AWH
output time, or less if you use the -skip
flag. Let’s dump the
output into a separate directory.
!ls awh-data
!mkdir -p awh-data
!gmx awh -quiet -s topol.tpr -more -f data/ener.edr -kt -o awh-data/awh.xvg
!ls awh-data
Above, the -more
flag tells gmx awh
to extract more than the PMF
from the AWH files, e.g. histograms. The default gmx
unit of energy
is kJ/mol, but with the flag -kT
we get units of kT.
Let’s have a quick look at the contents of one of these files (from time
100 ns). You can use xmgrace
on your local machine to plot the file,
by removing the comment character (#) in the following cell to do that.
That will provide a detailed legend that helps understanding what data
is being plotted.
t=100000 # ps
# -nxy to plot multicolumn data
awh_data = read_xvg("awh-data/awh_t{}.xvg".format(t)).T
fig,ax = plt.subplots()
ax.plot(awh_data[0],awh_data[1])
ax.set_xlabel("RC(nm)")
ax.set_ylabel("PMF(kT)")
# uncomment the line below if you have xmgrace installed
# !xmgrace -free -nxy awh-data/awh_t{t}.xvg # wait for it...
Here you can have a look at the raw-data:
%less awh-data/awh_t100000.xvg
Scrolling down to @ slegend ...
gives you more details. We list the
entries in these files in more detail here:
column |
title |
description |
---|---|---|
1 |
PMF |
the free energy estimate along the RC, \(\xi(x)\) |
2 |
Coord bias |
the bias acting on \(\xi\). Note: \(\xi\) is attracted to large values of this bias function. The bias potential has opposite sign |
3 |
Coord distr |
histogram of \(\xi\) |
4 |
Ref value distr |
probability weight histogram of the reference coordinate value, \(\lambda\) |
5 |
Target ref value distr |
the target distribution for
\(\lambda\), uniform by
default (set by .mdp parameter
|
6 |
Friction metric |
a metric for local friction, or 1/diffusion (here normalized and unitless). Can pinpoint hard-to-sample regions |
Inspecting the results in more detail¶
fig,ax = plt.subplots(3,2, figsize=(16,10), sharex=True)
ax[0,0].plot(awh_data[0],awh_data[1])
ax[0,0].set_xlabel("RC(nm)")
ax[0,0].set_ylabel("PMF(kT)")
ax[0,1].plot(awh_data[0],awh_data[2])
ax[0,1].set_xlabel("RC(nm)")
ax[0,1].set_ylabel("Coord bias")
ax[1,0].plot(awh_data[0],awh_data[3])
ax[1,0].set_xlabel("RC(nm)")
ax[1,0].set_ylabel("Coord distr")
ax[1,1].plot(awh_data[0],awh_data[4])
ax[1,1].set_xlabel("RC(nm)")
ax[1,1].set_ylabel("Ref distr")
ax[2,0].plot(awh_data[0],awh_data[5])
ax[2,0].set_xlabel("RC(nm)")
ax[2,0].set_ylabel("Target ref distr")
ax[2,1].plot(awh_data[0],awh_data[6])
ax[2,1].set_xlabel("RC(nm)")
ax[2,1].set_ylabel("Friction")
Sanity checks for the distributions¶
(zoom in to see them better) * Coord distr \(\approx\) Ref value
distr, or the force constant (.mdp parameter
awh1-dim1-force-constant
), which couples \(\lambda\) to
\(\xi\), may be too weak.
Is this true everywhere in our case? E.g., for the smallest distances, where atoms are repelling each other and the PMF is very steep, it doesn’t hold entirely! In general, such end effects can be expected and are not critical. Compare also the PMF and the bias function at the ends!
Ref value distr \(\approx\) Target ref value distr, i.e. that what we sampled is close to the distribution that we targeted. If not true, we may have a poor bias estimate/need to sample longer.
Evolution of the PMF estimate¶
Now we look closer at the time-evolution of the PMF (you can do the same for either variable in the awh.xvg file).
# Plot PMFs from AWH xvg files
# A list of all AWH files
fnames = !ls awh-data/awh_t*xvg
# Extract time of each file
times = [float(re.findall('awh_t(.+?).xvg', fname)[0]) for fname in fnames]
# Sort the files chronologically
fnames = [f for (t,f) in sorted(zip(times, fnames))]
times.sort()
print("Number of files:", len(fnames))
print("Time in between files ", times[1] - times[0], 'ps')
print("Last time", times[-1], 'ps')
# Plot the PMF from first files/times
labels=[]
istart = 0 # Start plotting from this file index
nplot = 10 # Number of files to plot
for fname in fnames[istart:istart+nplot]:
data=read_xvg(fname)
labels.append(re.findall('awh_t(.+?).xvg', fname)[0] + ' ps') # use the time as label
plt.plot(data[:,0], data[:,1])
plt.xlabel('distance (nm)')
plt.ylabel('PMF (kT)')
plt.xlim([0.25,0.60])
plt.ylim([0,20])
plt.legend(labels);
Try increasing the istart
variable above to see how the PMF
estimates are changing (less and less) over time. Note: this
convergence does not easily translate into error estimates for the PMF.
To get such error bars the simplest is to run multiple (independent) AWH
simulations and calculate standard errors from that.
Congratulations! You are now ready to run an AWH simulation in GROMACS! Let’s go back to the directory where we started and continue with a somewhat more advanced setup.
Bias sharing walkers¶
A simple way of parallelizing sampling with AWH is to launch multiple
simultaneous trajectories, so called “walkers”, and have them share a
common bias and free energy estimate. This can reduce the length of the
initial stage significantly. Running gmx mdrun -multidir
with our
previous .tpr files would launch multiple independent simulations. To
make the AWH biases in the simulations aware of each other, and share
information, we need to add a couple of things in the .mdp file.
%cd ..
%cd awh-multi
!ls -l
!diff ../awh-simple/awh.mdp awh-multi.mdp
Here, awh-share-multisim
is just a switch to turn on the multisim
sharing. What is awh1-share-group
? Each bias that shares its
information with another bias needs to belong to a so called
share-group
, indexed by \(1, 2,\ldots\). The reason is to remove
the ambiguity arising when are multiple AWH biases present within one
simulation (e.g. one bias per monomer of a multimeric protein). Since we
have only one bias here, setting this parameter is trivial. Finally, we
switched on awh1-equilibrate-histgram
. This will add an
“equilibration” stage before the initial stage making sure that the
sampled distribution approximately equals the target distribution before
attempting to decrease the update size. We’ll see this in action soon.
From a user-perspective, the main difference now is that the input and
output has been multiplied by the number of participating walkers. This
is true except for the main AWH output, which is shared!. Thus, you only
need to run gmx awh
on one of the ener.edr
files, they all
contain the same information. The friction metric is the exception here,
which currently is not shared for implementation reasons.
Let’s looks closer at the data we have been given
!ls -l data
and the contents of these run directories
!ls data/?
To figure out how this data was generated do
!grep Command -A 1 data/0/md.log
Note that for -multidir
you need a GROMACS build supporting MPI
communication, gmx_mpi
. The option -v means verbose, avoid to use
this option when you run on HPC cluster.
The initial stage for multiple walkers¶
We can repeat more or less the same analysis as we did in the single-walker case
#`-e` allows us to specify multiple search patterns.
!grep -e 'equilibrated' -e 'covering' -e 'out of the' data/0/md.log;
Verify that the covering times are shorter than in the single-walker
case above. Thus, the free energy update size decreases more rapidly,
This is because the sampling interval is now covered cooperatively by
all the walkers. Here a covering means that each point in the interval
awh1-dim1-start
to awh1-dim1-end
has been sampled (with
sufficient probability) by one of the walkers. We can visualize what’s
going on by looking at the RC trajectories.
# Extract substrings using regular expressions
import re
nwalkers=2
rundirs=["data/" + str(i) for i in range(nwalkers)]
# Data per walker
for rund in rundirs:
# Read pullx.xvg file and plot a selected part of the trajectory
data = read_xvg(rund + "/pullx.xvg")
t, x = data[:,0], data[:,1]
plt.plot(t, x, lw=0.5)
# Set y-axis limits based on the chosen sampling interval
awh_start, awh_end = 0.25, 0.65
plt.ylim([awh_start, awh_end])
ymin, ymax = plt.gca().get_ylim()
# Find covering times and plot them (note that these are shared for the walkers)
covering_logs = !grep -e covered -e covering {rundirs[0]}/md.log
covering_times = [float(re.findall('t = (.+?) ps', log)[0]) for log in covering_logs]
for tcover, log in zip(covering_times, covering_logs):
if 'Decreased the update size' in log:
ls='-'
else:
# If the histogram equilibration was required (by mdp option 'awh1-equilibrate-histogram = true' )
# then the covering will not take effect, i.e. update size will not decrease until the sampling interval
# has been sampled uniformly "enough".
ls='--'
plt.plot([tcover]*2,[ymin,ymax], color='k', lw=2, ls=ls)
# Also mark the exit time if there was one
exit_logs = !grep 'out of the' {rundirs[0]}/md.log
if len(exit_logs) > 0:
texit = float(re.findall('t = (.+?)\.', exit_logs[0])[0])
plt.plot([texit],0.5*(ymin + ymax), color='k', marker='x', ms=10)
# Tweak more
tend=tcover*2
plt.xlim([0,min(tend,max(t))])
plt.xlabel("time (ps)")
plt.ylabel("distance (nm)");
plt.title('RC trajectory');
The dashed line denotes the first actual covering. Leading question: is
the histogram still biased by the initial conditions at this time? Since
we had set awh1-equilibrate-histogram = true
in the .mdp file the
detection of covering won’t take effect until the accumulating shared
histogram has become uniform enough, which happens at the first solid
line. Pre-equilibrating the histogram this way is nothing but a
heuristic to make the walkers spread out initially and loose the
“memory” of their initial configuration before attempting to converge
the free energy!
Verify also that now a “covering” is not equal to a full end-to-end transition as it was in the single-walker case. Each walker just has to transition across part of the full sampling interval. That is why the initial stage becomes shorter.
To summarize: running multiple walkers has the potential of improving the intitial convergence. But remember, to obtain statistical error estimates you would generally generate multiple multiwalker simulation, so don’t burn all the compute time at once!