.. code:: ipython3

    # Execute this first 
    #
    #  * trigger notebook styling
    #  * check if notebook had been modified since its distribution
    # 
    # Note: executing any cells before this modifies the notebook.
    # 
    %run src/init_notebooks.py
    hide_toggle()
    check_notebook()

Calculating PMFs with AWH in GROMACS
====================================

::

   Author:   Viveca Lindahl (reviewed/updated by Alessandra Villa and Christian Blau)
   Goal:     Learn how to calculate potential of mean forces with accelerated weight histogram (AWH) method in GROMACS
   Prerequisites: Know how to run an MD simulation, basis knowledge of using a Linux shell, basis knowledge of thermodynamics and awareness of accelerated weight histogram method (see reference below).
   Time:      80 minutes
   Software:  GROMACS 2022 (2021, 2020, 2019, 2018), python modules: numpy, matplotlib, re, nglviewer (https://nglviewer.org/) 
   Optional Software: visualization software [VMD](https://www.ks.uiuc.edu/Research/vmd), Xmgrace plotting tool

Here we learn how to calculate the potential of mean force (PMF) along a
reaction coordinate (RC) using the accelerated weight histogram method
(AWH) in GROMACS. We will go through both how to set up the input files
needed, as well as how to extract and analyze the output after having
run the simulation.

AWH applies a time-dependent bias potential along the chosen RC, which
is tuned during the simulation such that it “flattens” the barriers of
the PMF to improve sampling along the RC. With better sampling, the PMF
can be calculated more accurately than using unbiased MD.

References
----------

For more details about the AWH method and how it can be applied we refer
to \* The GROMACS reference manual section about AWH, available on `the
GROMACS documentation
website <http://manual.gromacs.org/current/reference-manual/special/awh.html>`__
\* BioExcel Webinar: `Accelerating sampling in GROMACS with the AWH
method <https://bioexcel.eu/webinar-accelerating-sampling-in-gromacs-with-the-awh-method-2019-09-24/>`__
\* Lindahl, V., Lidmar, J. and Hess, B., 2014. Accelerated weight
histogram method for exploring free energy landscapes. *The Journal of
chemical physics*, 141(4), p.044110.
`doi:10.1063/1.4890371 <https://doi.org/10.1063/1.4890371>`__ \*
Lindahl, V., Gourdon, P., Andersson, M. and Hess, B., 2018. Permeability
and ammonia selectivity in aquaporin TIP2;1: linking structure to
function. *Scientific reports*, 8(1), p.2995.
`doi:10.1038/s41598-018-21357-2 <https://doi.org/10.1038/s41598-018-21357-2>`__
\* Lindahl, V., Villa, A. and Hess, B., 2017. Sequence dependency of
canonical base pair opening in the DNA double helix. *PLoS computational
biology*, 13(4), p.e1005463.
`doi:10.1371/journal.pcbi.1005463 <https://doi.org/10.1371/journal.pcbi.1005463>`__

The case study: DNA base pair opening
-------------------------------------

We will calculate a PMF for opening a DNA base pair. The DNA double
helix is a very stable structure. Opening a base pair requires breaking
hydrogen bonds between the bases and crossing a high free energy
barrier. That’s why we need to enhance the sampling by applying a bias!
As our reaction coordinate we use the distance between the two atoms
forming the central hydrogen-bond the two bases in a pair.

Let’s have a look at the system and the reaction coordinate. We have
hidden all the water to better see the DNA itself, and we have put
punchy colors on the atoms defining the RC of our target base pair. Now
run this and the DNA should pop up:

.. code:: ipython3

    import nglview as ng
    view = ng.show_file("visualization/dna-centered.pdb", default=False)
    view.center(selection="DA or DT")
    view.add_representation("cartoon")
    view.add_representation("base", selection="not DA and not DT")
    view.add_representation("ball+stick",selection="DA")
    view.add_representation("ball+stick",selection="DT")
    view.add_representation("ball+stick",selection="DA and .N1", color="green")
    view.add_representation("ball+stick",selection="DT and .N3", color="green")
    view
    # click and drag to rotate, zoom with your mouseweel 
    # for more infor on this viewer have a look at https://github.com/nglviewer/nglview

Rotate the structure and look for the two (nitrogen) atoms in green. The
distance between these will serve as our reaction coordinate for that
base pair.

In alternative, you can use VMD to visualize the reaction coordinate on
your local machine. The ``-e`` flag below tells VMD to excute the
commands that are in the following .tcl script. These commands change
how the system is visually represented. For instance, we have hidden all
the waters to better see the DNA itself, and we have put punchy colors
on the atoms defining the RC of our target base pair. To run VMD from
within this notebook, remove the comment character (#) in the following
cell and VMD should pop up:

.. code:: ipython3

    #!vmd visualization/dna-centered.gro -e visualization/representation.tcl

Rotate the structure and look for the two (nitrogen) atoms in green. The
distance between these will serve as our RC for that base pair. Quit VMD
to continue.

The MD parameter file
---------------------

We’ll now assume we have already built and equilibrated the system, so
we are almost ready to go. To use AWH we basically just need to add some
extra parameters in the .mdp file. Go to and check out the directory
that has all the run files of our first AWH example:

.. code:: ipython3

    %cd awh-simple
    !ls -l

Find the differences between the mdp file using AWH and an mdp file for
a vanilla MD simulation:

.. code:: ipython3

    !diff no-awh.mdp awh.mdp

Here ‘<’ refers to the content of the first argument to ``diff`` (the
non-AWH .mdp file) and ‘>’ to the second (the AWH .mdp file). E.g., we
increased the number of steps (``nsteps``) for AWH. The more relevant
parameters are the ones prefixed ``pull`` and ``awh``. The awh-``.mdp``
options are documented at
https://manual.gromacs.org/current/user-guide/mdp-options.html#mdp-awh.

The index file
--------------

The .mdp file now depends on some definitions of atom groups,
``base_N1orN3`` and ``partner_N1orN3``. We need to make an index (.ndx)
file defining these and provide as input when generating a .tpr file
with ``gmx grompp``. Here our groups are as simple as they get: each
group contains a single nitrogen atom (with names ‘N1’ or ‘N3’) . But
don’t get tempted to edit an index file manually! The traditional tool
to use is ``gmx make_ndx`` and a more general and powerful tool is
``gmx select``. We focus on AWH here and have provided the index file.
Double-check that the pull groups (e.g. ``pull-group1-name``) from the
.mdp file are actually defined in the index file:

.. code:: ipython3

    !grep -A 1  N1orN3  index.ndx  # '-A 1' to show also 1 line after the match 

One atom per group looks right. In a real study, a better check could be
to visualize the atom indices (e.g. with VMD).

Starting the simulation
-----------------------

Here we won’t run the actual simulation; a directory with data has
already been provided. But for practice, we will generate a tpr, as
usual with ``grompp``. The flag ``-quiet`` tells ``grompp`` to be less
verbose.

.. code:: ipython3

    !gmx grompp -quiet -f awh.mdp -n index.ndx -p topol.top -o topol.tpr

Note that the values of the pull coordinate values in the provided .gro
file are printed by ``grompp``. Does it look reasonable?

Check the log file
------------------

Now let’s look at the data we’ve provided. Some information related to
the AWH initial convergence can be found in the ``mdrun`` log file.

.. code:: ipython3

    !grep 'covering' data/md.log
    !grep 'out of the' data/md.log

Here, a “covering” refers to an end-to-end transition across the
sampling interval [``awh1-dim1-start`` , ``awh1-dim1-end``]. Note that
the first transitions are faster than the later ones. Why is that?

The AWH and initial stage and convergence
-----------------------------------------

The overall update size of the free energy estimate, here denoted by
:math:`|\Delta F|`, is a free parameter in AWH. It is very important
because converging the free energy relies on :math:`|\Delta F|\to 0` for
large times :math:`t`. Also, the time-dependent bias function is a
direct function of the current free energy estimate. Thus, initially
:math:`|\Delta F|` has to be large enough that the bias fluctuations
efficiently promote escaping free energy minima.

In the AWH *initial stage*, :math:`|\Delta F|`, is kept constant
*during* each transition across the sampling interval. After the
interval has been covered, the update size is scaled down by a constant
factor. This leads to a stepwise decrease of the update size in the
initial stage, which approaches an exponential decay
:math:`|\Delta F|\sim 1/e^{\alpha t}`. After exiting this stage, the
update size instead decreases steadily with time,
:math:`|\Delta F| \sim 1/t`. The latter has proven to be the right
long-term time evolution for the free energy to converge, but if applied
from the start it generally decreases the update size too quickly. The
exit time is determined dynamically by AWH. The exit of the initial
stage should occur before the initial exponential-like decay of
:math:`|\Delta F|` becomes more rapid than :math:`\sim1/t`.

The initial stage is a type of “burn-in” procedure designed to improve
the robustness of the method. The idea is to first get a quick and rough
free energy estimate in the initial stage, followed by refinement in the
final stage.

The reaction coordinate trajectory
----------------------------------

The trajectory of the pull coordinate is found in ``pullx.xvg``.

.. code:: ipython3

    %less data/pullx.xvg

For plotting and data analysis we provide a Python function to read the
``.xvg`` file format into this notebook.

.. code:: ipython3

    import numpy as np
    import matplotlib.pyplot as plt
    
    # Function that reads data from an .xvg file
    def read_xvg(fname):
        data=[]
        with open(fname) as f: 
            for line in f:
                # Lines with metadata or comments start with #, @
                if not line.startswith(("@","#")):
                    data.append(np.array([float(s) for s in line.split()]))
        data = np.vstack(data)
        return data

We can for instance plot the covering times.

.. code:: ipython3

    # Extract substrings using regular expressions
    import re
    
    # Read pullx.xvg file and plot a selected part of the trajectory
    rundir='data/'
    data = read_xvg(rundir + "/pullx.xvg")
    t, x = data[:,0], data[:,1]
    
    fig, ax=plt.subplots(figsize=(16,10))
    plt.plot(t, x, lw=1)
    
    # Set y-axis limits based on the chosen sampling interval 
    awh_start, awh_end = 0.25, 0.65    
    plt.ylim([awh_start, awh_end])
    ymin, ymax = plt.gca().get_ylim()
    
    # Find covering times plot them
    covering_logs = !grep covering {rundir}/md.log
    covering_times = [float(re.findall('t = (.+?) ps', log)[0]) for  log in covering_logs]
    
    for tcover, log in zip(covering_times, covering_logs):
        plt.plot([tcover]*2,[ymin,ymax], color='k', lw=2, ls='-')
    
    # Also mark the exit time if there was one
    exit_logs = !grep 'out of the' {rundir}/md.log
    if len(exit_logs) > 0:
        texit = float(re.findall('t = (.+?)\.', exit_logs[0])[0])
        plt.plot([texit],0.5*(ymin + ymax), color='k', marker='x', ms=10)
    
    # Tweak more
    tend=tcover*2
    plt.xlim([0,min(tend,max(t))])
    plt.xlabel("time (ps)")
    plt.ylabel("distance (nm)");
    plt.title('RC trajectory');

Here, the black vertical lines indicate covering times and the
:math:`\times` marks the exit time, extracted from ``md.log``. Verify
that these coincide roughly with the transition times seen in the
trajectory. What we expect to see here is that the covering times are
shorter initially, when the free energy estimate and bias fluctuates
more, and then increase and reach a constant average value as the bias
stops changing. You should always check this trajectory and make sure
that there are at least a few transitions after the initial stage. If
not, you need to simulate longer or you have a poorly chosen RC that is
not guiding the transitions efficiently.

The ``gmx awh`` tool and the ``awh.xvg`` files
----------------------------------------------

Data directly related to AWH is extracted using ``gmx awh``. The data is
stored in the energy file, ``ener.edr``. The default output file name is
of the format ``awh_t<time>.xvg``; there will be one file for each AWH
output time, or less if you use the ``-skip`` flag. Let’s dump the
output into a separate directory.

.. code:: ipython3

    !ls awh-data

.. code:: ipython3

    !mkdir -p awh-data
    !gmx awh -quiet -s topol.tpr -more -f data/ener.edr -kt -o awh-data/awh.xvg

.. code:: ipython3

    !ls awh-data

Above, the ``-more`` flag tells ``gmx awh`` to extract more than the PMF
from the AWH files, e.g. histograms. The default ``gmx`` unit of energy
is kJ/mol, but with the flag ``-kT`` we get units of kT.

Let’s have a quick look at the contents of one of these files (from time
100 ns). You can use ``xmgrace`` on your local machine to plot the file,
by removing the comment character (#) in the following cell to do that.
That will provide a detailed legend that helps understanding what data
is being plotted.

.. code:: ipython3

    t=100000 # ps
    # -nxy to plot multicolumn data
    awh_data = read_xvg("awh-data/awh_t{}.xvg".format(t)).T
    
    fig,ax = plt.subplots()
    ax.plot(awh_data[0],awh_data[1])
    ax.set_xlabel("RC(nm)")
    ax.set_ylabel("PMF(kT)")
    
    # uncomment the line below if you have xmgrace installed
    # !xmgrace -free -nxy awh-data/awh_t{t}.xvg # wait for it... 

Here you can have a look at the raw-data:

.. code:: ipython3

    %less awh-data/awh_t100000.xvg

Scrolling down to ``@ slegend ...`` gives you more details. We list the
entries in these files in more detail here:

+------------------+---------------+-----------------------------------+
| column           | title         | description                       |
+==================+===============+===================================+
| 1                | PMF           | the free energy estimate along    |
|                  |               | the RC, :math:`\xi(x)`            |
+------------------+---------------+-----------------------------------+
| 2                | Coord bias    | the bias acting on :math:`\xi`.   |
|                  |               | Note: :math:`\xi` is *attracted*  |
|                  |               | to large values of this bias      |
|                  |               | *function*. The bias *potential*  |
|                  |               | has opposite sign                 |
+------------------+---------------+-----------------------------------+
| 3                | Coord distr   | histogram of :math:`\xi`          |
+------------------+---------------+-----------------------------------+
| 4                | Ref value     | probability weight histogram of   |
|                  | distr         | the reference coordinate value,   |
|                  |               | :math:`\lambda`                   |
+------------------+---------------+-----------------------------------+
| 5                | Target ref    | the target distribution for       |
|                  | value distr   | :math:`\lambda`, uniform by       |
|                  |               | default (set by .mdp parameter    |
|                  |               | ``awh1-target``)                  |
+------------------+---------------+-----------------------------------+
| 6                | Friction      | a metric for local friction, or   |
|                  | metric        | 1/diffusion (here normalized and  |
|                  |               | unitless). Can pinpoint           |
|                  |               | hard-to-sample regions            |
+------------------+---------------+-----------------------------------+

Inspecting the results in more detail
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: ipython3

    fig,ax = plt.subplots(3,2, figsize=(16,10), sharex=True)
    ax[0,0].plot(awh_data[0],awh_data[1])
    ax[0,0].set_xlabel("RC(nm)")
    ax[0,0].set_ylabel("PMF(kT)")
    
    ax[0,1].plot(awh_data[0],awh_data[2])
    ax[0,1].set_xlabel("RC(nm)")
    ax[0,1].set_ylabel("Coord bias")
    ax[1,0].plot(awh_data[0],awh_data[3])
    ax[1,0].set_xlabel("RC(nm)")
    ax[1,0].set_ylabel("Coord distr")
    ax[1,1].plot(awh_data[0],awh_data[4])
    ax[1,1].set_xlabel("RC(nm)")
    ax[1,1].set_ylabel("Ref distr")
    
    ax[2,0].plot(awh_data[0],awh_data[5])
    ax[2,0].set_xlabel("RC(nm)")
    ax[2,0].set_ylabel("Target ref distr")
    ax[2,1].plot(awh_data[0],awh_data[6])
    ax[2,1].set_xlabel("RC(nm)")
    ax[2,1].set_ylabel("Friction")

Sanity checks for the distributions
-----------------------------------

(zoom in to see them better) \* *Coord distr* :math:`\approx` *Ref value
distr*, or the force constant (.mdp parameter
``awh1-dim1-force-constant``), which couples :math:`\lambda` to
:math:`\xi`, may be too weak.

Is this true everywhere in our case? E.g., for the smallest distances,
where atoms are repelling each other and the PMF is very steep, it
doesn’t hold entirely! In general, such end effects can be expected and
are not critical. Compare also the PMF and the bias function at the
ends!

-  *Ref value distr* :math:`\approx` *Target ref value distr*, i.e. that
   what we sampled is close to the distribution that we targeted. If not
   true, we may have a poor bias estimate/need to sample longer.

Evolution of the PMF estimate
-----------------------------

Now we look closer at the time-evolution of the PMF (you can do the same
for either variable in the awh.xvg file).

.. code:: ipython3

    # Plot PMFs from AWH xvg files
    
    # A list of all AWH files
    fnames = !ls awh-data/awh_t*xvg
    
    # Extract time of each file
    times = [float(re.findall('awh_t(.+?).xvg', fname)[0]) for fname in fnames]
    
    # Sort the files chronologically
    fnames = [f for (t,f) in sorted(zip(times, fnames))]
    times.sort()
    print("Number of files:", len(fnames))
    print("Time in between files ", times[1] - times[0], 'ps')
    print("Last time", times[-1], 'ps')
    
    # Plot the PMF from first files/times
    labels=[]
    istart = 0 # Start plotting from this file index
    nplot = 10  # Number of files to plot
    for fname in fnames[istart:istart+nplot]:
        data=read_xvg(fname)
        labels.append(re.findall('awh_t(.+?).xvg', fname)[0] + ' ps') # use the time as label
        plt.plot(data[:,0], data[:,1])
    plt.xlabel('distance (nm)')
    plt.ylabel('PMF (kT)')
    plt.xlim([0.25,0.60])
    plt.ylim([0,20])
    plt.legend(labels);

Try increasing the ``istart`` variable above to see how the PMF
estimates are changing (less and less) over time. *Note*: this
convergence does not easily translate into error estimates for the PMF.
To get such error bars the simplest is to run multiple (independent) AWH
simulations and calculate standard errors from that.

Congratulations! You are now ready to run an AWH simulation in GROMACS!
Let’s go back to the directory where we started and continue with a
somewhat more advanced setup.

Bias sharing walkers
--------------------

A simple way of parallelizing sampling with AWH is to launch multiple
simultaneous trajectories, so called “walkers”, and have them share a
common bias and free energy estimate. This can reduce the length of the
initial stage significantly. Running ``gmx mdrun -multidir`` with our
previous .tpr files would launch multiple *independent* simulations. To
make the AWH biases in the simulations aware of each other, and share
information, we need to add a couple of things in the .mdp file.

.. code:: ipython3

    %cd ..
    %cd awh-multi
    !ls -l

.. code:: ipython3

    !diff  ../awh-simple/awh.mdp awh-multi.mdp

Here, ``awh-share-multisim`` is just a switch to turn on the multisim
sharing. What is ``awh1-share-group``? Each bias that shares its
information with another bias needs to belong to a so called
``share-group``, indexed by :math:`1, 2,\ldots`. The reason is to remove
the ambiguity arising when are multiple AWH biases present within one
simulation (e.g. one bias per monomer of a multimeric protein). Since we
have only one bias here, setting this parameter is trivial. Finally, we
switched on ``awh1-equilibrate-histgram``. This will add an
“equilibration” stage before the initial stage making sure that the
sampled distribution approximately equals the target distribution before
attempting to decrease the update size. We’ll see this in action soon.

From a user-perspective, the main difference now is that the input and
output has been multiplied by the number of participating walkers. This
is true except for the main AWH output, which is shared!. Thus, you only
need to run ``gmx awh`` on one of the ``ener.edr`` files, they all
contain the same information. The friction metric is the exception here,
which currently is not shared for implementation reasons.

Let’s looks closer at the data we have been given

.. code:: ipython3

    !ls -l data

and the contents of these run directories

.. code:: ipython3

    !ls data/?

To figure out how this data was generated do

.. code:: ipython3

    !grep Command -A 1 data/0/md.log

Note that for ``-multidir`` you need a GROMACS build supporting MPI
communication, ``gmx_mpi``. The option -v means verbose, avoid to use
this option when you run on HPC cluster.

The initial stage for multiple walkers
--------------------------------------

We can repeat more or less the same analysis as we did in the
single-walker case

.. code:: ipython3

    #`-e` allows us to specify multiple search patterns. 
    !grep -e 'equilibrated' -e 'covering' -e 'out of the' data/0/md.log;

Verify that the covering times are shorter than in the single-walker
case above. Thus, the free energy update size decreases more rapidly,
This is because the sampling interval is now covered *cooperatively* by
all the walkers. Here a covering means that each point in the interval
``awh1-dim1-start`` to ``awh1-dim1-end`` has been sampled (with
sufficient probability) by one of the walkers. We can visualize what’s
going on by looking at the RC trajectories.

.. code:: ipython3

    # Extract substrings using regular expressions
    import re
    
    nwalkers=2
    rundirs=["data/" + str(i) for i in range(nwalkers)]
    
    # Data per walker
    for rund in rundirs:
        # Read pullx.xvg file and plot a selected part of the trajectory
        data = read_xvg(rund + "/pullx.xvg")
        t, x = data[:,0], data[:,1]
        plt.plot(t, x, lw=0.5)
    
    # Set y-axis limits based on the chosen sampling interval 
    awh_start, awh_end = 0.25, 0.65    
    plt.ylim([awh_start, awh_end])
    ymin, ymax = plt.gca().get_ylim()
    
    # Find covering times and plot them (note that these are shared for the walkers)
    covering_logs = !grep -e covered -e covering {rundirs[0]}/md.log
    covering_times = [float(re.findall('t = (.+?) ps', log)[0]) for  log in covering_logs]
    
    for tcover, log in zip(covering_times, covering_logs):
        if 'Decreased the update size'  in log:
            ls='-'
        else:
            # If the histogram equilibration was required (by mdp option 'awh1-equilibrate-histogram = true' )
            # then the covering will not take effect, i.e. update size will not decrease until the sampling interval 
            # has been sampled uniformly "enough".
            ls='--'
                    
        plt.plot([tcover]*2,[ymin,ymax], color='k', lw=2, ls=ls)
    
    # Also mark the exit time if there was one
    exit_logs = !grep 'out of the' {rundirs[0]}/md.log
    if len(exit_logs) > 0:
        texit = float(re.findall('t = (.+?)\.', exit_logs[0])[0])
        plt.plot([texit],0.5*(ymin + ymax), color='k', marker='x', ms=10)
    
    # Tweak more
    tend=tcover*2
    plt.xlim([0,min(tend,max(t))])
    plt.xlabel("time (ps)")
    plt.ylabel("distance (nm)");
    plt.title('RC trajectory');

The dashed line denotes the first actual covering. Leading question: is
the histogram still biased by the initial conditions at this time? Since
we had set ``awh1-equilibrate-histogram = true`` in the .mdp file the
detection of covering won’t take effect until the accumulating shared
histogram has become uniform enough, which happens at the first solid
line. Pre-equilibrating the histogram this way is nothing but a
heuristic to make the walkers spread out initially and loose the
“memory” of their initial configuration before attempting to converge
the free energy!

Verify also that now a “covering” is not equal to a full end-to-end
transition as it was in the single-walker case. Each walker just has to
transition across part of the full sampling interval. That is why the
initial stage becomes shorter.

To summarize: running multiple walkers has the potential of improving
the intitial convergence. But remember, to obtain statistical error
estimates you would generally generate *multiple* multiwalker
simulation, so don’t burn all the compute time at once!