%run src/init_notebooks.py

# Solvation free energy of ethanol using AWH method ethanol-water.png

Author:   Alessandra Villa
Goal:     Learn how to run a solvation free energy computations for a small solute using alchemical transformation and accelerated weighted histogram (AWH) method
Prerequisites: Know how to run an MD simulation, basis knowledge of thermodynamics
Time:      90 minutes
Software:  GROMACS 2022
Version:   1.0

## Background Change in free energy is a very important physical quantity that determines the direction of many biological processes. For instance, the solubility of a molecule in a solvent is determined by the solvation free energy - a change in free energy when taking a solute molecule from vacuum into the solvent. Binding free energy is very important when investigating the binding affinity of ligands (or drugs) to a specific binding site on the receptor. In this tutorial we will learn how to use accelerated weighted histogram (AWH) to evaluate difference in free energy of two states - alchemical free energy evaluation. To know about more the method, watch the webinar: Applying the Accelerated Weight Histogram method to alchemical transformations.

For simplicity, we will focus on the free energy of solvation for a small molecule (ethanol), but similar procedures are used to calculate binding free energy. We will start with some background on how the free energies can be calculated in molecular dynamics simulations, and how the evaluation of free energy of solvation is one of the steps, necessary to calculate free energy of binding. Then, we will focus on the practicalities of doing such a calculation in GROMACS.

Calculating free energy

Assume that we need to compute the difference in free energy of two states \(A\) and \(B\). Let us also assume that \(A\) and \(B\) are well separated in the phase space and direct comparison between them is problematic. The trick then is to construct a set of intermediate states that are close to one another in the phase space. The proximity of the states allows one to compare them directly, computing the free energy difference between two consequent intemediate states. Then, going from one state to another, one can complete the full path between the states \(A\) and \(B\) and the sum of the differences between consequentive intermediate states will give the difference in free energies between \(A\) and \(B\).

For example, to calculate the binding free energy of the ligand to a receptor, we ultimately need to compare the state, where the ligand is bound to the receptor, to the state where both the ligand and the receptor are separated in solution:



This can be calculated directly, for example by slowly dragging the ligand away from the receptor and computing the dependence of average force between them. If the system will have enough time to explore all possible states while the ligand is pulled away, the integral of the force over the distance will give the free energy profile (usually called potential of mean force). However, the simulation times needed for convergence of this method are usually very long and when convergence is not reached, the fluctuations in forces can be very large. This makes the potential of mean force methods much more expensive than using free energy perturbation or alchemical free energy computation methods such as the Bennett Acceptance Ratio (BAR) or accelerated weighted histogram (AWH).

Remember that a free energy difference between two states \(A\) and \(B\) (\(F_B - F_A\)) determines their relative probability \(p_A\) and \(p_B\),

\[\frac{p_A}{p_B} = exp \frac{F_B - F_A}{k_BT}\]

where \(k_B\) is Boltzmann’s constant that relates thermal energy to the temperature (\(k_B = 1.38·10^{-23}\) J/K), and \(T\) is the temperature. We could, in principle, calculate a free energy difference by observing the system long enough, and measuring how often the system is in which state (i.e. measuring probabilities \(p_A\) and \(p_B\)). The free energy differences, however, are often of the order of tens of kJ/mol. For example, the free energy of solvation of ethanol at 298K is -20.1 kJ/mol, which is equivalent to -8.1 \(k_BT\) or to a relative probability of being in one of the states of \(3·10^{-4}\). We would need to wait a long time for that transition to occur spontaneously, and even longer to get good statistics on it. To shorten the time needed for convergence, free energy methods rely on one basic idea: to force the system to where it doesn’t want to be, and then measure by how much it doesn’t want to be there. In potential of mean force, we apply mechanical force to decouple ligand and receptor. In alchemical free energy computation methods, we transform the system through a set of unphysical (alchemical) states, that connect two physical states we are interested in. To do so, the interaction strength between a molecule of interest and the rest of the system is coupled to a variable \(\lambda\):

\[E_{total} = E_{ligand-ligand} + E_{rest-rest} + \lambda E_{ligand-rest}\]

and we slowly turn \(\lambda\) from 1 to 0. This means we can effectively turn off a ligand, and pretend that it is no longer coupled to a system (at \(\lambda=0\)). This way we force the system to where it does not want to be (either to the solvated or to the vacuum state, depending on what the sign of the free energy difference is).

The method of coupling and de-coupling the receptor and ligand can help us with calculating the free energy of binding, because now we can create a two-step path: fep.png where we first de-couple the ligand from the solvent (top panel), and then re-solvate the ligand in the presence of the receptor (bottom panel). The free energy of binding is thus

\[\Delta G_{binding} = \Delta G_1 + \Delta G_2\]

and the simulation is split into two parts: one is calculating the de-solvation free energy, and the other is calculating the free energy of introducing the ligand into the binding site on the receptor. The first simulation is the inverse of a free energy of solvation. That last simulation couples the ligand from \(\lambda=0\) where it doesn’t interact with the system, to the state with \(\lambda=1\), where the ligand is bound to the receptor. In this tutorial we will perform only the first step of calculating the solvation energy. Partly for computational performance reasons: because there is no protein involved, the simulation box size can be small and the simulations will be fast.

## Free energy of solvation

To calculate a free energy of solvation, we calculate \(-\Delta G_1\) in the picture above, or, equivalently, \(\Delta G_{solv}\) in this picture: fe-solvation.png We’ll do this by transforming the system through a set of unphysical (alchemical) step, that connect two physical states and use AWH to move dynamically between the states. (see J. Chem. Phys. 154, 204103 (2021) for details). In a single simulation \(\lambda\) moves dynamically using Monte Carlo, a bias potential is added to \(\lambda\) to achieve uniform sampling. Hamiltonian difference are computed to all the other \(\lambda\) values

The free energy difference can be calculated directly if the \(\lambda\) states are close enough. The term ‘close enough’ here means that switching between the two states should be possible in both directions and some of the configurations of the system should be allowed in both states (i.e. the states should share some parts of phase space).

The most obvious value for first and last \(\lambda\) point would be \(\lambda=0\) and \(\lambda=1\). These end states, however, usually have very few system conformations in common: they share very little phase space. Because of this, we will never see enough transitions from one state to the other and the free energy computations will never converge in this case. That is why the transition is usually split into several intermediate unphysical states: fep-coupling.png with as many \(\lambda\) points as possible for the convergence between two nearby states (i.e. values of \(\lambda\)). We will therefore effectively ‘slowly’ turn on (or off) the interactions between the solute and the solvent. Then we will process the data from all these simulations, comparing the nearby values of \(\lambda\).

Here we will use 13 \(\lambda\) points: 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.85, 0.9, 0.95 and 1. Combining the results of all the \(\lambda\)-point will give us the total difference in free energy (i.e. the difference between the states with \(\lambda=0\) and \(\lambda=1\)).

Compare to the state-of-the-art way of doing the free energy perturbation simulations, we will take one shortcut in this tutorial. We will turn off both the electrostatic (Coulomb) interactions and the Van der Waals (Lennard-Jones) interactions at the same time. For high-quality results, these stages are normally separated, but here we will do both of them at the same time for expediency. GROMACS uses ‘soft-core’ interactions to make sure that while the normal (Lennard-Jones and Coulomb) interactions are being turned off, there will never be two point charges sitting on top of each other: this is achieved by turning on an interaction that effectively repels particles at intermediate \(\lambda\) points (in such a way that it cancels out from the free energy difference). For details on the soft-core interactions see here.

## Preparing the system Look for a file named topol.top, and a very basic coordinate file named ethanol.gro in the input files folder. This topology uses the OPLS force field and defines an ethanol molecule, and includes the definitions for SPC/E water, that will be used for solvation.

!cat input_files/topol.top
!cat input_files/ethanol.gro

Question: Take a look at the topology file topol.top. For the ethanol molecule definition, can you find which atoms are there, and how they are connected? Note, that we use standard OPLS atom types for the atoms in ethanol. These parameters were borrowed from the threonine side-chain, which has the same structure as the ethanol molecule. For more complex organic molecules one will have to look or to generate appropriate topology and coordinate files.

Before we start preparing the system, let us introduce a small piece of code that will allow us to interact with the command shell and filesystem:

import os,shutil

#simple function to write mdp files
def write_mdp(mdp_string, mdp_name, directory):

#get the path to the working directory

#path to gromacs files
if not os.path.isdir(output_files):
%cd output_files

The write_mdp(…) function, defined here will be used to create mdp configuration files directly from this notebook. With all input files copied to the working directory, we can safely start working. Note, that you can always reset this tutorial by deleting the ‘output_files’ folder and executing the box above. Below, all the lines starting with ! indicate the shell commands.

First, we need to define the simulation box, which later will be filled with water solvent. The box should fully incorporate the solute with some space to spare:

!gmx editconf -f ethanol.gro -o box.gro -bt dodecahedron -d 1.2

This sets up the simulation box as a rhombic dodecahedron with a minimum distance between the solute (the ethanol molecule) and the box edge of 1.2 nm. The rhombic dodecahedron shape of the box provides more effective packing of periodic images in comparison to rectangular boxes and hence, fewer water molecules are needed to separate the periodic images of the ethanol molecule. See the [GROMACS manual] (http://manual.gromacs.org/current/reference-manual/algorithms/periodic-boundary-conditions.html) for illustrations of this box shape and how its periodic images are arranged.

The gmx editconf command defines the box, without adding the solvent to it. The later is done by the gmx solvate command below.

!gmx solvate -cp box.gro -cs -o solvated.gro -p topol.top

This will fill the box we created on the previous step with water molecules (roughly 500 of them). The solute will be taken from the file listed in the -cp option. The -cs option can be used to specify the template system that contais solvent molecules only. Leaving it blank will make GROMACS use the default solvent layout. The command will add the coordinates for the water molecules to the end of the gro file and will update the topology. You are strongly encouraged to investigate corresponding files.

The initial placement of water molecules is not ideal and may contain atoms that are too close to one another or small patches of empty spaces. The coordinates of the solute atoms may come from the experiment, where the placement of atoms is not precise. These, very small discrepancies in the atom positioning can carry a very substantial excess of potential energy and, as a result, the system will have very high interatomic forces. So high that the numerical integration of equation of motion will be unstable even for a very short timesteps. To overcome this and make the system configuration suitable for simulation, we will first perform the energy minimization. Executing the following cell will create the mdp file with the instructions to GROMACS to perform the energy minimization. The file will then be preprocessed by grompp into a em.tpr run file - the file that contains all the data necessary to perform the minimization, including coordinates of the atoms, topology and simulation protocol. Running the mdrun on the last line will execute the energy mimimization simulations.

em_mdp=""";minimal mdp options for energy minimization
integrator               = steep
nsteps                   = 500
coulombtype              = pme
!gmx grompp -f em.mdp -c solvated.gro -o em.tpr
!gmx mdrun -v -deffnm em

## Equilibration of energy We are now ready to perform energy equilibration, a step necessary to achieve thermal equilibrium and uniformly distributh the energy over the system. For this step we will turn on pressure and temperature coupling: we are about to compute the difference in Gibbs free energy, and for that, the system must maintain both constant temperature and pressure, while the ethanol molecule is being de-coupled. For that, we will be using the velocity rescale thermostat, and the C-rescale barostat (here more on thermostat and barostat). The global equilibration is done with equil.mdp config file, constructed below. As with the mimization, the tpr file is then constructed and passed to GROMACS to run simulations.

equil_mdp=""";equilibration mdp options
integrator               = md
nsteps                   = 100000
dt                       = 0.002
nstenergy                = 100
rlist                    = 1.1
nstlist                  = 10
rvdw                     = 1.1
coulombtype              = pme
rcoulomb                 = 1.1
fourierspacing           = 0.13
constraints              = h-bonds
tcoupl                   = v-rescale
tc-grps                  = system
tau-t                    = 0.5
ref-t                    = 300
pcoupl                   = C-rescale
ref-p                    = 1
compressibility          = 4.5e-5
tau-p                    = 1
gen-vel                  = yes
gen-temp                 = 300
!gmx grompp -f equil.mdp -c em.gro -o equil.tpr
!gmx mdrun -deffnm equil

Here you can use the mdrun option -ntmpi 1 to specify only using one mpi rank since the system is so small. For larger systems more mpi ranks would be appropriate. For very small systems such as this one, it would also be possible to use -nt 1 to specify only using one processor. You may want to adjust the number of cores used, but given the small system size, even one core should be sufficient to complete the run in about a minute. After this step we should be ready with a hopefully equilibrated configuration of ethanol in water, saved in equil.gro.

Question: You should check whether the system has been equilibrated. How could you do this?

Note: if you do not want to wait, but go on with the tutorial, copy the data from the /reference directory into the output_files directory. To do this from within this notebook, remove the comment characters (#) in the following cell

## ONLY execute the lines below if you do not want to run and wait for the simulations to finish.
## Before executing the command below, check that you are in output_files directory
#!cp ../reference/equil.edr ./
#!cp ../reference/equil.gro ./

Creating the \(\lambda\) points

Now the system preparation is complete and we are ready to do a production simulations. First, we need to define different \(\lambda\) points, each of which indicate the alchemical state along the transition between solvated and in vacuo states. We define those states in the mdp configuration file below with consecutive values of \(\lambda\). The free energy settings state the following: take the molecule ethanol, and couple it to our variable \(\lambda\). This is done so that \(\lambda=0\) means that the molecule is de-coupled, and \(\lambda=1\) means that the molecule is fully coupled. Take a look at the couple-lambda0 and couple-lambda1 options in the file. There, vdwq means that both Lennard Jones and Coulomb will be affected by the value of \(\lambda\), hence the solute molecule will be fully decoupled. What is decoupled is be selected by the couple-moltype option (ethanol in our case). The sc-power, sc-sigma and sc-alpha settings control the ‘soft-core’ interactions that prevent the system from having overlapping particles as it is being de-coupled. In this tutorial we use Beutler’s softcore developed by Beutler and coworkers.

A full description of the simulation parameters for free energy calculation in GROMACS can be found here.

fep_awh_mdp="""; we'll use the sd integrator (an accurate and efficient leap-frog stochastic dynamics integrator)
integrator               = sd
nsteps                   = 700000 ; equilivate to 100000 step for 7 lambda points
dt                       = 0.002
nstenergy                = 1000
nstcalcenergy            = 50 ; should be a divisor of nstdhdl
nstlog                   = 5000
; cut-offs at 1.0nm
rlist                    = 1.1
rvdw                     = 1.1
; Coulomb interactions
coulombtype              = pme
rcoulomb                 = 1.1
fourierspacing           = 0.13
; Constraints
constraints              = h-bonds
; set temperature to 300K
tc-grps                  = system
tau-t                    = 2.0
ref-t                    = 300
; set pressure to 1 bar with a thermostat that gives a correct
; thermodynamic ensemble
pcoupl                   = C-rescale
ref-p                    = 1.0
compressibility          = 4.5e-5
tau-p                    = 5.0

; and set the free energy parameters
free-energy              = yes
couple-moltype           = ethanol
nstdhdl                  = 50 ; frequency for writing energy difference in dhdl.xvg, 0 means no ouput, should be a multiple of nstcalcenergy.
; these 'soft-core' parameters make sure we never get overlapping
; charges as lambda goes to 0
; soft-core function
sc-power                 = 1
sc-sigma                 = 0.3
sc-alpha                 = 1.0
; we still want the molecule to interact with itself at lambda=0
couple-intramol          = no
couple-lambda1           = vdwq
couple-lambda0           = none
; here we provide the lambda values we want to start
init-lambda-state        = 12
calc-lambda-neighbors    = -1 ; A value of -1 means all lambda points will be written out
; NOTE for BAR a value of 1 is sufficient,
; while for MBAR or AWH -1 should be used.
; These are the lambda states at which we simulate
; for separate LJ and Coulomb decoupling, use
fep-lambdas              = 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.85 0.9 0.95 1.0
; awh setting
awh                        = yes
; we apply  potential bias by controlling the position of an
; harmonic potential using Monte-Carlo sampling.
awh-potential              = umbrella ;
awh-nstout                 = 50000 ; save frequency to energy file (multiple of nstenergy)
awh-nbias                  = 1  ; n. of bias
awh-nstsample              = 50 ; multiple of nstcalcenergy
awh-nsamples-update        = 10
; initial biasing rate is determited by awh1-error-init together with
; the given diffusion constant(s) awh1-dim1-diffusion
awh1-error-init            = 10.0 ;
awh1-dim1-diffusion        = 0.001 ; as high as possible (10^2 to 10^4 ps^(-1))
awh1-growth                = exp-linear ;
awh1-equilibrate-histogram = no ;
; bias tuned towards a constant (uniform) coordinate distribution
awh1-target                = constant ;
; awh1-user-data             = no
; Number of dimensions of the coordinate
awh1-ndim                  = 1
awh1-dim1-coord-provider   = fep-lambda  ; free energy lambda state = reaction coordinate
awh1-dim1-coord-index      = 1 ; might not be necessary with fep-lambda
awh1-dim1-start            = 0
awh1-dim1-end              = 12

Note a simple way of parallelizing sampling with AWH is to launch multiple simultaneous trajectories, so called “walkers”, and have them share a common bias and free energy estimate. This can reduce the length of the initial stage significantly. To enable multiwalker, use the option awh-share-multisim=yes and awh1-share-group = positive in the mdp file and gmx mdrun with the option -multidir. But this is not necessary for the system of our size. You may want to use more than one core though.

!gmx grompp -f fep_awh.mdp -c equil.gro -o fep_awh.tpr
!gmx mdrun -deffnm fep_awh

Now we have a look at the files we got:


Here we have obtained the information on all the lambda points one simulation of 1400 ps, while when we have run the solvation free energy tutorial using BAR, we have a directory for each lambda point and we have run 200 ps for each point.

Question: Did we use the same lambda points between the two approaches? and what is the difference in the total simulation time?

Note: if you do not want to wait, but look at some of the results directly, copy the data from the /reference directory into the output_files directory. To do this from within this notebook, remove the comment characters (#) in the following cell

## ONLY execute the lines below if you do not want to run and wait for the simulations to finish. Before executing the command below, check that you are in output_files directory
#!cp ../reference/fep_awh.edr ./

Post-processing: extracting the free energy

Data directly related to AWH is extracted using the GROMACS tool gmx awh. The data is stored in the energy file, ener.edr. The default output file name is of the format awh_t.xvg; there will be one file for each AWH output time, or less if you use the -skip flag. Let’s dump the output into a separate directory.

!ls awh-data
!mkdir -p awh-data
!gmx awh -quiet -s fep_awh.tpr -more -f fep_awh.edr -o awh-data/awh.xvg

Above, the -more flag tells gmx awh to extract more than the PMF from the AWH files, e.g. histograms. The default gmx unit of energy is kJ/mol, but with the flag -kT we get units of kT.

Let’s have a quick look at the contents of one of these files (from time 1000 ps). You can use xmgrace on your local machine to plot the file, by removing the comment character (#) in the following cell to do that.

For plotting and data analysis we provide a Python function to read the .xvg file format into this notebook.

import numpy as np
import matplotlib.pyplot as plt

# Function that reads data from an .xvg file
def read_xvg(fname):
    with open(fname) as f:
        for line in f:
            # Lines with metadata or comments start with #, @
            if not line.startswith(("@","#")):
                data.append(np.array([float(s) for s in line.split()]))
    data = np.vstack(data)
    return data
t=1000 # ps
# -nxy to plot multicolumn data
awh_data = read_xvg("awh-data/awh_t{}.xvg".format(t)).T

fig,ax = plt.subplots()

# uncomment the line below if you have xmgrace installed
# !xmgrace -free -nxy awh-data/awh_t{t}.xvg # wait for it...

Now we look closer at the time-evolution of the free energy (you can do the same for either variable in the awh.xvg file).

# A list of all AWH files
fnames = !ls awh-data/awh_t*xvg

# Extract time of each file
times = [float(re.findall('awh_t(.+?).xvg', fname)[0]) for fname in fnames]

# Sort the files chronologically
fnames = [f for (t,f) in sorted(zip(times, fnames))]
print("Number of files:", len(fnames))
print("Time in between files ", times[1] - times[0], 'ps')
print("Last time", times[-1], 'ps')

# Plot the PMF from first files/times
istart = 0 # Start plotting from this file index
nplot = 15  # Number of files to plot
for fname in fnames[istart:istart+nplot]:
    labels.append(re.findall('awh_t(.+?).xvg', fname)[0] + ' ps') # use the time as label
    plt.plot(data[:,0], data[:,1])

Try increasing the istart variable above to see how the free energy estimates are changing (less and less) over time. Note: this convergence does not easily translate into error estimates for the solvation free energy. To get such error bars the simplest is to run multiple (independent) AWH simulations and calculate standard errors from that.

Something that you may want to also check is the friction.

!gmx awh -fric awh-data/friction.xvg -b 1400 -f fep_awh.edr -s fep_awh.tpr -quiet
!ls -alt awh-data/friction*
t=1400 # ps
awh_data = read_xvg("awh-data/friction_t{}.xvg".format(t)).T

fig,ax = plt.subplots()

# uncomment the line below if you have xmgrace installed
# -nxy to plot multicolumn data
# !xmgrace -free awh-data/friction_t{t}.xvg # wait for it...

To obtain the solvation free energy we have to substruct the values at first lambda point (lambda=0) to the values at last lambda point (lambda=1) at time 1400 ps (last point in time).

!cat awh-data/awh_t1400.xvg
lambda_1 =    #copy here number from lambda point 12 - value in the 2nd column
lambda_0 =   #copy here number from lambda point 0 - value in the 2nd column
number = lambda_1 - lambda_0
print(number)       # solvation free energy in kJ/mol

Look for the final number above, it should be close to the experimental value. With appropriate sampling you should get a value of approximately -19.1 +/- 0.3 kJ/mol (see for example J. Phys. Chem. B 2006).

Question: you can run multiple (independent) AWH simulations and calculate standard errors for the free energy. To run indipendent simulations you can for example use different starting velocities distribution. Try it. What is the standard error? How could it be improved? Also you can try to longer runs to get better convergency.

Question: Why can there sometimes be a significant (larger than the standard error) difference between the experimental result and the simulation result? How could this be improved?

## Topology example for relative free energy of solvation As example we take the alchemical transformation from ethanol to ethanethiol in water solution. Below we look at the topology file that describes the transformation from ethanol (state A) to ethanethiol (state B

#Before executing the command below, check that you are in output_files directory
!cat ../input_files/ethanol_Sethane.top

Note that atom mass has to be kept constant between state A and B for technical reason. That will not affect your free energy difference

## Where to go from here After calculating the free energy of solvation, we’ve solved the first part of the free energy of binding of the earlier equations. The second part involves coupling a molecule into (or out of) a situation where it is bound to a protein. This introduces one additional complexity: we end up with a situation where a weakly coupled ligand wanders through our system: weak-couple.png which is bad because this is a poorly reversible situation: there are suddenly very few states that map from a weakly coupled to a more strongly coupled molecule, which will drastically reduce the accuracy of the free energy calculation.

This situation can be remedied by forcing the ligand to stay at a specific position relative to the protein. This can be done with the GROMACS ‘pull code’, which allows the specification of arbitrary forces or constraints onto with respect to centers of mass of any chosen set of atoms onto any other group of atoms. With a pull type of ‘umbrella’, we can specify that we want a quadratic potential to this specified location, forcing the ligand to stay at its native position even when it has been fully de-coupled.

One way find out where to put the center of the force is by choosing a group of atoms in the protein close to the ligand, and doing a simulation with full ligand coupling, where the pull code is enabled, but with zero force. The pull code will then frequently output the coordinates of the ligand, from which an average position and an expected deviation can be calculated. This can then serve as a reference point for the location of the center of force for the pull code during the production runs, and the force constant of the pull code.

Once the free energy has been calculated, care must be taken to correct for the fact that we have trapped our molecule. This can easily be done analytically.

Optional Question: Given a measured standard deviation in the location of the center of mass of our ligand, how do we choose the force constant for the pull code?

Optional Question: How do we correct for using the pull code: what is the contribution to the free energy of applying a quadratic potential to a molecule?