.. code:: ipython3 # Execute this first # # * trigger notebook styling # * check if notebook had been modified since its distribution # # Note: executing any cells before this modifies the notebook. # %run src/init_notebooks.py hide_toggle() check_notebook() Molecular dynamics simulation of a small protein using GROMACS ============================================================== :: authors : Alessandra Villa (based on a Justin Lemkuhl tutorial see http://www.mdtutorials.com or Living J. Comp. Mol. Sci. 2018, 1, 5068). goal : learn step-by-step how to run a molecular dynamics simulation of a small protein using GROMACS time : 90 minutes software : GROMACS 2023, python modules: numpy, matplotlib, re, nglviewer, md_traj, panda. optional software: visualization software [VMD](https://www.ks.uiuc.edu/Research/vmd), Xmgrace plotting tool tutorial source: tutorials.gromacs.org version : release Preparations to run this notebook ================================= .. code:: ipython3 # Change to the data directory # Note that executing this command twice will result in an error you can ignore %cd data Obtaining the input for a simulation ==================================== The starting point for each simulation is a molecular structure file. For this tutorial, we will utilize Factor Xa, a protein playing critical role in the formation of blood clots. The 3D structure is available from the RCSB website, https://www.rcsb.org/ with PDB code 1FJS. You can find the PDB file for the crystal structure in “input” directory as “1fjs.pdb”. Now we visualize the structure .. code:: ipython3 import nglview as ng view = ng.show_structure_file("input/1fjs.pdb") view # click and drag to rotate, zoom with your mouseweel # for more infor on this viewer have a look at https://github.com/nglviewer/nglview In alternative, you can use VMD to visualize the structure on your local machine. To run VMD from within this notebook, remove the comment character (#) in the following cell and VMD should pop up: **``close the VMD window after you are done looking at the protein to continue with this notebook``** .. code:: ipython3 #!vmd input/1fjs.pdb Cleaning the input structure ---------------------------- Once you’ve had a look at the molecule, you are going to want to strip out all the atoms that do not belong to the protein (e.i crystal waters, ligands, etc). To delete those atoms (labelled “HETATM” in the PDB file) and eventually their connectivity, either use a plain text editor like vi, emacs (Linux/Mac), or Notepad (Windows). Do not use word processing software! Alternatively, you can use grep to delete these lines very easily: .. code:: ipython3 !grep -v HETATM input/1fjs.pdb > 1fjs_protein_tmp.pdb !grep -v CONECT 1fjs_protein_tmp.pdb > 1fjs_protein.pdb View the cleaned structure .. code:: ipython3 import nglview as ng view = ng.show_structure_file("1fjs_protein.pdb") view # click and drag to rotate, zoom with your mouseweel # for more infor on this viewer have a look at https://github.com/nglviewer/nglview In alternative, you can use VMD to visualize the structure on your local machine. To run VMD from within this notebook, remove the comment character (#) in the following cell and VMD should pop up: .. code:: ipython3 #!vmd 1fjs_protein.pdb Note *Such a procedure is not universally appropriate (e.g., the case of a tightly bound ligand or otherwise functional active-site water molecule).* Always check your .pdb file for entries listed under the comment MISSING, as these entries indicate either atoms or whole residues that are not present in the crystal structure. Terminal regions may be absent, and may not present a problem for dynamics. .. code:: ipython3 !grep MISSING input/1fjs.pdb Generating a topology --------------------- Now we have verified that all the necessary atoms are present and the PDB file contains only protein atoms, and is ready to be input into GROMACS (see GROMACS documentation ). The first GROMACS tool, we use, is ```gmx pdb2gmx`` `__. The purpose of ``gmx pdb2gmx`` is to generate three files: - The topology for the molecule. - A position restraint file. - A post-processed structure file. The topology (topol.top by default) contains all the information necessary to define the molecule within a simulation. This information includes nonbonded parameters (atom types and charges) as well as bonded parameters (bonds, angles, dihedrals and atom connectivity). We will take a more detailed look at the topology once it has been generated. Note *Incomplete internal sequences or any amino acid residues that have missing atoms will cause ``gmx pdb2gmx`` to fail. These missing atoms/residues must be modeled in using other software packages. Also note that ``gmx pdb2gmx`` is not magic. It cannot generate topologies for arbitrary molecules, just the residues defined by the force field (in the*.rtp files - generally proteins, nucleic acids, and a very finite amount of cofactors, like NAD(H) and ATP).\* Execute ``gmx pdb2gmx`` by issuing the following command: .. code:: ipython3 !gmx pdb2gmx -f 1fjs_protein.pdb -o 1fjs_processed.gro -water tip3p -ff "charmm27" Here, we made an important decision for the course of the simulation in choosing the CHARMM27 all-atom force field. The force field will contain the information that will be written to the topology. This is a very important choice! You should always read thoroughly about each force field and decide which is most applicable to your situation. Other choices are given, when running ``gmx pdb2gmx`` without the -ff flag :: Select the Force Field: From '/usr/local/gromacs/share/gromacs/top': 1: AMBER03 protein, nucleic AMBER94 (Duan et al., J. Comp. Chem. 24, 1999-2012, 2003) 2: AMBER94 force field (Cornell et al., JACS 117, 5179-5197, 1995) 3: AMBER96 protein, nucleic AMBER94 (Kollman et al., Acc. Chem. Res. 29, 461-469, 1996) 4: AMBER99 protein, nucleic AMBER94 (Wang et al., J. Comp. Chem. 21, 1049-1074, 2000) 5: AMBER99SB protein, nucleic AMBER94 (Hornak et al., Proteins 65, 712-725, 2006) 6: AMBER99SB-ILDN protein, nucleic AMBER94 (Lindorff-Larsen et al., Proteins 78, 1950-58, 2010) 7: AMBERGS force field (Garcia & Sanbonmatsu, PNAS 99, 2782-2787, 2002) 8: CHARMM27 all-atom force field (CHARM22 plus CMAP for proteins) 9: GROMOS96 43a1 force field 10: GROMOS96 43a2 force field (improved alkane dihedrals) 11: GROMOS96 45a3 force field (Schuler JCC 2001 22 1205) 12: GROMOS96 53a5 force field (JCC 2004 vol 25 pag 1656) 13: GROMOS96 53a6 force field (JCC 2004 vol 25 pag 1656) 14: GROMOS96 54a7 force field (Eur. Biophys. J. (2011), 40,, 843-856, DOI: 10.1007/s00249-011-0700-9) 15: OPLS-AA/L all-atom force field (2001 aminoacid dihedrals) There are many other options that can be passed to ``gmx pdb2gmx`` (see http://manual.gromacs.org/documentation/current/onlinehelp/gmx-pdb2gmx.html). Some commonly used ones are listed here: +-----------------------------------+-----------------------------------+ | Option | Effect | +===================================+===================================+ | -water | Water model to use: none, spc, | | | spce, tip3p, tip4p, tip5p, | | | tips3p. | +-----------------------------------+-----------------------------------+ | -ignh | Ignore H atoms in the PDB file; | | | especially useful for NMR | | | structures. Otherwise, if H atoms | | | are present, they must be in the | | | named exactly how the force | | | fields in GROMACS expect them to | | | be. Different conventions exist, | | | so dealing with H atoms can | | | occasionally be a headache! If | | | you need to preserve the initial | | | H coordinates, but renaming is | | | required, then the Linux sed | | | command is your friend. | +-----------------------------------+-----------------------------------+ | -ter | Interactively assign charge | | | states for N- and C-termini. | +-----------------------------------+-----------------------------------+ | -inter | Interactively assign charge | | | states for Glu, Asp, Lys, Arg, | | | and His; choose which Cys are | | | involved in disulfide bonds. | +-----------------------------------+-----------------------------------+ A peek at the generated files ----------------------------- .. code:: ipython3 !ls You have now generated three new files: 1fjs_processed.gro, topol.top, topol_Protein_chain_X.itp and posre_Protein_chain_X.itp. 1fjs_processed.gro is a GROMACS-formatted structure file that contains all the atoms defined within the force field (i.e., H atoms have been added to the amino acids in the protein). The topol.top file is the system topology (more on this in a minute). The posre files contain information used to restrain the positions of heavy atoms (more on this later). One final note: many users assume that a .gro file is mandatory. This is not true. GROMACS can handle many different file formats, with .gro simply being the default for commands that write coordinate files. It is a very compact format, but it has limited precision. If you prefer to use, for instance, PDB format, all you need to do is to specify an appropriate file name with .pdb extension as your output. The purpose of ``gmx pdb2gmx`` is to produce a force field-compliant topology; the output structure is largely a side effect of this purpose and is intended for user convenience. The format can be just about anything you like (see http://manual.gromacs.org/documentation/current/reference-manual/file-formats.html for different formats). Understanding molecule “topologies” =================================== Let’s look at what is in the output topology (topol.top). Again, using a plain text editor, inspect its contents. After several comment lines (preceded by ``;``), you will find the following: .. code:: ipython3 !cat topol.top The first line calls the parameters within the chosen force field. It is at the beginning of the file, indicating that all subsequent parameters are derived from this force field. Then next important line is [ moleculetype ], that you find in .. code:: ipython3 !grep "moleculetype" -A 3 topol_Protein_chain_A.itp and .. code:: ipython3 !grep "moleculetype" -A 3 topol_Protein_chain_L.itp The name “Protein_chain_A” defines the molecule name, based on the fact that the protein was labeled as chain A in the PDB file. There are 3 exclusions for bonded neighbors. More information on exclusions can be found in the GROMACS manual. Atoms in a topology ------------------- The next section defines the [ atoms ] in the protein. The information is presented as columns: .. code:: ipython3 ! grep "atoms" -A 4 topol_Protein_chain_A.itp The interpretation of this information is as follows: +-----------------------------------+-----------------------------------+ | Field | description | +===================================+===================================+ | nr | Atom number | +-----------------------------------+-----------------------------------+ | type | Atom type | +-----------------------------------+-----------------------------------+ | resnr | Amino acid residue number | +-----------------------------------+-----------------------------------+ | residue | The amino acid residue name- Note | | | that this may be different from | | | .rtp entry. | +-----------------------------------+-----------------------------------+ | atom | Atom name | +-----------------------------------+-----------------------------------+ | cgnr | Charge group number - Not used | | | anymore | +-----------------------------------+-----------------------------------+ | charge | Self-explanatory - The “qtot” | | | descriptor is a running total of | | | the charge on the molecule | +-----------------------------------+-----------------------------------+ | mass | Also self-explanatory | +-----------------------------------+-----------------------------------+ | typeB, chargeB, massB | Used for free energy perturbation | | | (not discussed here) | +-----------------------------------+-----------------------------------+ Bonds and other interactions ---------------------------- Subsequent sections include [ bonds ], [ pairs ], [ angles ], and [ dihedrals ]. Some of these sections are self-explanatory (bonds, angles, and dihedrals). These interactions are found `here `__ and special 1-4 interactions are included under `pairs `__. For format associated to function types `see topology file table `__ in GROMACS manual. Below we look for these interactions in topology file. As example we start with chain A. Bonded interactions .. code:: ipython3 !grep "bonds" -A 2 topol_Protein_chain_A.itp Pair interactions .. code:: ipython3 !grep "pairs" -A 2 topol_Protein_chain_A.itp Angle interactions .. code:: ipython3 !grep "angles" -A 2 topol_Protein_chain_A.itp Dihedral interactions .. code:: ipython3 !grep "dihedrals" -A 2 topol_Protein_chain_A.itp Water and position restraints ----------------------------- The remainder of the file involves including a few other useful/necessary topologies, starting with position restraints. The “posre.itp” file was generated by ``gmx pdb2gmx``; it defines a force constant used to keep atoms in place during equilibration (more on this later). .. code:: ipython3 !grep "posre" topol*.itp .. code:: ipython3 !cat posre_Protein_chain_A.itp Here ends the “Protein_chain_A” moleculetype definition. Then the “Protein_chain_L” moleculetype definition starts. The remainder of the topology file is dedicated to defining other molecules and providing system-level descriptions. The next moleculetype (by default) is the solvent, in this case TIP3P water. Other typical choices for water include SPC, SPC/E, and TIP4P. We chose this by passing “-water tip3p” to ``gmx pdb2gmx``. For an excellent summary of the many different water models, click `here `__ , but be aware that not all of these models are present within GROMACS. Ions and other parameters ------------------------- Ion parameters are included next: .. code:: ipython3 !grep "ions" topol.top System level definitions ------------------------ Finally come system-level definitions. The [ system ] directive gives the name of the system that will be written to output files during the simulation. The [ molecules ] directive lists all of the molecules in the system. .. code:: ipython3 !tail -8 topol.top A few key notes about the ``[ molecules ]`` directive: - The order of the listed molecules must exactly match the order of the molecules in the coordinate (in this case, .gro) file. - The names listed must match the ``[ moleculetype ]`` name for each species, not residue names or anything else. If you fail to satisfy these concrete requirements at any time, you will get fatal errors from grompp (discussed later) about mismatched names, molecules not being found, or a number of others. Now that we have examined the contents of a topology file, we can continue building our system. Solvating the simulation system =============================== Now that you are familiar with the contents of the GROMACS topology, it is time to continue building our system. In this example, we are going to be simulating a simple aqueous system. It is possible to simulate proteins and other molecules in different solvents, provided that good parameters are available for all species involved. There are two steps to defining the box and filling it with solvent: - Define the box dimensions using the ```gmx editconf`` `__ tool. - Fill the box with water using the ```gmx solvate`` `__ tool. You are now presented with a choice as to how to treat the unit cell. For the purpose of this tutorial, we will use the rhombic dodecahedron, as its volume is ~71% of the cubic box of the same periodic distance, thus saving on the number of water molecules that need to be added to solvate the protein. Defining the simulation box --------------------------- Let’s define the box using ``gmx editconf``: .. code:: ipython3 !gmx editconf -f 1fjs_processed.gro -o 1fjs_newbox.gro -c -d 1.0 -bt dodecahedron The above command centers the protein in the box (-c), and places it at least 1.0 nm from the box edge (-d 1.0). The box type is defined as a rhombic dodecahedron (-bt dodecahedron). Have a look at this `section `__ in the manual for more on periodic boundary conditions and dodecahedrons. The distance to the edge of the box is an important parameter. A protein should never interact with its periodic image (minimum image convention), otherwise the forces calculated will be spurious. The minimum image convention implies that the distance between two periodic images of the protein should be larger than the cut-off radius used to truncate non-bonded interactions. Here we will use a cut-off radius of 1.2 nm (see below). Specifying a solute-box distance of 1.0 nm will mean that there are at least 2.0 nm between any two periodic images of a protein. We expect that at that distance all protein-protein interactions are negligible. Filling the box with water -------------------------- .. code:: ipython3 !gmx solvate -cp 1fjs_newbox.gro -cs spc216.gro -o 1fjs_solv.gro -p topol.top The configuration of the protein (-cp) is contained in the output of the previous ``gmx editconf`` step, and the configuration of the solvent (-cs) is part of the standard GROMACS installation. We are using spc216.gro, which is a generic equilibrated 3-point solvent model box. You can use spc216.gro as the solvent configuration for SPC, SPC/E, or TIP3P water, since they are all three-point water models. The output is called 1fjs_solv.gro, and we tell solvate the name of the topology file (topol.top) so it can be modified. Note the changes to the [ molecules ] directive of topol.top: .. code:: ipython3 !tail topol.top ``gmx solvate`` kept track of how many water molecules it has added, which it then writes to your topology to reflect the changes that have been made. Note that if you use any other (non-water) solvent,\ ``gmx solvate`` will not make these changes to your topology! Its compatibility with updating water molecules is hard-coded. Adding Ions =========== We now have a solvated system that contains a charged protein. The output of ``pdb2gmx`` told us that the protein has a net charge of -2e (based on its amino acid composition). If you missed this information in the ``pdb2gmx`` output, look at the last line of each [ atoms ] directive in topology file; it should read “qtot 1.” for chain A and “qtot -3.” for chain L. Since life does not exist at a net charge, we must add ions to our system. Further, we aim to approximate physiological conditions and use therefore a NaCl concentration of 0.15 M. Preparing the input for “gmx genion” ------------------------------------ The tool for adding ions within GROMACS is called ```gmx genion`` `__. What ``gmx genion`` does is read through the topology and replace water molecules with the ions that the user specifies. The input is called a run input file, which has an extension of .tpr; this file is produced by the GROMACS grompp module (GROMACS pre-processor), which will also be used later when we run our first simulation. ``grompp`` processes the coordinate file and topology (which describes the molecules) to generate an atomic-level input (``.tpr``). The .tpr file contains all the parameters for all of the atoms in the system. To produce a .tpr file with ```gmx grompp`` `__, we will need an additional input file, with the extension .mdp (molecular dynamics parameter file); ``gmx grompp`` will assemble the parameters specified in the .mdp file with the coordinates and topology information to generate a .tpr file. An .mdp file is normally used to run energy minimization or an MD simulation, but in this case is simply used to generate an atomic description of the system. We can proceed with an completely empty .mdp file in this case, its only role is to create the .tpr file. .. code:: ipython3 !touch ions.mdp Assemble your .tpr file with the following: .. code:: ipython3 !gmx grompp -f ions.mdp -c 1fjs_solv.gro -p topol.top -o ions.tpr Be aware that there are some ``NOTE`` in the output. In all other cases than generating an output ``tpr`` for ``gmx genion`` these are very important and may not be ignored, here however we only need the atomic-level description of our system in the binary file ions.tpr. We will pass this file to genion: *Note: A lot of GROMACS tools promt for input at the command line. As this does not play well with the jupyter notebook setup, we feed the input to the command line with a print command and a newline :raw-latex:`\n `substituting enter* .. code:: ipython3 !printf "SOL\n" | gmx genion -s ions.tpr -o 1fjs_solv_ions.gro -conc 0.15 -p \ topol.top -pname NA -nname CL -neutral We chose group “SOL” for embedding ions. You do not want to replace parts of your protein with ions. *Note: Make sure to run ``gmx genion`` only once. ``gmx genion`` edits the topology “in-place” and does not know if there are already Cl or Na ions in the system. If you run it over again, ions will be added until there is no water left to replace with ions.* In the ``gmx genion`` command, we provide the structure/state file (-s) as input, generate a .gro file as output (-o), process the topology (-p) to reflect the removal of water molecules and addition of ions, define positive and negative ion names (-pname and -nname, respectively), and tell ``gmx genion`` to add ions necessary to neutralize the net charge on the protein by adding the correct number of negative ions (-neutral, which in this case will add 2 Na+ ions to offset the -2 charge on the protein). We further use genion to add a specified concentration of ions in addition to simply neutralizing the system by specifying the -neutral and -conc options in conjunction. Refer to the genion man page for information on how to use these options. The names of the ions specified with -pname and -nname GROMACS are standardized and not dependent on the force-field. The specified ion names are always the elemental symbol in all capital letters, which is the [ moleculetype ] name that is then written to the topology. Residue or atom names may or may not append the sign of the charge (+/-), depending on the force field. Do not use atom or residue names in the genion command, or you will encounter errors in subsequent steps. Your [ molecules ] directive should now be similar to :: [ molecules ] ; Compound #mols Protein_chain_A 1 Protein_chain_L 1 SOL 11703 NA 38 CL 36 Let’s check the last lines of the topology file .. code:: ipython3 !tail -6 topol.top If you see multiple lines of NA and CL, you most likely ran genion multiple times. Start over with the ``pdb2gmx``. Energy minimisation =================== The solvated, electroneutral system is now assembled. Before we can begin dynamics, we must ensure that the system has no steric clashes or inappropriate geometry. The structure is relaxed through a process called energy minimization (EM). To perform energy minimization, we are once again going to use ``gmx grompp`` to assemble the structure, topology, and simulation parameters into a binary input file (.tpr), then we will use GROMACS MD engine, mdrun, to run the energy minimization. Assemble the binary input using ``gmx grompp`` using the .mdp parameter file, as input. The simulation parameter file (.mdp) determines how the simulation shall be run. Find more information on all the options in the `manual `__ or in the following `webinar `__. There are a lot of parameters that can be set, here we ony set the elemental parametes and leave everything else as default. Let’s have a look at the input file: .. code:: ipython3 !cat input/emin-charmm.mdp .. code:: ipython3 !gmx grompp -f input/emin-charmm.mdp -c 1fjs_solv_ions.gro -p topol.top -o em.tpr Make sure you have been updating your topol.top file when running genbox and genion, or else you will get lots of nasty error messages (“number of coordinates in coordinate file does not match topology,” etc). Once run, we will find the energy-minimized structure in a file called ``em.gro``. Additionally to this we will find more information on the run in an ASCII-text log file of the EM process, ``em.log``, a file for storage of energy, ``em.edr`` and a binary full-precision trajectory ``em.trr``. We are now ready to run mdrun to carry out the energy minimisation: .. code:: ipython3 !gmx mdrun -v -deffnm em The -v flag is for the impatient: it makes ``gmx mdrun`` verbose, such that it prints its progress to the screen at every step. Never use this flag if run in background, on a HPC center or on a local cluster, it prints unnecessary data in the standard output file. The -deffnm flag will define the file names of the input and output. So, if you did not name your grompp output “em.tpr,” you will have to explicitly specify its name with the ``gmx mdrun`` -s flag. *Note: if you do not want to wait, but look at some of the results directly, copy the data from the /reference directory into the current directory. To do this from within this notebook, remove the comment characters (#) in the following cell* .. code:: ipython3 ## ONLY execute the lines below if you do not want to run and wait for the simulation to finish #!cp reference/em_charmm.edr em.edr #!cp reference/em_charmm.gro em.gro Determining if the run was successful ------------------------------------- There are two very important factors to evaluate to determine if EM was successful. The first is the potential energy (printed at the end of the EM process, even without -v). Epot should be negative and (for a simple protein in water) on the order of 100000 kJ/mol, depending on the system size and number of water molecules The second important feature is the maximum force, Fmax, the target for which was set in minim.mdp - “emtol = 1000.0” - indicating a target Fmax of no greater than 1000 kJ/(mol nm). It is possible to arrive at a reasonable Epot with Fmax > emtol. If this happens, your system may not be stable enough for simulation. Evaluate why it may be happening, and perhaps change your minimization parameters (integrator, emstep, etc). Analysing the run results ------------------------- Let’s do a bit of analysis. The em.edr file contains all of the energy terms that GROMACS collects during EM. You can analyze any .edr file using the GROMACS energy module. To analyse or visualize simulation data in python or jupyter notebooks, we can output a simplified xvg format from gmx-analysis tools with the option ``-xvg none`` .. code:: ipython3 !printf "Potential\n0\n" | gmx energy -f em.edr -o potential.xvg -xvg none You will be shown the average of Epot, and a file called “potential.xvg” will be written. To plot this data file, you can use the script below. The resulting plot should show a steady convergence of Epot. .. code:: ipython3 import pandas as pd df = pd.read_csv('potential.xvg', sep='\s+', header=None, names=['step','energy']) df.plot('step') In alternative you can plot data file using Xmgrace plotting tool by directly output xvg format from the analysis tools. To use xmgrace from this notebook, remove the comment character (#) in the following cells: .. code:: ipython3 #!printf "Potential\n0\n" | gmx energy -f em.edr -o potential.xvg To plot this data file, now you will need the Xmgrace plotting tool. .. code:: ipython3 #!xmgrace potential.xvg Now that our system is at an energy minimum, we can begin real dynamics. Position restraints =================== EM ensured that we have a reasonable starting structure, in terms of geometry and solvent orientation. To begin real dynamics, we must equilibrate the solvent and ions around the protein. If we were to attempt unrestrained dynamics at this point, the system may collapse. The reason is that the solvent is mostly optimized within itself, and not necessarily with the solute, and ions are randomly placed by replacing water molecules. Remember that posre.itp file that pdb2gmx generated a long time ago? We’re going to use it now. The purpose of posre.itp is to apply a position restraining force on the heavy atoms of the protein (anything that is not a hydrogen). Movement is permitted, but only after overcoming a substantial energy penalty. The utility of position restraints is that they allow us to relax our solvent and ions around our protein, without the added variable of structural changes in the protein. The origin of the position restraints (the coordinates at which the restraint potential is zero) is provided via a coordinate file passed to the -r option of grompp. Depending from the protein and ion types, this process may also be in the order nanoseconds. To use position restraints we need to add "define = -DPOSRES" to the simulation parameter file, .mdp, (more details on the simulation parameters below). Have a look at the .mdp file for this run: .. code:: ipython3 !head -5 input/nvt-charmm.mdp When using position restraints, a file with restraint coordinates must be supplied with -r to gmx grompp (see below). It can be the same file as supplied for -c.  Equilibration run - temperature =============================== EM ensured that we have a reasonable starting structure, in terms of geometry and solvent orientation. Now the system needs to be brought to the temperature we wish to simulate and establish the proper orientation about the solute (the protein). After we arrive at the correct temperature (based on kinetic energies), we will apply pressure to the system until it reaches the proper density. Equilibration is often conducted in two phases. The first phase is conducted under an NVT ensemble (constant Number of particles, Volume, and Temperature). This ensemble is also referred to as “isothermal-isochoric” or “canonical.” The timeframe for such a procedure is dependent upon the contents of the system. Typically, 100-200 ps should suffice, and we will conduct a 100-ps NVT equilibration for this exercise. Depending on your machine, this may take a while (just under an hour if run in parallel on 16 cores or so). We will call grompp and mdrun just as we did at the EM step, but this time with the energy minimised structure as input and a different .mdp file for the run. Have a look at the input file for this run: .. code:: ipython3 !cat input/nvt-charmm.mdp Instead of energy tolerance, we now give time a step size and a number of steps. Furthermore, we need to set a temperature. Now, we’re good to run. Take note of a few parameters in the .mdp file: - ``gen_vel = yes``: Initiates velocity generation. Using different random seeds (gen_seed) gives different initial velocities, and thus multiple (different) simulations can be conducted from the same starting structure. - ``tcoupl = V-rescale``: The velocity rescaling thermostat is an improvement upon the Berendsen weak coupling method, which did not reproduce a correct kinetic ensemble. - ``pcoupl = no``: Pressure coupling is not applied. A full explanation of the parameters used can be found in the GROMACS `manual `__, in addition to the comments provided. Now, we’re good to run. .. code:: ipython3 !gmx grompp -f input/nvt-charmm.mdp -c em.gro -r em.gro -p topol.top -o nvt.tpr !gmx mdrun -ntmpi 1 -v -deffnm nvt Your computer should be working at full speed now to finish this simulation; it should be done in 5 to 10 minutes. *Note: if you do not want to wait, but look at some of the results directly, copy the data from the /reference directory into the current directory. To do this from within this notebook, remove the comment characters (#) in the following cell* .. code:: ipython3 ## ONLY execute the lines below if you do not want to run and wait for the simulation to finish #!cp reference/nvt_charmm.edr nvt.edr #!cp reference/nvt_charmm.gro nvt.gro #!cp reference/nvt_charmm.cpt nvt.cpt Let’s analyze the temperature progression, again using gmx energy: .. code:: ipython3 !echo "Temperature" | gmx energy -f nvt.edr -o temperature.xvg -xvg none -b 20 .. code:: ipython3 import pandas as pd df = pd.read_csv('temperature.xvg', sep='\s+', header=None, names=['time','temperature']) df.plot('time') Or for alterantive visualization, remove the comment characters (#) in the following cells: .. code:: ipython3 #!echo "Temperature" | gmx energy -f nvt.edr -o temperature.xvg -b 20 .. code:: ipython3 #!xmgrace temperature.xvg From the plot, it is clear that the temperature of the system quickly reaches the target value (300 K), and remains stable over the remainder of the equilibration. For this system, an equilibration period (on the order of 50 ps) may be adequate. Equilibration run - pressure ============================ The previous step, NVT equilibration, stabilized the temperature of the system. Prior to data collection, we must also stabilize the pressure (and thus also the density) of the system. Equilibration of pressure is conducted under an NPT ensemble, wherein the Number of particles, Pressure, and Temperature are all constant. The ensemble is also called the “isothermal-isobaric” ensemble, and most closely resembles experimental conditions. The .mdp file used for a 100-ps NPT equilibration can be found here. It is not drastically different from the parameter file used for NVT equilibration. Note the addition of the pressure coupling section. Berendsen barostat is used for this equilibation phase. .. code:: ipython3 !cat input/npt-charmm.mdp A few other changes: - ``continuation = yes``: We are continuing the simulation from the NVT equilibration phase - ``gen_vel = no``: Velocities are read from the trajectory (see below) We will call grompp and mdrun just as we did for NVT equilibration. Note that we are now including the -t flag to include the checkpoint file from the NVT equilibration; this file contains all the necessary state variables to continue our simulation. To conserve the velocities produced during NVT, we must include the final coordinate file (output) of the NVT simulation using the option (-c). .. code:: ipython3 !gmx grompp -f input/npt-charmm.mdp -c nvt.gro -r nvt.gro -t nvt.cpt -p topol.top -o npt.tpr Now we’ll start an MD simulation just like before - this one will take a few minutes; that’s why it is commented out. Go ahead run it if you want to generate your own data. .. code:: ipython3 !gmx mdrun -ntmpi 1 -v -deffnm npt If the simulation takes too long to run for this tutorial, you may use the provided data by copying it to the current directory. But remember that you can always check/analyse the output files while the simulation is still running. .. code:: ipython3 ##ONLY execute the lines below if you do not want to run and wait for the simulation to finish #!cp reference/npt_charmm.edr npt.edr #!cp reference/npt_charmm.gro npt.gro #!cp reference/npt_charmm.cpt npt.cpt Let’s analyze the pressure progression, again using energy: .. code:: ipython3 !echo "Pressure" | gmx energy -f npt.edr -o pressure.xvg -xvg none .. code:: ipython3 import pandas as pd df = pd.read_csv('pressure.xvg', sep='\s+', header=None, names=['time','pressure']) df.plot('time') Or for alterantive visualization, remove the comment characters (#) in the following cells: .. code:: ipython3 #!echo "Pressure" | gmx energy -f npt.edr -o pressure.xvg .. code:: ipython3 #!xmgrace pressure.xvg The pressure value fluctuates widely over the course of the 100-ps equilibration phase, but this behavior is not unexpected. The reference pressure was set to 1 bar, so is this outcome acceptable? Pressure is a quantity that fluctuates widely over the course of an MD simulation, as is clear from the large root-mean-square fluctuation (in the order of 100 bar), so statistically speaking, one cannot distinguish a difference between the obtained average and the target/reference value (1 bar). Let’s take a look at density as well using energy. .. code:: ipython3 !echo "Density" | gmx energy -f npt.edr -o density.xvg -xvg none .. code:: ipython3 import pandas as pd df = pd.read_csv('density.xvg', sep='\s+', header=None, names=['time','density']) df.plot('time') Or for alterantive visualization, remove the comment characters (#) in the following cells: .. code:: ipython3 #!echo "Density" | gmx energy -f npt.edr -o density.xvg .. code:: ipython3 #!xmgrace density.xvg The average value is close to the experimental value of 1000 kg m-3 and the expected density of the TIP3P model of 1001 kg m-3. The parameters for the TIP3P water model closely replicate experimental values for water. The density values are very stable over time, indicating that the system is well-equilibrated now with respect to pressure and density. Please note: Pressure-related terms are slow to converge, and thus you may have to run NPT equilibration slightly longer than is specified here. The “production” run ==================== Upon completion of the two equilibration phases, the system is now well-equilibrated at the desired temperature and pressure. We are now ready to release the position restraints and run production MD for data collection. The process is just like we have seen before, as we will make use of the checkpoint file (which in this case now contains preserve pressure coupling information) to grompp. We will run a 1-ns MD simulation. Note we have explictly add a section that controls the output frequency in log file (.log), energy file (.edr), in the trajcotry file (.trr) and the compress trajectory file (.xtc). .. code:: ipython3 !cat input/md-charmm.mdp Here, we use velocity-rescaling temperature coupling as thermostat and stochastic cell rescaling as barostat. A full explanation of the available thermostats and barostats in GROMACS can be found in the manual (see `here `__ for thermostat and `here `__ for barostat). Note, for thermostats and barostats we need to compute temperature or pressure. This requires global communication and is currently not done on GPU. To reduce these computational cost, `nsttcouple `__ and `nstpcouple `__ (frequency for coupling temperature/pressure) are set to 100 by default in GROMACS 2023 (see the webinar `What’s new in GROMACS2023 `__. The recommendation is to use tau_t = 1 ps for V-rescale and tau_p = 5 ps for C-rescale. ``gmx grompp`` will print an estimate for generated data and PME load. PME load will dictate how many processors should be dedicated to the PME calculation, and how many for the PP (everything except for PME) calculations. Refer to the `manual `__ for details. .. code:: ipython3 !gmx grompp -f input/md-charmm.mdp -c npt.gro -t npt.cpt -p topol.top -o md.tpr .. code:: ipython3 !gmx mdrun -ntmpi 1 -v -deffnm md As before, this run will take some time - probably more than planned for going through this tutorial. As an alternative, use the data provided here .. code:: ipython3 ##ONLY execute the lines below if you do not want to run and wait for the simulation to finish #!cp reference/md_charmm.log md.log #!cp reference/md_charmm.edr md.edr #!cp reference/md_charmm.gro md.gro #!cp reference/md_charmm.xtc md.xtc Analysis ======== Now that we have simulated our protein, we should run some analysis on the system. What types of data are important? This is an important question to ask before running the simulation, so you should have some ideas about the types of data you will want to collect in your own systems. For this tutorial, a few basic tools will be introduced. The first tool is ```gmx trjconv`` `__, which is used as a post-processing tool to strip out coordinates, correct for periodicity, or manually alter the trajectory (time units, frame frequency, etc). Here you find a suggested workflow for ``gmx trjconv`` `link `__. This exercise, we will use ``gmx trjconv`` to account for any periodicity in the system. The protein will diffuse through the unit cell, and may appear “broken” or may “jump” across to the other side of the box. To account for such actions, issue the following: .. code:: ipython3 !printf "1\n1\n" | gmx trjconv -s md.tpr -f md.xtc -o md_center.xtc -center -pbc mol Select 1 (“Protein”) as the group to be centered and 1 (“Protein”) for output. We will conduct all our analyses on this “corrected” trajectory. Let’s look at so obtained trajcetory using nglview. .. code:: ipython3 import nglview as ng import mdtraj as md traj = md.load("md_center.xtc", top="1fjs_newbox.gro") view = ng.show_mdtraj(traj) view In alternative, remove the comment character (#) to use VMD .. code:: ipython3 #!vmd 1fjs_newbox.gro md_center.xtc Note, the protein should never interact with its periodic image (minimum image convention), otherwise the forces calculated will be spurious. To calculate the distance between the protein and its periodic image, we use the tool ```gmx mindist`` `__ with the option -pi. .. code:: ipython3 !printf "1\n" | gmx mindist -s md.tpr -f md_center.xtc -pi -od mindist.xvg #!xmgrace mindist.xvg The distance between the protein and its periodic image should not be smaller than the cut-off used to describe non-bonded interactions. Now let’s look at structural stability. GROMACS has a built-in utility for RMSD calculations called ```gmx rms`` `__. To calculate RMSD relative to the crystal structure, issue this command: .. code:: ipython3 !printf "4\n1\n" | gmx rms -s em.tpr -f md_center.xtc -o rmsd_xray.xvg -tu ns -xvg none Choose 4 (“Backbone”) for both the least-squares fit and the group for RMSD calculation. The -tu flag will output the results in terms of ns, even though the trajectory was written in ps. This is done for clarity of the output (especially if you have a long simulation - 1ns does not look as nice as 100 ns). The output plot will show the RMSD relative to the structure present in the original pdb file: .. code:: ipython3 import pandas as pd df = pd.read_csv('rmsd_xray.xvg', sep='\s+', header=None, names=['time','RMSD']) df.plot('time') Or for alterantive visualization, remove the comment characters (#) in the following cells: .. code:: ipython3 #!printf "4\n1\n" | gmx rms -s em.tpr -f md_center.xtc -o rmsd_xray.xvg -tu ns .. code:: ipython3 #!xmgrace rmsd_xray.xvg The time series shows the RMSD levels off to ~0.15 nm (1.5 Å), indicating that the structure is stable. The option ``-h`` provides help information for GROMACS tool. .. code:: ipython3 !gmx rms -h Radius of gyration ------------------ The radius of gyration of a protein is a measure of its compactness. If a protein is stably folded, it will likely maintain a relatively steady value of Rg. If a protein unfolds, its Rg will change over time. Let’s analyze the radius of gyration for the protein in our simulation using GROMACS ```gmx gyrate`` `__ tool: .. code:: ipython3 !echo "1" | gmx gyrate -f md_center.xtc -s md.tpr -o gyrate.xvg -xvg none .. code:: ipython3 import pandas as pd df = pd.read_csv('gyrate.xvg', sep='\s+', header=None, names=['time','Rg'], usecols=[0, 1]) df.plot('time') Or for alterantive visualization, remove the comment characters (#) in the following cells: .. code:: ipython3 #!echo "1" | gmx gyrate -f md_center.xtc -s md.tpr -o gyrate.xvg .. code:: ipython3 #!xmgrace gyrate.xvg We can see from the reasonably invariant Rg values that the protein remains very stable, in its compact (folded) form over the course of 1 ns at 300 K. This result is not unexpected, but illustrates an advanced capacity of GROMACS analysis that comes built-in. Index file ---------- Index groups are necessary for almost every GROMACS tools. All GROMACS tools can generate default index groups. If one needs special index groups, he/she can use gmx `gmx make_ndx `__ to generate an index file (ndx). For example the command ``splitch 1`` splits the group 1 (Protein) in chains and the command ``q`` close the tool. .. code:: ipython3 !printf "splitch 1\nq\n" | gmx make_ndx -f nvt.tpr -o Now we can calculate the hydrogen bonds between the two protein chains using the tool `gmx hbond `__. The option -num provide the number of hydrogen bond as a function of time. .. code:: ipython3 !printf "17\n18\n"| gmx hbond -f md.xtc -s md.tpr -n index.ndx -num -xvg none .. code:: ipython3 import pandas as pd df = pd.read_csv('hbnum.xvg', sep='\s+', header=None, names=['time','H-bonds'], usecols=[0, 1]) df.plot('time') Or for alterantive visualization, remove the comment characters (#) in the following cells: .. code:: ipython3 #!printf "17\n18\n"| gmx hbond -f md.xtc -s md.tpr -n index.ndx -num .. code:: ipython3 #!xmgrace hbnum.xvg Report time ----------- Now that we have runned the simulation, we should report or recall about what type of simulation we have per formed. The GROMACS tool (``gmx report-methods``)[https://manual.gromacs.org/current/onlinehelp/gmx-report-methods. html] can be useful for this purpose. gmx report-methods print out basic system information on a performed run. It can also provide an unformatted text (with the option -o) or a LaTeX formatted output file with the option -m. Let’s try it. .. code:: ipython3 !gmx report-methods -s md.tpr Do you have any questions? Have a look at the user discussions on `GROMACS forums `__ ----------------------------------------------------------------------------------------------------------------- Excursion: Plotting gromacs data with “pandas” in python and notebooks ---------------------------------------------------------------------- To analyse or visualize simulation data in python or jupyter notebooks, we can output a simplified xvg format from gmx-analysis tools with the option ``-xvg none`` .. code:: ipython3 !echo "12\n0\n" | gmx energy -f em.edr -o potential.xvg -xvg none And here is the plot.. .. code:: ipython3 import pandas as pd df = pd.read_csv('potential.xvg', sep='\s+', header=None, names=['time','energy']) df.plot('time')