Correct protonation state for docking

In the previous post I briefly touched the question of protonation in docking and promised to come back with more details at a later time. Well, it turns out I could not go to sleep tonight until I get this out of me :-) So here it comes…

Several of the protein residues and many of the ligand functional groups may exist in different protonation states. In a docking study, the protonation states of the protein residues in the active site and of the functional groups of the ligand are very important. Most of the PDB structures obtained from X-ray crystallography do not contain the protons corresponding to the Hydrogen atoms because the resolution of the X-ray data is not fine enough to “see” the protons in the electron density map. On the other hand, most docking software requires the input files to be protonated. Therefore, some software technique (e.g. modeling packages) have to be used to add the protons. The ligand structures are often taken from a database, that may store the structures in a very compact form, e.g. SMILES string, and then protons are added again by some software prior to docking. Many simple software tools will generate the neutral form of the small molecules if possible, e.g. see the picture of aspartic acidAspartic acid on ChemSpider or Wikipedia, (although ChemSpider also has the de-protonated charged form — aspartate as a separate entry). Unfortunately, the neutral form is not the best choice for docking as this residue is more likely to be in the negatively charged de-protonated form in most proteins under in vivo conditions. But, please do notice the wording I used : “more likely” and “most”. Of course, there are lots of exceptions. Due to resonance, the role of the two oxygen atoms of the carboxylic acid are also symmetrical, so the H can be on either of them and in 3D on 2 possible positions on each oxygen.

OK, you say, enough of preaching to the choir, of course you are always using protonation states appropriate for physiologycal pH, so you are fine. Unfortunately, it is not that simple. There are many examples, e.g. metalloenzymes that will require the use of different protonation states, see the paper cited in the previous post. Thanks to the ZINC team, now you have a chance to download ligands in a protonation state specific to certain pH values. Some modeling software will also allow you to protonate your protein according to a given pH value. Hurray, you say, now we are good to go we just have to decide the appropriate pH value for the given study and we are all set. No, I am afraid, it is more complicated than that. Let’s look at an example 3D aspartates in HIV-1snapshot from the center of the active site of HIV-1 protease. You see two carboxylates two Asp residues nicely facing each other at very close contact distance between the oxygens. The only way this is possible if there is a proton between them making a nice Hydrogen-bond, otherwise the two lone pairs of the oxygens (with strong partial negative charges on both) would be VERY uncomfortable to say the least. So, one of them is protonated, the other isn’t. Uh, oh! This isn’t good for the pH rule. Whatever value you chose, it fails, because we have the same residue behaving differently right next to each other.

There are even nastier cases. Let’s look at the Serine protease family: trypsin, thrombin, factor Xa. The key Ser-195 residue in the active site attacks the carbon of a peptide, cleaving the amide bond, because its alcohol group has “lost” the proton — now an alcohol group is not such a horribly strong acid to lose its proton so easily, so there must be some really basic conditions in that active site — one might assume. Well, not really, because the natural substrate is the Arginine side chain which must be in a protonated form to be attracted to the negatively charged Asp-189 at the bottom of the pocket (that long-range attraction “lures” the Arg into the pocket for the perfect position for the Ser-195 to attack the peptide bond at the neck of the residue. The key to this mystery is the Asp-102, His-57, Ser-195 catalytic triad. The mechanism is nicely explained here. The lesson to be learned regarding protonation states is that the local environment is the key. Global notions like pH only work in a nice uniform water solution with millions of copies of a given ligand and some specific ions floating around. In the binding site of a protein there are lots of local constraints that determine the interactions that occur.

Let’s pay a bit more attention to the middle player of that triad: the Histidine side chain. This residue on its own is another reason why pH is not sufficient to determine the protonation states. The two Nitrogen atoms of the aromatic ring are usually drawn with different connectivity (one has a double bond the other does not), but in fact due to the resonance of the aromatic ring, the role of the two is completely interchangeable (due to two different tautomer forms). At physiologycal pH one of them should be protonated (H-bond donor) the other should have a lone electron pair (H-bond acceptor) sticking out. However, which is which cannot be decided based on pH. Yet, such decisions are crucial for recognizing (scoring) the correct binding pose of a ligand. Sometimes, the intra-molecular H-bonding possibilities within the receptor can help to decide the correct state, e.g. in the case of the above mentioned catalytic triad (the Asp-102 dictates that the H must be on that side of the His-57).

Unfortunately, even full analysis of the receptor site isn’t enough in some cases. There are situations where some receptor residues change protonation states depending on which ligand binds to it, e.g. HIV-1 protease see this paper. Of course, the ligand protonation state may also be altered by the active site upon binding. There is no such thing as “correct” protonation state for a given receptor site without considering the ligand and also there is no correct protonation state for a ligand either without taking into account its interactions in the active site. This means, that the only correct treatment of protonation states is to leave the question open until the end of the docking process and make the decision separately for each result docking pose, e.g. using this method for the full complex. Of course, changing protonation states isn’t always free, usually it involves an energetic change, e.g. a carboxylate “prefers” to be de-protonated (lower energy form) — exceptions to this are the Histidine flipping the H between the two N, and a protonated carboxylate moving around the proton in the 4 possible positions on the two oxygens. So, a perfect scoring function must be able to consider various protonation states and also consider the energy cost associated with switching between them. The last release version of eHiTS (6.2) does the first part, i.e. it samples the protonation states on the fly for the result poses — independently for each functional group. The second part (energy cost of various protonation states) will be incorporated in the next release coming up this summer.

protonation example

For other docking software that do not perform on the fly protonation changes, the only correct way to execute a screening study is to prepare multiple copies of each ligand enumerating all feasible protonation states (increasing the size of the screening library substantially, 150X for the example on the figure above), and also prepare multiple copies of the receptor with all combination of active site residue protonation states and perform a separate screening batch for each and select the best scoring solutions from all runs. This process raises the CPU time — possibly by several orders of magnitude. So, it is much more efficient to use eHiTS, or better yet eHiTS Lightning ;-)

An update in response to the comments by Andy and Danni:

Let me summarize how I see the process of correct protonation state determination and scoring:

  1. The correct docking pose has to be generated — possibly among several enumerated.
  2. The correct protonation state needs to be determined for each functional group involved — for each pose separately
  3. The energy has to be estimated correctly considering not only the interactions but also the cost of protonation state change relative to the lowest energy form.

Most docking programs will fail already at step 1 if the input receptor/ligand molecules are not in the right protonation state, because they will not put two donors or two acceptors facing each other at close interaction distance — and for those mentioned cases where the protonation dynamically changes upon binding as part of the recognition process, the input molecules will NOT be in the correct protonation state. In contrast eHiTS uses ambivalent flags for indicating the possibility of H/lp (donor/acceptor) variations and will generate all the possible poses — including the correct one.
Step 2 can still be done without QM calculations — it is a discrete optimization problem that can be solved optimally, for example using the method described by Paul Labute at the CCG site. So, now we have multiple poses, each of them with all the involved FG having the appropriate protonation state for the given pose. All this done on the fly in the docking process efficiently. What is missing now from eHiTS is step 3.
Step 3 is what requires the QM calculations to be able to estimate the energy of each pose accurately and select which one is the correct — lowest energy pose. In my opinion what we need to do ahead of time with QM is to compute two tables: the hydrogen bonding strengths between various functional groups and the internal energy difference between different protonation states of a given FG. Using these pre-computed values, we can estimate the total energy of the system. The question is, how big is the error in the estimation due to the use of pre-cooked FG values as opposed to computing the energy of the whole system at a high level of theory ? I do not know the answer to this question,and therefore I am not yet sure how accurate our estimate will be, but soon we will be able to see the results.

6 Responses to “Correct protonation state for docking”

  1. Egon Willighagen Says:

    One can make educated guesses of what the protonation state is, one can also apply some QM-based approach to calculate pKa’s… Clearly, it is important… I don’t think the CDK currently has an algorithm to do this…

    Is there a gold standard; a good training set to set approaches on? As you already said, crystallography is not going to help. Databases with experimental pKa database are often proprietary… Any suggestions?

  2. zsolt Says:

    Egon, yes, you used the right phrase “educated guesses”, even QM methods will do just that if applied in isolation to the ligand and receptor. My point is, there are tough cases, like HIV-1 and kinases that tend to switch protonation states (both ligand and receptor) upon binding, so for a given ligand what is correct for binding to one receptor is incorrect for binding to another. That is why preparation methods (no matter how sophisticated) are not good enough. You need to analyze each binding pose AFTER docking and make the decision then. But if the docking engine does not work with ambivalent states then it will never place the ligand in a colliding (proton-to-proton or lone-pair to lone-pair) pose so you never find the correct pose (unless you run multiple docking with all possibilities enumerated - very expensive).

    As for standard benchmark to test on, use the DUD dataset:
    Of course, that will still not tell you the right protonation state for each pose, but at least you know which ligands are supposed to bind and which ones (although similar) are not.
    It would be very good to have a test set of PDB complex structures with correctly assigned protonations for some HIV-1 and kinase series. I do not know about the existence of such (PDB-bind has structures and binding affinities but the protonation states were not treated well).
    One set I know that has been at least eye-balled is the Astex diverse set:
    Although it is not focused on the protonation problem.

  3. Andy Good Says:

    Agree completely. This is further exemplified by extending the disuccsion on serine proteases and the catalytic triad histidine. Not only are multiple neutral states possible, in the case of HCV protease the histidine is capable of protonating to form a salt bridge with an acidic ligand. This speaks further to your point regarding protein protonation states being influenced on the fly by different ligands.
    Whether one can predict these states on the fly is open to debate. Perhaps it would be possible to probe each ambiguous residue with acidic / neutral / basic groups using QM or the like to determine ahead of time which states are energetically acessible before the on the fly screen process begins. It seems unlikely that a purely empircal approach without such an initial callibration would be very accurarate, though I stand ready to be corrected :-)

  4. Danni Harris Says:

    I agree with all above and echo Andy’s concerns about the problem complexity”

    In addition to the complexity of deducing the ‘correct’ complementary protonation states of ligand and polar/acidic amino acid sidechains in the binding site cavity one has in some instances:

    1)variable conformations of acidic/polar residues in the substrate/inhibitor binding site (e.g. E297 in CYP2C5 crystal structures : 1dt6, 1n6b, 1nr6) and variable water occupancies proximate to those acid residues,

    2) connectivity of the ligand and surrounding polar/acidic binding side residues to bulk solvent water via extended hydrogen bond networks (sometimes low-barrier hydrogen bond networks),

    3) variable electrostatic influence of enzymatically active species–> that differ with stage in the enzymatic cycle (e.g. P450s).

    Bottom line: Even when you can frame it with QM/MM and include environment you have to frame the issue for the correct structure —(amino acids + H2O’s) and this may vary with both ligand pose and ligand identity. So doing what you are doing ‘on the fly’ ZZ is a big step. The next step ‘filtering’ based on thermodynamic/energetic criteria will be a significant challenge.

    That being said—let’s keep struggling with the problem! :)

  5. zsolt Says:

    Andy and Danni,
    I have added an update to the post in response to your comments. Bottomline is, we can generate the right poses and the right protonation states in eHiTS already — among other (incorrect) poses with their protonation states. What we need is an accurate QM-supported scoring and ranking of the solutions — we are working on this, stay tuned…


  6. ChemSpider Blog » Blog Archive » Protonation States and Their Effect on Docking Says:

    […] Zsolt came searching ChemSpider for the amino acid structures and found the complexities in terms of charged/stereo forms. But his postinf regarding the “Correct Protonation State for Docking” was an education in itself. If you are engaged in docking experiments at all this is likely a must read. For the rest of us neophytes it’s education! addthis_url = ‘’; addthis_title = ‘Protonation+States+and+Their+Effect+on+Docking’; addthis_pub = ‘’; […]

Leave a Reply