Many of you involved in structure-based drug discovery will be well aware of the numerous problems and errors in the data found in the Protein Data Bank (PDB), especially concerning the ligand structures. There have been many publications about such errors, e.g. Jones et al., J. Mol. Biol. (1997) 267:727, and I have heard various conference presentations on the topic too, e.g. the talk by Gerard Kleywegt (University of Uppsala) titled “Protein crystallography: not as simple as ABC then?” at the eChemInfo meeting at Bryn Mawr, Philadelphia (15-19 October 2007). The errors are often blamed on the low resolution of the structures, which involve large proteins (often thousands of atoms). One would assume that the small-molecule crystal structures of the Cambridge Structural Database (CSD) do not have such errors, since they have much higher resolution and deal with small molecules. Let me correct that wrong assumption!
The scoring function of our eHiTS docking software relies on statistics of interaction patterns. Earlier, we collected such statistics from thousands of PDB files, also considering the Gaussian distributions of the atom coordinates based on the given temperature factors to account for the uncertainty in the data. In the past year we collected statistics from the CSD in the hope of improving the accuracy of our scoring function by using more reliable, more precise data. Unfortunately, we had to learn the hard way that the CSD isn’t so clean either. We found a lot of obvious errors, such as atom centers falling within 0.2 Angstrom or less of each other when the crystal packing transformations are applied, and some completely impossible bond lengths and angles. We kept adding sanity checks to report and exclude data entries with various obvious errors. At the end of the automated cleaning process, almost 15% of the data had been dropped for one reason or another. We then assumed that the remaining data was good and that we could use it for collecting the statistics.
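To illustrate the kind of sanity check described above, here is a minimal Python sketch of a close-contact filter. It is not the actual eHiTS code. It assumes the crystal packing transformations have already been applied, so that each atom is given as a label plus Cartesian coordinates in Angstrom; the function name `find_close_contacts` and the sample coordinates are made up for the example.

```python
from itertools import combinations
from math import dist


def find_close_contacts(atoms, min_dist=0.2):
    """Return (label_a, label_b, distance) for every atom pair whose
    centers lie within min_dist Angstrom of each other.

    atoms: list of (label, (x, y, z)) tuples in Cartesian Angstrom,
    assumed to be already expanded by the crystal symmetry operations.
    """
    clashes = []
    for (label_a, xyz_a), (label_b, xyz_b) in combinations(atoms, 2):
        d = dist(xyz_a, xyz_b)
        if d <= min_dist:
            clashes.append((label_a, label_b, d))
    return clashes


# Hypothetical data: O1' is a symmetry copy landing almost on top of O1.
atoms = [
    ("C1", (0.00, 0.00, 0.00)),
    ("O1", (1.25, 0.00, 0.00)),
    ("O1'", (1.30, 0.10, 0.00)),
]
clashes = find_close_contacts(atoms)
for label_a, label_b, d in clashes:
    print(f"{label_a} and {label_b} are only {d:.3f} A apart")
```

The pairwise loop is quadratic in the number of atoms, which is fine for small-molecule entries; an entry that reports any clash would simply be excluded from the statistics.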
Now the refined scoring function is nearing completion and we are running various tests. One of the tests was to compute the internal strain energy of various ligand structures, which involves minimizing conformations from a systematically sampled set to identify the global minima under the new scoring function. This is an important exercise, part of the protein-ligand binding energy estimation problem, as Ashutosh Jogalekar blogged today. Yesterday, one of our developers, Bashir Sadjad, showed me some data he had collected by running these minimizations on a few CSD structures. An intriguing point he raised was that several structures showed very high strain energies that could be resolved with fairly small dihedral changes. Of course, you cannot expect the CSD structures all to be at the global minimum conformation, because interactions in the crystal lattice may force some compromise to reach a better H-bond or other interaction. However, I was expecting them to be at least at or near a local minimum conformation. Then Bashir pulled out one of the worst examples, where the X-ray structure had very high strain energy: CSD code [REDACTED] has two carboxylate groups
as shown in the image [REDACTED]. The original structure from the CSD is displayed with thick bonds and the optimized one with thin bonds. You can see that the optimization has twisted the two carboxylates out of the plane of the aromatic ring to avoid two lone pairs facing each other. When I saw the image I immediately said: this must be a protonation error, because it looks to me as if, were one of the carboxylates protonated towards the other, you would have a good H-bond between the two instead of a bad clash. Are there H atoms in the original? The original non-twisted conformation would make perfect sense in that case, but if both carboxylates carry lone pairs with negative charge, then it is very likely that they would twist out of plane to avoid each other. Bashir said the structure did have H atoms, but NOT on the carboxylates: each oxygen appears deprotonated. OK, then I do not get it; there must be an error, I thought. Even with the N in the ring protonated, the two carboxylates cannot both be deprotonated, because the whole structure would have a -1 formal charge, which is impossible in a crystal without a counterion. There is no salt in the lattice to counterbalance the charge, so the molecule must be neutral to form the crystal.
Today, Bashir came back with the explanation of the puzzle. He said:
This case was really annoying and I could not convince myself that it is only our scoring function that assigns the huge score to it, so I looked at the publication for this. Just looking at the figures, they all have a hydrogen between the two oxygens. In fact the title talks about C7H5NO4 while there are only 4 hydrogens in the original mol2 file! Finally, I looked carefully at the CIF file and in a ‘special_details’ part it says:
“H5 bridges O2 and O3 with almost equal distances. H5 is not retained”
So, actually, it is not our scoring function that is wrong but the CSD entry!
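A cheap automated check could flag this kind of discrepancy before any scoring is done: compare the element counts in the published formula with the atoms actually present in the coordinate file. The following is a minimal Python sketch, not part of eHiTS. It assumes a simple formula without parentheses or charges, and that the element symbols have already been read from the coordinate file; the function names `parse_formula` and `formula_mismatch` are hypothetical.

```python
import re
from collections import Counter


def parse_formula(formula):
    """Count elements in a simple formula such as 'C7H5NO4'
    (no parentheses, hydrates, or charges)."""
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts


def formula_mismatch(formula, atom_elements):
    """Return {element: expected - found} for every element whose count
    in the coordinate list differs from the formula."""
    expected = parse_formula(formula)
    found = Counter(atom_elements)
    return {
        element: expected[element] - found[element]
        for element in set(expected) | set(found)
        if expected[element] != found[element]
    }


# The case described above: the formula lists 5 hydrogens,
# but the coordinate file only contains 4.
elements_in_file = ["C"] * 7 + ["H"] * 4 + ["N"] + ["O"] * 4
mismatch = formula_mismatch("C7H5NO4", elements_in_file)
print(mismatch)
```

A positive value means atoms are missing from the coordinates, a negative value means there are more than the formula allows; either way the entry deserves a closer look.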
So, the moral of the story: we can’t even trust the high-resolution CSD data, let alone the PDB.
Since the posting of this blog entry, we have received 2 public comments (displayed in the standard way, like all comments, by the WordPress blog software) and some private emails originating from CCDC. One of the complaints from CCDC was that the second comment, which explains the problems and puts the blame on my naivety and my wrong expectations about the data, was not displayed as prominently as the original article. To correct this problem, I will quote the entire comment here:
These comments are interesting - not because they reveal anything that a small molecule crystallographer doesn’t know: More because they reveal that modellers have expectations of information that are a bit naive.
I’m a long time user of the CSD and, as a small molecule crystallographer, I understand the caveats behind crystallographic data. There are errors in the published crystal structures, and not all of them will get spotted during peer review or data curation. CSD users would be well advised to try to understand them and factor them in to their work. I particularly turn your attention to Points 3 and 4 below …
H-positions are sometimes hard to resolve in small molecule studies, and need to be treated warily in crystal structures. Ok - the entry QUICNA01 is a neutron study, so one would expect them to be better, but disordering is an issue.
One should always look at both the 2D and 3D structural information when working with crystal structures. If you look at the 2D representation in the CSD for QUICNA01 it is correct.
Undiagnosed disorder/symmetry can lead to problems: There are structural studies in the CSD where the crystallographer has missed a disorder, or missed some symmetry. AACRUB is an example of missed symmetry - and when you look at the study you see rather dubious bond lengths and angles, due to correlations in the refinement co-variance matrix.
Quite often, when this sort of thing happens, a later study will then correct the error: see AACRUB01 in this case.
Missed dis-orderings and symmetry are hard to spot, note: This is by far the most likely thing to trip up a modeller who ‘just wants the coordinates’.
Newer structural studies are more likely to be more reliable than older studies due to enormous improvements in equipment and software to undertake the studies. I think, in the case of QUICNA01 this is very pertinent. The structure was published in 1974 …. Ok - if this is the only structure then you may have to use it but ….
If there are several similar studies of a structure, they end up in a CSD refcode family. In the case of QUICNA01 you also have some later studies - namely QUICNA02 QUICNA03 QUICNA10 QUICNA11 QUICNA12 QUICNA13
QUICNA10, QUICNA11, QUICNA12 and QUICNA13 are all later studies of the structure, and they *all* have the proton to which you refer, since they are ‘deuterated’ compounds which will resolve better in neutron studies.
Now - you might quite reasonably say ‘but how do I know which one to pick?’ - There is this study
Though admittedly for QUICNA note that the choice is inconclusive based on the 4 lists given: I think the hydrogen list may not account for deuteration.
The other main point raised was that our CCDC license has expired since the data collection was made, and therefore we can no longer use any data, even derived data, from the CSD. We will certainly fully obey this cease and desist order and will not use any of the data. We have not made any publications containing data from the CSD except for this blog entry (and I have now removed the code name and the image to comply with the order), and none of the released versions of our software contains such data either. By the way, the data did not help us improve the scoring function anyway, partly because similar errors occur in the data as in the PDB, and because the PDB data is more relevant to docking, given the constraints present in the protein environment.
In my personal opinion, such restrictions on the use of scientific facts do not make much sense. As the IUCr position paper explains: “There is a long-standing acceptance within crystallography of the principle that such primary data sets should be freely available for sharing and re-use (with appropriate credit) within the structural science community.” Also, the FAQ on the CrystalEye site explains: “As this supplementary data is a set of facts and is not part of the article full-text it does not fall under the copyright, and it should therefore be free to both view and download”. Nevertheless, CCDC has the legal right to stop us from using the data, since we signed a licensing agreement containing such conditions. That was a mistake on our part, one that we have to live with now. Let this case be a warning to others who have not yet made the mistake of signing such a draconian agreement.