Public apology to CCDC

My previous post about errors in crystal structures has triggered strong reactions from CCDC (not only a response post and a direct email, but also an email to my former PhD supervisor in the UK asking him for remedy and explanation). Apparently, they have interpreted my post as an attack on the quality of their services. Let me clarify first that I never intended to imply anything negative or derogatory about the CCDC services or software. My sincere apologies if my post came across that way. All I wanted to do was raise awareness in the docking/scoring community that small-molecule crystallographic data is not free of errors. My understanding is that the data deposited in the CSD has been determined by thousands of people all over the world and published in various scientific journals, while CCDC aggregates the data and creates a comprehensive, validated and value-added database known as the Cambridge Structural Database (CSD). The complete CSD System (CSDS) includes the CSD itself and associated software for search, visualisation and analysis of the stored information. I acknowledge that CCDC provides a valuable service to the community and that any error in the data is not their fault.

They have also sent us a “friendly reminder” that since our license to CSD has expired, according to the signed agreement we are not allowed to retain or use any data downloaded from CSD, not even any derived information or data. As I already stated in the update added to the previous blog entry, we have ceased using any data derived from CSD to comply with the license. I have even removed the image of the molecule from the post (since that can also be considered derived data). We have not incorporated any data into our software. As I mentioned in the previous post, we intended to improve our scoring function with statistics collected from CSD (while we had the license during 2007), but it did not prove to be useful. We therefore abandoned that approach and continue to use publicly available PDB crystal structure data, which has been used for all released versions of the software. For this reason we have not renewed our (rather pricey) license for 2008.
One lesson I learned from this exchange is the importance of Open Data for scientific advancement (some scientists believe that research data must be free), such as the data available from CrystalEye. When even non-profit organizations (registered as charities) use draconian license agreements to protect data created and published by others, fully commercial entities (like pharmaceutical companies) must be guarding their own data even more strongly. It is difficult to make scientific progress if a single blog mention of an error in a data entry invites the wrath of the company that sells services on the data. It is ironic that the links expressing the need for open data and pointing to the open repository happen to lead to a web site within the same University where CCDC resides.

6 Responses to “Public apology to CCDC”

  1. Vladimir Says:

    It’s terrible to hear! Shame on CSD!

    I once heard a story from my chief that many molecules in the CSD were determined under the supervision of one of the great Russian scientists from the […] So my next question was: what is the legal status of the data? After all, the data was derived from literature published by the corresponding author. My chief said: that is the property of the CSD. I was greatly amazed.

    A closely related question arises for QSAR datasets, e.g.

    Syracuse Research, PHYSPROP database - if you buy it you can’t use it for QSAR research (you can’t sell derived QSAR models). It might be an example of greedy data owners :)

  2. ChemSpider Blog » Blog Archive » When a Scientific Blog Posting, Data Licensing and Open Data Access Come Together Says:

    […] Those of you who have been watching the discussion between myself and ACS over the past few months will know I have been trying to get confirmation that “supplementary data” are Open Data and that we could scrape the CIFs if we chose to…it’s a MANY month conversation at this point. The Unilever School at Cambridge, via Nick Day’s work, has generated CrystalEye and, after many conversations, we were provided the data source and have it on ChemSpider now. We are awaiting constructive feedback from Nick and Peter Murray-Rust regarding our implementation of their data on our site. This is especially important when there are licensing issues as appear to have been enforced on SimBioSys, evidenced by this Public Apology to CCDC. Read the post for details. […]

  3. zsolt Says:

    I am glad to see that CrystalEye is not the only competition offering open, free access to crystallographic data, but such data is now also available on ChemSpider:

    ChemSpider has a community-based curation process, so anybody who spots problems or errors can contribute to the correction and improvement of the data. This is a very powerful process, similar to the Open Source development model of Linux, Apache, Mozilla and OpenOffice. I hope ChemSpider will prove to be just as successful and that the data quality will surpass the closed, restricted, proprietary alternative. We will definitely consider it as a source of data for future efforts to improve our scoring and prediction models, and we will contribute to the curation with any errors we may find.


  4. Jim Downing Says:

    ChemSpider is a great aggregator and data overlay of chemistry information, and it’s great that they’ve indexed CrystalEye. CrystalEye has its own (rudimentary as yet) community-based curation process; we want the CrystalEye data to be as error-free as possible. You’ve drawn my attention to the fact that we haven’t yet got a mechanism for exchanging curation information between the two. Hopefully we’ll be able to work on this over the summer.

    I’d also draw your attention to eCrystals: We’re part of the eCrystals3 project, aiming to establish a federation of open crystallographic data.

    eCrystals federation project:

  5. Unilever Centre for Molecular Informatics, Cambridge - petermr’s blog » Blog Archive » Update and emphasis on publishing and Crystallography Says:

    […] SimBioSys, and others have blogged about the availability of data from the CCDC (post). This raises some extremely serious points about the closing of data meant to be public. […]

  6. will Says:

    The tools being developed by SimBioSys, once fully developed, will make the curation of complex chemical data possible for many scientists, not just a few.

    Imagine the possibilities of combining Chemical OCR software with a Google-type indexer (another technology that has recently become usable by those with few resources).

    Even worse (or better in my opinion), these technologies threaten to expose the errors in traditional databases which have prided themselves on high quality but have not been sufficiently independently scrutinised for mistakes due to their closed nature, and software is heartless in reporting these once the doors are open.
