Aggregated and Integrated Data Sets


Many datasets are the product of aggregating data from multiple sources, often maintained by multiple parties. For example, the Open PHACTS platform provides integrated access to over 10 different databases. These databases, such as ChEMBL and UniProt, are themselves amalgamations of data extracted from other sources, such as the literature, both automatically and through human curation. They in turn rely on data models and ontologies developed by still others, for example the GO or ChEBI ontologies. Additionally, integrators may make small changes to datasets, such as converting formats or adding links.

  • Issue 1: The provenance or credit chain of a single answer given by a data integration platform can be much bigger than the answer itself. How do we correctly ensure credit is given to all actors in the system? Furthermore, how do we ensure that these chains can be effectively traced back?
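
To make the size of such a credit chain concrete, the following is a minimal sketch that walks a "derived from" graph and lists every upstream source behind one answer. The edges in `DERIVED_FROM` and the function name `credit_chain` are illustrative assumptions, not an accurate map of which database depends on which source.

```python
from collections import deque

# Hypothetical "derived from" edges, for illustration only:
# each key lists the sources it draws on.
DERIVED_FROM = {
    "answer": ["Open PHACTS"],
    "Open PHACTS": ["ChEMBL", "UniProt"],
    "ChEMBL": ["literature", "ChEBI"],
    "UniProt": ["literature", "GO"],
}


def credit_chain(start, derived_from):
    """Return every upstream source reachable from `start`, breadth-first."""
    seen = []
    queue = deque(derived_from.get(start, []))
    while queue:
        source = queue.popleft()
        if source in seen:
            continue  # already credited via another path
        seen.append(source)
        queue.extend(derived_from.get(source, []))
    return seen


chain = credit_chain("answer", DERIVED_FROM)
print(chain)
```

Even in this toy graph, a single answer leads back to six distinct actors, which suggests why the chain can outgrow the answer it describes.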

An additional aspect of such integrations is that they are all in constant flux. UniProt is released every 4 weeks, ChEMBL is released quarterly, and some sources, such as SureChEMBL, are updated hourly. How does a data integrator appropriately capture and expose this information? Currently, this is often done by providing versioned data dumps. However, this may not be allowed or supported in many cases due to licensing, technical, or policy issues.

  • Issue 2: How do we cite developing data sets originating from multiple sources?
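
One possible way to approach this, sketched here under stated assumptions, is to record which release of each source an integrated result was built from and to derive the citation from that record. The `SourceVersion` type, the `cite_snapshot` function, the release labels, and the dates below are all hypothetical placeholders, not real release identifiers or an established citation format.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class SourceVersion:
    name: str
    release: str      # placeholder release label, not a real identifier
    retrieved: date   # when the integrator pulled this release


def cite_snapshot(integrator, sources):
    """Build one citation string that names each source and its release."""
    parts = [
        f"{s.name} ({s.release}, retrieved {s.retrieved.isoformat()})"
        for s in sorted(sources, key=lambda s: s.name)
    ]
    return f"{integrator}, integrating: " + "; ".join(parts)


# Placeholder values for illustration only.
snapshot = [
    SourceVersion("UniProt", "release-X", date(2000, 1, 1)),
    SourceVersion("ChEMBL", "release-Y", date(2000, 1, 1)),
]
citation = cite_snapshot("Open PHACTS", snapshot)
print(citation)
```

A record like this does not require redistributing the data itself, so it may remain usable where versioned dumps are ruled out by licensing or policy.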







