Taming the Wild West of Data Management: 4 Tips for Organizing Your Dataset List


Managing a list of datasets as a researcher can be a challenge, especially when it comes to citing them properly. Publications have plenty of bibliography managers, but very few tools currently exist to do the same job for datasets. In these early days of data management, it is hard to keep track of all the datasets you have produced, want to use, or are interested in.

One approach to organizing your datasets is to collect their DOIs from publicly accessible sources such as OpenAIRE, which offers a convenient CSV export. You can then use a tool such as doiclient, a Python tool contributed by Jonathan Barnoud, to retrieve the metadata for this list of DOIs. It relies on the Crosscite citation formatter. From there, you can extract, for instance, a BibTeX bibliography of all your datasets. With such a bibliography, you can then use tools such as pybtex to format the metadata into Markdown or HTML for inclusion on your website.
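As a rough illustration of this workflow, here is a minimal sketch that reads DOIs from a CSV export and requests a BibTeX record for each one through DOI content negotiation (the `Accept: application/x-bibtex` header on `https://doi.org/`). It does not use doiclient's own interface; it only shows the underlying idea with the Python standard library. The column name `DOI` and the function names are assumptions made for this example, so adjust them to match your actual export.

```python
import csv
import io
import urllib.parse
import urllib.request

# Common ways a DOI is written in exports; all are reduced to the bare "10.xxxx/..." form.
DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(raw):
    """Strip a leading resolver URL or 'doi:' prefix and lowercase the DOI.

    DOIs are case-insensitive, so lowercasing makes duplicates easy to spot.
    """
    doi = raw.strip()
    for prefix in DOI_PREFIXES:
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi.lower()


def read_dois(csv_text, column="DOI"):
    """Return unique, normalized DOIs from CSV text, in order of first appearance.

    The column name is an assumption; check the header of your own export.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    dois = []
    for row in reader:
        value = (row.get(column) or "").strip()
        if not value:
            continue  # skip records that have no DOI at all
        doi = normalize_doi(value)
        if doi not in dois:
            dois.append(doi)
    return dois


def fetch_bibtex(doi, timeout=30):
    """Request a BibTeX record for one DOI via content negotiation (needs network access)."""
    url = "https://doi.org/" + urllib.parse.quote(doi, safe="/")
    request = urllib.request.Request(url, headers={"Accept": "application/x-bibtex"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With a file named, say, `export.csv` (a hypothetical name), something like `"\n\n".join(fetch_bibtex(d) for d in read_dois(open("export.csv").read()))` would produce one BibTeX file for the whole list. Records without a DOI are skipped silently here; in practice you may prefer to log them so they can be handled by hand.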

One difficulty you may encounter is with figshare, a popular platform for sharing datasets. Many datasets on figshare do not have their own DOI, only the DOI of the publication they refer to. This makes it hard to process and cite these datasets properly.

It would be great to have data management software with a catalog similar to the ones we have for publications, such as Zotero, but more specific to data. By that I mean, for instance, the ability to update an entry dynamically when a new version of a dataset is released, and to avoid listing different versions of the same dataset as duplicates. Unfortunately, such a tool does not seem to exist yet, but it would certainly be a welcome addition to the data management landscape.
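Until such a tool exists, part of the version problem can be handled by hand. The sketch below assumes figshare-style DOIs, where a specific version is marked by a trailing `.v<N>`, and keeps only the highest version of each dataset. Other repositories mark versions differently, so this rule is an assumption that will not carry over everywhere; the function names and the DOIs in the usage note are made up for illustration.

```python
import re

# Assumption: a versioned DOI ends in ".v<N>" (the figshare convention).
VERSION_SUFFIX = re.compile(r"^(?P<base>.+?)\.v(?P<version>\d+)$", re.IGNORECASE)


def split_version(doi):
    """Split a DOI into (base DOI, version number); unversioned DOIs get version 0."""
    cleaned = doi.strip()
    match = VERSION_SUFFIX.match(cleaned)
    if match:
        return match.group("base").lower(), int(match.group("version"))
    return cleaned.lower(), 0


def latest_versions(dois):
    """Keep one DOI per dataset: the highest version seen, in order of first appearance."""
    latest = {}
    for doi in dois:
        base, version = split_version(doi)
        if base not in latest or version > latest[base][0]:
            latest[base] = (version, doi.strip())
    return [doi for _, doi in latest.values()]
```

For example, given the placeholder list `["10.6084/m9.figshare.0000001.v1", "10.9999/other", "10.6084/m9.figshare.0000001.v3"]`, `latest_versions` keeps the `.v3` entry and the unrelated DOI, and drops `.v1`.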

In the meantime, it is important to do your best to properly cite and organize your datasets. If you have any feedback or suggestions on how to improve this process, please don't hesitate to share them. Here is a link to the new datasets page on my website, where you can see the results of my efforts.

In summary, I recommend that you:

  1. collect and use DOIs for your datasets whenever possible, along with alternative identifiers such as the figshare ID in some cases

  2. automate the processing of your DOIs with existing tools such as doiclient or Crosscite, which let you retrieve, for example, a BibTeX bibliography of your data

  3. use tools such as pybtex to convert the BibTeX to any desired format, including HTML and Markdown

  4. keep a lookout for a data reference management tool that would simplify and streamline these tasks
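For simple cases, steps 2 and 3 can also be approximated without the BibTeX round trip: DOI content negotiation can return an already formatted citation when asked for `text/x-bibliography`, which is the service behind the Crosscite formatter. The sketch below is a stand-in for the pybtex step, not a use of pybtex itself; the function names and the default `apa` style are choices made for this example.

```python
import html
import urllib.parse
import urllib.request


def citation_request(doi, style="apa", locale="en-US"):
    """Build a request asking the DOI resolver for a formatted citation string."""
    url = "https://doi.org/" + urllib.parse.quote(doi, safe="/")
    accept = f"text/x-bibliography; style={style}; locale={locale}"
    return urllib.request.Request(url, headers={"Accept": accept})


def fetch_citation(doi, style="apa", timeout=30):
    """Send the request and return the citation text (needs network access)."""
    with urllib.request.urlopen(citation_request(doi, style), timeout=timeout) as response:
        return response.read().decode("utf-8").strip()


def to_markdown(citations):
    """Render citation strings as a Markdown bullet list."""
    return "\n".join(f"- {text}" for text in citations)


def to_html(citations):
    """Render citation strings as an HTML unordered list, escaping special characters."""
    items = "\n".join(f"  <li>{html.escape(text)}</li>" for text in citations)
    return f"<ul>\n{items}\n</ul>"
```

Calling `to_markdown(fetch_citation(d) for d in dois)` on a list of DOIs gives a block that can be pasted into a website page. pybtex remains the better choice when you need control over sorting, grouping, or a custom style.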
