Open science. This term has been thrown around quite a lot lately. You have seen that hashtag on Twitter. But there is something odd about it, isn’t there? What is “open science”? “Open” as in open-minded, seeking new ideas? “Open” as in transparent, not lying, not hiding any failed experiments? “Open” as in open source, inviting everyone to contribute? Or “open” as in available, providing everyone with access to your data? But that would be a tautology, wouldn’t it? Science already has all these qualities by definition… Doesn’t it?
Well, I hate to tell you this, but that is not always the case. I’ll skip the issue of the limited number of open-access journals and the fact that the majority of research articles are behind paywalls, which makes knowledge pricey, if not unavailable, for many. That is a topic for a separate post. Let’s focus on a problem lying at the heart of science: the data.
The results section is often the biggest part of a research paper; however, the raw or preprocessed data is rarely available. What’s the big deal? The final analysis graphs in a paper are just the tip of the research iceberg. Of course, we cannot include EVERYTHING in the paper, and page limits actually ensure a reasonable amount of information to digest. Nonetheless, the raw data and detailed information about how it was generated and preprocessed are necessary to reproduce the results. Sharing them also has a number of other benefits:
1. Learning. Trying to reproduce the figures from a paper helps you understand how the analysis was done. It gives students insight into the scientific process and an opportunity to work with real-life data.
2. Sanity check. Having access to someone else’s data lets you check whether that weird spike in your data is an artefact or whether it has been observed before but was not discussed extensively in the literature. Running your analysis pipeline on other data also helps to validate your analytical approach.
3. Tracing mistakes. Nobody’s perfect. We all make mistakes, but they will never be discovered if a transparent record of what has been done is not available.
4. Public transparency. Most of our research is funded by taxpayers. If everyone is paying for it, they deserve to be able to see the outcome of our work and use it.
5. Reanalysis. Another person working with your data might focus on a different aspect of it or use a newly developed statistical method to address new questions.
6. Meta-analysis. Having access to data from many experiments of the same kind is an excellent opportunity to perform a meta-analysis and draw more robust conclusions.
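To make the last point a bit more concrete, here is a minimal sketch of how results from several shared datasets could be pooled. It assumes the simplest textbook approach, a fixed-effect meta-analysis with inverse-variance weighting; the function name `pooled_effect` and all the numbers are made up for illustration, and a real meta-analysis would also have to deal with heterogeneity between studies.

```python
import math


def pooled_effect(effects, std_errors):
    """Fixed-effect (inverse-variance) pooling of per-study effect sizes."""
    if len(effects) != len(std_errors) or not effects:
        raise ValueError("need one standard error per effect, and at least one study")
    # Each study is weighted by the inverse of its variance (1 / SE^2),
    # so more precise studies count for more.
    weights = [1.0 / se ** 2 for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, effects)) / total_weight
    # The pooled estimate is more precise than any single study.
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled, pooled_se


# Hypothetical effect sizes and standard errors from three shared datasets.
effects = [0.42, 0.35, 0.50]
std_errors = [0.10, 0.20, 0.10]
pooled, pooled_se = pooled_effect(effects, std_errors)
print(f"pooled effect = {pooled:.3f} +/- {pooled_se:.3f}")
```

The pooled standard error is smaller than that of any individual study, which is the sense in which combining experiments gives more robust conclusions. None of this is possible unless the underlying data, or at least the per-study estimates, are available.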
Ok, so since publishing data has so many pros and is in line with the open science philosophy (which, in fact, is simply good scientific practice), why is it not a common thing to do in neuroscience?
The issue has both technological and psychological grounds. Raw data tends to be bulky. When you perform an experiment, you want to keep records of everything. Most of it will be filtered out later during the analysis, but you never know in advance which bit of information will turn out to be completely useless and which will be crucial for explaining an unexpected phenomenon.
In neuroscience this is of particular importance, as we often work with animals and want to avoid at all costs having to repeat an experiment because something was not properly documented. On top of that, neuroscientific data is often multidimensional (think of an EEG signal recorded simultaneously from electrodes placed all over your head) and multimodal (think of an electrophysiological signal combined with images and behavioural scores of genetically heterogeneous animals). Each lab uses different instruments and has its own policy on how to organise the data, often using custom-written software for this purpose. This leads to the technical issues of where to store such a volume of information and in what format.
This automatically makes sharing problematic. Even if we had a good public database for storing neuroscientific data, the long-standing lack of any universal standard for data management means that researchers would have to go out of their way to convert their data to a compatible format. That can be daunting, especially given all the other things piling up on their desks. Then there is the psychological problem of being too attached to the data you collected. You put so much effort into collecting it and you perform very hard experiments, so you want to take full credit for any findings coming out of it. I remember attending a discussion panel about the NFDI initiative (see below) and public databases. One of the attendees asked: “So it means everyone will be able to look into and use my data?” Panelist: “Yeah, that’s the idea.” Attendee: “But… it’s MY DATA.” Such an emotional connection is very limiting. If someone else is interested in using your data, it means your research has a greater impact, and you will be cited in their publication, increasing your score on Google Scholar.
Some fields of biology, such as genomics, have dealt with these problems and have well-organised public databases, e.g. GenBank, where DNA and RNA sequences are submitted together with information about the methodology. Such a database offers a systematic pool of knowledge, a point of reference and a base for comparative analysis tools (see BLAST). What about neuroscience? The community has started recognising the need for standardised ways of sharing methodology and data, and there is already a plethora of ideas floating around. Let me introduce you to some of the initiatives.
National Research Data Infrastructure Germany (Nationale Forschungsdaten Infrastruktur; NFDI) is an association established in 2020 aiming to “create a permanent digital repository of knowledge”. Its team of experts in neuroscience, data science and computer science plans to introduce a carefully curated data storage system that would help German researchers with data management and sharing “according to FAIR principles (Findable, Accessible, Interoperable and Reusable)”. NFDI Neuroscience does not offer any tools yet. In their publication they provide an overview of the issues I mentioned earlier and describe themselves as an open-community network that aims to build upon specific solutions already existing in the field. This modular approach to tool development would cater to the diverse needs of the neuroscientific community. On top of that, one vital part of the NFDI work that is already in place is education. They organise workshops and webinars and offer training in data management. This is critical, as part of the problem is the lack of such knowledge in the community.
Neurodata Without Borders
While NFDI is a great initiative and respects international collaborations, it is Germany-based and we need more global solutions. EBRAINS is an infrastructure developed by the Human Brain Project that provides tools and a diverse database for neuroscientists. The database is organised around a knowledge graph and each submission must contain specific metadata. However, the format and structure of the data files may vary. Neurodata Without Borders (NWB) is an initiative that offers a tool for transforming your data into a unified format that is compatible with the two most common frameworks used in neuroscience: Python and MATLAB. The data is then submitted to the DANDI platform, where it is freely available. Their first pilot project focused on neurophysiological data. Having everyone use the same standardised format would make reusing data much easier, but at the same time it takes effort to learn the format and get used to it. Some big labs and institutes have already adopted the NWB format (e.g. the Allen Institute for Brain Science in their cell types database).
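To give a feeling for what “each submission must contain specific metadata” means in practice, here is a minimal sketch of a pre-submission check. The field names in `REQUIRED_FIELDS` and the `missing_metadata` helper are hypothetical and chosen only for illustration; they are not the actual EBRAINS, NWB or DANDI schema, and every real repository defines its own requirements.

```python
# Hypothetical required fields; a real repository defines its own schema.
REQUIRED_FIELDS = (
    "species",
    "recording_modality",
    "sampling_rate_hz",
    "lab",
    "licence",
)


def missing_metadata(record):
    """Return the required fields that are absent or empty in a metadata record."""
    return [field for field in REQUIRED_FIELDS if record.get(field) in (None, "")]


# A made-up description of one recording session.
session = {
    "species": "Mus musculus",
    "recording_modality": "extracellular electrophysiology",
    "sampling_rate_hz": 30000,
    "lab": "",  # filled in by nobody, as often happens
}

print(missing_metadata(session))
```

A check like this is trivial to run when the recording is made and tedious to reconstruct years later, which is one reason why agreeing on the required fields up front matters so much.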
International Brain Laboratory
Having data deposited on widely accessible platforms not only has the benefit of being at hand for anyone who wants to question / verify / reuse / learn about your data, but it also facilitates collaborative efforts, such as the International Brain Laboratory. IBL is a consortium fostering the collaboration of many teams across the world that are attempting to unravel the brain circuits involved in complex behaviours. To be able to validate their methods and understand discrepancies between results, it is essential that they share the same standards for data collection, storage and analysis. In this way, they also tackle the issue of the reproducibility of scientific results (see their publication on a reproducible decision-making paradigm in mice). A collaboration of 22 groups across 6 countries also requires full transparency in communication. Their attempts have been quite successful; however, in their report they also highlight the challenges associated with their approach, namely a huge volume of data. And that is not just the experimental data, but also the record of the whole process of collaboration and the lengthy paragraph about authorship that has to fit in the paper. On the other hand, such extensive documentation, and the awareness that many people interact with your bit of data, increases the sense of responsibility and the overall quality of work.
I have presented only a few examples of what is currently being done to tackle the issues of open science, data management, collaboration and reproducibility in neuroscience. There are more undertakings, which is a positive sign that the community is addressing this important issue. However, none of these initiatives has taken off on a really big scale yet. I hope that in the near future neuroscientists will start using the tools that have been created, as standardisation and sharing are necessary for the field to progress.