No scientist is an island: Working smarter to share data derivatives
In collaborative scientific research involving human subjects, the same datasets are often used by multiple researchers. Researchers frequently transform variables for their models or analyses, creating what are known as data derivatives. Tracking and sharing these derivatives can be valuable not only to the original research team, but also to collaborators and the broader scientific community.
Our work aims to help researchers produce accurate and reproducible data derivatives in an accessible manner. When data derivatives lack accuracy, reproducibility, and accessibility, researchers may end up duplicating efforts and coding errors may persist. We propose workflows, best practices, and an example/template to help research teams manage their data derivatives using a derived data registry.
This work was developed at Neurohackademy 2024 by researchers in psychology and neuroscience for use with human subjects data, but may be adapted for use in other contexts. If you have questions or feedback, please reach out to us by creating an issue on GitHub.
Why is this important?
In common lab structures, work on data derivatives is often duplicated, data and knowledge are lost when people leave, code may not be reproducible by others or may go unchecked, and sharing code and variables can take a massive effort that requires intensive code cleaning.
We believe that
- Transparently and systematically tracking data derivatives promotes efficiency, accuracy, and reproducibility in science. Being more systematic with the generation of data derivatives will save you both time and energy in the long run. Your future self will be grateful! Additionally, following or adapting this proposed workflow will make you a better scientist: your coding skills will improve, your analytic processes will be streamlined, and your results will be more reproducible.
- Contributing to a lab’s derived data registry can promote good lab citizenship and cohesion. You put in the work to create a derivative, and now that effort can be paid forward! Using a derived data registry will prevent knowledge loss when lab members graduate and/or leave for other jobs. By developing, testing, and reviewing the code, you will also help contribute to a lab culture that values cooperation, kindness, learning, and growth. A psychologically safe lab environment will accelerate learning and knowledge-sharing.
- Creating and maintaining a derived data registry fosters collaboration. Ever received an email from a collaborator asking about a variable you derived a few years ago? Maybe you’re able to find it somewhere deep in your analysis code, or maybe not… If you had contributed the variable and the script to a derived data registry, then you would be able to quickly find and send what your collaborator is asking for! This can help standardize data derivatives across large collaborations and consortia.
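To make the idea of a registry entry more concrete, here is a minimal sketch in Python. It is an illustration only, not the format used by this project's template: the `DerivedVariable` class, the `register` helper, the `anxiety_total` variable, and the `anx_1` to `anx_3` item names are all hypothetical. The point is simply that each derivative is stored together with its description, its source variables, its author, and the code that produces it.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class DerivedVariable:
    """One registry entry: what the derivative is and how to rebuild it."""
    name: str
    description: str
    sources: List[str]  # primary-data variables the derivative depends on
    author: str
    derive: Callable[[Dict[str, float]], float]


def anxiety_total(row: Dict[str, float]) -> float:
    """Sum three hypothetical questionnaire items into a total score."""
    return row["anx_1"] + row["anx_2"] + row["anx_3"]


# Hypothetical in-memory registry, keyed by derivative name.
REGISTRY: Dict[str, DerivedVariable] = {}


def register(entry: DerivedVariable) -> None:
    """Add an entry, refusing to silently overwrite an existing name."""
    if entry.name in REGISTRY:
        raise ValueError(f"{entry.name!r} is already registered")
    REGISTRY[entry.name] = entry


register(DerivedVariable(
    name="anxiety_total",
    description="Sum of items anx_1 to anx_3 (hypothetical scale)",
    sources=["anx_1", "anx_2", "anx_3"],
    author="example_author",
    derive=anxiety_total,
))

# Later, a collaborator asks how the variable was built: look it up by name.
participant = {"anx_1": 2, "anx_2": 1, "anx_3": 3}
entry = REGISTRY["anxiety_total"]
score = entry.derive(participant)
```

In a real registry the entries would more likely live in version-controlled scripts and a shared data dictionary than in an in-memory dictionary, but the lookup-by-name pattern is the same.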
Who is this for?
Derived data registries were designed for a group of researchers working together from the same primary data source. However, this structure can also be adopted by researchers working on a dataset alone, both to stay organized and to prepare code for future sharing.
The flexible structure of a derived data registry means that it can be created for a dataset, project team, lab, or even for collaboration teams across sites. Creating derived data registries at larger scales can help disseminate code for derived variables across datasets, but requires greater coordination to maintain.
Co-curation of derived data registries can help build a culture of collaboration, mutual accountability, and coordination. These registries may be especially helpful for research groups looking for more systematic ways to collaborate and to share and review code.
Image reference
This image was created with the help of AI on Canva.