Research Guides: Data Management and Sharing Plans: Data Documentation & Organization

Data Documentation

Data documentation, also known as metadata, helps you understand your data in detail, and also helps other researchers find, use, and properly cite your data.

Various metadata standards are available for particular file formats and disciplines. Create a ReadMe.txt file, a codebook, or a coding manual to accompany your data files. As you collect or create your data, capture the following information:

Make a note of all file names and formats associated with the project, how the data is organized, how the data was generated (including any equipment or software used), and information about how the data has been altered or processed.
Include an explanation of codes, abbreviations, or variables used in the data or in the file naming structure.
Keep notes about where you got the data so that you and others can find it.

Things to document about your data:

Title: Name of the dataset or research project that produced it
Creator: Names and addresses of the organization or people who created the data
Identifier: Number used to identify the data, even if it is just an internal project reference number
Dates: Key dates associated with the data, including project start and end dates, data modification date, data release date, and time period covered by the data
Subject: Keywords or phrases describing the subject or content of the data
Funders: Organizations or agencies who funded the research
Rights: Any known intellectual property rights held for the data
Language: Language(s) of the intellectual content of the resource, when applicable
Location: If the data relates to a physical location, record information about its spatial coverage
Methodology: How the data was generated, including equipment or software used, experimental protocol, and other details you might include in a lab notebook
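
The fields listed above can be collected into a plain-text ReadMe file. The following is a minimal sketch in Python, not a required format: the function name build_readme, the "NOT YET DOCUMENTED" marker, and every value in the example are hypothetical placeholders chosen for illustration.

```python
from datetime import date

# Field names follow the list above; the order is an assumption.
FIELDS = ["Title", "Creator", "Identifier", "Dates", "Subject", "Funders",
          "Rights", "Language", "Location", "Methodology"]


def build_readme(metadata: dict[str, str]) -> str:
    """Return ReadMe text with one labelled line per documentation field."""
    lines = [f"ReadMe generated {date.today().isoformat()}", ""]
    for field in FIELDS:
        # Flag gaps explicitly so missing documentation is visible.
        value = metadata.get(field, "").strip() or "NOT YET DOCUMENTED"
        lines.append(f"{field}: {value}")
    return "\n".join(lines) + "\n"


# All values below are made up for the example.
example = {
    "Title": "Example stream temperature survey",
    "Creator": "Example Lab, Example University",
    "Identifier": "PROJ-0001",
    "Language": "English",
}

readme_text = build_readme(example)
print(readme_text)
```

Saving the returned text as ReadMe.txt next to the data files keeps the documentation with the data, and any field still marked as undocumented shows what remains to be filled in.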

Data Documentation Resources

Data dictionaries, ReadMe.txt files, and codebooks are all good ways of documenting your data. ReadMe.txt files provide information about your data files and help ensure that they can be correctly interpreted by anyone using them. Data dictionaries describe each element of your dataset: the variable names and values in your spreadsheets. Codebooks are more detailed than data dictionaries. They may include the information found in a ReadMe.txt file, and they describe both the elements of your dataset and the instruments used to gather the data (surveys, interview questions).
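
As one illustration of a data dictionary, the sketch below drafts one from a small CSV sample using Python's standard csv module. It only lists each variable with a few of its observed values; the descriptions still have to be written by hand. The function name draft_data_dictionary, the column names, and the sample data are assumptions made for this example.

```python
import csv
import io


def draft_data_dictionary(csv_text: str, max_values: int = 5) -> list[dict[str, str]]:
    """List each column with a sample of its distinct values and a blank description."""
    reader = csv.DictReader(io.StringIO(csv_text))
    observed = {name: [] for name in (reader.fieldnames or [])}
    for row in reader:
        for name, value in row.items():
            # Skip cells from malformed rows (extra or missing columns).
            if name is None or value is None:
                continue
            if value not in observed[name]:
                observed[name].append(value)
    return [
        {
            "variable": name,
            "observed_values": "; ".join(values[:max_values]),
            "description": "",  # to be filled in by the researcher
        }
        for name, values in observed.items()
    ]


# Hypothetical sample data for the example.
sample = "site_id,temp_c,flag\nA1,12.5,0\nA2,13.1,1\nA1,12.9,0\n"

dictionary = draft_data_dictionary(sample)
for entry in dictionary:
    print(entry)
```

A draft like this is only a starting point: the value of a data dictionary comes from the descriptions, units, and code explanations added afterward.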

The following resources provide additional information on how to create these documents:

Guide to writing "readme" style metadata
Cornell's Research Data Management Service Group explains how to create a readme file, which provides information about a data file and is intended to help ensure that the data can be correctly interpreted, by yourself at a later date or by others when sharing or publishing data.
README.txt
Blog post from Kristin Briney, author of Data Management for Researchers, that discusses the benefits and origins of README files and provides examples.
How to Make a Data Dictionary
Open Science Framework (OSF) explains how to create a data dictionary. The purpose of a data dictionary is to explain what all the variable names and values in your spreadsheet really mean.
Codebook Cookbook (PDF)
A guide to writing a good codebook for data analysis projects in medicine. Writing a codebook is an important step in the management of any data analysis project. The codebook will serve as a reference for the clinical team; it will help newcomers to the project to rapidly have a flavor of what is at stake and will serve as a communication tool with the statistical unit.
How to Share Data with a Statistician
The goals of this guide are to provide some instruction on the best way to share data to avoid the most common pitfalls and sources of delay in the transition from data collection to data analysis. Advice includes writing a codebook.
ICPSR Guide to Codebooks (PDF)
Guide from ICPSR on creating codebooks with examples included.

Data Organization

Data organization includes having a consistent folder and file structure, using sustainable file formats, and following an established file naming convention. If you need a refresher on file naming conventions and sustainable file formats, please visit the previous page: Data File Naming & Management.

File Structure

When organizing your data, use a consistent file structure so that you'll always be able to find your files. Your ReadMe.txt file should record the file structure you decide on for your project in addition to your other data documentation (file name conventions, abbreviations, variables, etc.). The ReadMe.txt file should be located at the very top of your file structure hierarchy so it is easy to locate.

Create separate folders for your raw data, processed data, code and outputs, and documentation to avoid confusion. All file names should follow the file naming convention you have established.
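
A folder layout like the one described above can be created with a short script. This is a minimal sketch using Python's pathlib, not a prescribed structure: the folder names, the choice to keep code and outputs in separate folders, and the function name create_project_skeleton are all assumptions for illustration. The example runs in a temporary directory so it leaves nothing behind.

```python
from pathlib import Path
import tempfile

# Illustrative folder names; use whatever names your project has agreed on.
FOLDERS = ["raw_data", "processed_data", "code", "outputs", "documentation"]


def create_project_skeleton(root: Path) -> list[Path]:
    """Create the project folders and a top-level ReadMe.txt recording the layout."""
    root.mkdir(parents=True, exist_ok=True)
    created = []
    for name in FOLDERS:
        folder = root / name
        folder.mkdir(exist_ok=True)
        created.append(folder)
    readme = root / "ReadMe.txt"
    if not readme.exists():  # never overwrite existing documentation
        listing = "\n".join(f"  {name}/" for name in FOLDERS)
        readme.write_text(f"Folder structure:\n{listing}\n", encoding="utf-8")
    return created


# Demonstrate in a throwaway directory; "example_project" is a placeholder name.
with tempfile.TemporaryDirectory() as tmp:
    project_root = Path(tmp) / "example_project"
    folders = create_project_skeleton(project_root)
    top_level = sorted(p.name for p in project_root.iterdir())
    print(top_level)
```

Because the script places ReadMe.txt at the top of the hierarchy and lists the folders in it, the recorded structure and the actual structure start out in agreement.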

File Structure Resources

There are several protocols that can be followed for structuring your files.

TIER Protocol from Open Science Framework. The OSF includes a clonable template of the TIER Protocol that creates a new hierarchy of folders matching its specifications. It works well with OSF as well as GitHub, Google Drive, and Dropbox.
Reproducible Science template is a more complicated file structure that follows Cookiecutter Data Science. It provides a standardized, flexible project structure for conducting and sharing data science work.