How to find and evaluate health datasets

What steps can I take to evaluate the credibility and usability of a dataset?

  1. Look for supporting documentation outlining what the data is, how it was collected, and how to interpret the data.
    1. Tip: Look for readme files, data dictionaries/codebooks, and a collection methodology
  2. Make sure you can open all files associated with the data.
  3. Ensure that all files are clearly labeled and store the information and/or data that is indicated in the file name.
  4. Within the data files, check for the following:
    1. Variables are clearly labeled with standard naming conventions.
      1. Example: First names are labeled as FirstName and last names are labeled as LastName
    2. Units of measurement for different variables are explicitly stated.
      1. Example: You can tell if measurements are given in centimeters (cm) or inches (in)
    3. Each variable contains a discrete unit of information.
      1. Example: blood pressure and zip code are stored in separate columns
    4. Variables follow data standards and have consistent formatting.
      1. Example: All dates are in yyyy-mm-dd format
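
Some of the checks in step 4 can be partly automated. The sketch below is one possible approach, not a standard tool: it flags column names that break an assumed naming convention (letters, digits, and underscores only) and date values that are not in yyyy-mm-dd format. The sample data, the column names, and the helper functions `check_column_names`, `is_iso_date`, and `check_date_column` are all hypothetical and exist only for illustration.

```python
import csv
import io
import re
from datetime import datetime

# Assumed naming convention: starts with a letter, then letters, digits, or underscores.
NAME_PATTERN = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")


def check_column_names(header):
    """Return the column names that do not match the assumed convention."""
    return [name for name in header if not NAME_PATTERN.match(name)]


def is_iso_date(value):
    """Return True if value is a real calendar date written as yyyy-mm-dd."""
    # The regex enforces zero-padding; strptime then rejects impossible dates.
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", value):
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    return True


def check_date_column(rows, column):
    """Return (row_number, value) pairs whose date is not in yyyy-mm-dd format."""
    return [
        (number, row[column])
        for number, row in enumerate(rows, start=1)
        if not is_iso_date(row[column])
    ]


# Made-up sample data; a real dataset would be read from a file instead.
SAMPLE = """FirstName,LastName,Height_cm,Visit Date
Ana,Lopez,162,2023-04-01
Ben,Ito,175,04/02/2023
"""

reader = csv.DictReader(io.StringIO(SAMPLE))
rows = list(reader)

bad_names = check_column_names(reader.fieldnames)
bad_dates = check_date_column(rows, "Visit Date")

print("Column names to review:", bad_names)
print("Dates to review:", bad_dates)
```

A script like this only surfaces candidates for review. Whether a flagged name or value is actually a problem still depends on the dataset's documentation.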

MU Health Patient Populations

i2b2: Allows researchers to search and download deidentified disease, procedure, and prescription data on patients treated at University of Missouri Healthcare. Maintained by the University of Missouri System’s NextGen Biomedical Informatics Center.

Note: i2b2 data is available only to students, staff, and faculty within the University of Missouri System. IRB approval is required to obtain identifiable patient information.

Specific Diseases and Procedures

Registries collect and store data on a specific disease, condition, or procedure.

NIH List of Registries: Outlines nationwide registries on a variety of diseases and procedures such as cancer, cystic fibrosis, Down syndrome, epilepsy, and ALS.

Health Data and Statistics

Health surveys gather information from patients, healthcare providers, or organizations on a particular disease, behavior, or population.

The Guide to HHS Surveys and Data Resources: Lists health surveys sponsored by the U.S. Department of Health and Human Services that provide data on substance abuse, healthcare facilities, immunizations, vital statistics, nutritional status, healthcare costs, and many other topics.

FastStats: An index of statistics on health topics, maintained by the National Center for Health Statistics.

Diverse Populations

All of Us Research Program: Provides researchers with access to genomic variants, survey responses, physical measurements, electronic health record data, and wearable data from diverse populations across the United States.

Mizzou-affiliated researchers must create an account, complete mandatory training, and sign a Data User Code of Conduct to access the data.

Dataset Search Platforms

National Library of Medicine Dataset Search: Enables users to search by keyword for datasets from dbGaP, Dryad, ImmPort, Harvard Dataverse, Borealis, Texas Data Repository, and UCLA Dataverse. The National Library of Medicine plans to expand the number of searchable repositories.

Inter-university Consortium for Political and Social Research (ICPSR): A vast data archive spanning multiple disciplines such as geography, politics, medicine, and economics. Most datasets can be downloaded after accepting a data use agreement, though some require an application for access.

DataONE: Searches across a collection of environmental data repositories for data on pollution, oceanography, climatology, zoology, and other fields related to ecology.

DataHub: Provides datasets on a range of topics such as healthcare, broadband, pollution, economics, and education. Some datasets are free to download, while others are available for a fee.

Google Dataset Search: Locates datasets from a variety of disciplines, locations, and time periods through keyword searches.