Text & Data Mining (TDM)

Freely Available Resources and Tools

The following resources are available to any researcher, regardless of institutional affiliation.

CrossRef DOI Registry Agency
Coverage: Allows access to metadata records for over 75 million scholarly works that have CrossRef DOIs, covering around 5000 publishers.  Can be used for text and data mining, checking against funder mandates, and to obtain metadata in a variety of representations.
Access method: General search interfaces and various APIs. Results are available in JSON, text, and XML.
Limitations: No stated limitations on data use; coverage may be limited by publisher participation
For more information: Text and data mining for researchers or email support@crossref.org
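
As a minimal sketch of the API access described above, the following Python queries the public Crossref REST endpoint at https://api.crossref.org/works and reads the JSON result. The helper names (build_works_url, extract_records, search_works) are hypothetical, and the fields read from the response (message, items, DOI, title) reflect the commonly documented response layout; check the Crossref documentation before relying on them.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

CROSSREF_WORKS_URL = "https://api.crossref.org/works"


def build_works_url(query, rows=5, mailto=None):
    """Build a search URL for the Crossref works endpoint."""
    params = {"query": query, "rows": rows}
    if mailto:
        # Optional contact address; Crossref asks polite users to identify themselves.
        params["mailto"] = mailto
    return CROSSREF_WORKS_URL + "?" + urlencode(params)


def extract_records(payload):
    """Pull the DOI and first title out of each item in a parsed JSON response."""
    items = payload.get("message", {}).get("items", [])
    return [
        {"doi": item.get("DOI"), "title": (item.get("title") or [None])[0]}
        for item in items
    ]


def search_works(query, rows=5, mailto=None):
    """Run the search over the network and return simplified records."""
    with urlopen(build_works_url(query, rows, mailto), timeout=30) as response:
        return extract_records(json.load(response))
```

Calling search_works("text mining", rows=3) would return up to three records as dictionaries with doi and title keys; titles may be missing for some works, in which case the value is None.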

Digital Public Library of America (DPLA) metadata
What it does: Allows programmatic access to metadata in DPLA collections, including partner data from Harvard, HathiTrust, Digital Virginias, New York Public Library, ARTstor, and others
Access method: DPLA metadata is accessible by API or as zipped JSON files for bulk download. An API key must be requested
For more information: API Codex or email codex@dp.la

Early English Books Online Text Creation Partnership (EEBO-TCP)
Coverage: Books printed in England, Ireland, Scotland, Wales and British North America and works in English printed elsewhere from 1473–1700
Access method: Full-text access and search tools are available to all via the University of Michigan EEBO-TCP site; downloadable full-text files are available here; HTML, ePUB, and TEI P5 XML copies are available through the Oxford Text Archive.
Limitations: No limitations on openly available data; access via ProQuest is subject to terms of use
For more information: https://textcreationpartnership.org/faq/ or email tcp-info@umich.edu

Eighteenth Century Collections Online Text Creation Partnership (ECCO-TCP)
Coverage: English-language and foreign-language titles printed in the United Kingdom during the 18th century, along with thousands of important works from the Americas
Access method: Multiple ways to access, listed here (https://textcreationpartnership.org/tcp-texts/ecco-tcp-eighteenth-century-collections-online/)
For more information: https://textcreationpartnership.org/faq/ or email tcp-info@umich.edu

Internet Archive eBooks and Texts
Coverage: Over 11 million fully accessible books and texts
Access method: Searchable by web interface, with multiple download formats for individual works; instructions for bulk download are available here
For more information: Internet Archive Ebooks and Texts or email info@archive.org

New York Times
Coverage: metadata and some content from New York Times articles 1851-present
Access method: Multiple APIs are available for different uses; full list here
Access restrictions: Free to access with registration and acceptance of terms of use
Limitations: Noncommercial use only, and users must agree to terms of use; API calls are limited to 1,000 per day and 5 per second
For more information: http://developer.nytimes.com/ or email code@nytimes.com
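
A script that loops over many requests can exceed the call limits above quickly. The following is a rough sketch of a client-side throttle, not part of any New York Times library: the class name CallBudget and its methods are hypothetical, and it assumes the daily limit behaves like a rolling 24-hour window, which may not match how the service actually resets its quota.

```python
import time
from collections import deque

SECONDS_PER_DAY = 86400


class CallBudget:
    """Track call timestamps and report how long to wait before the next call."""

    def __init__(self, per_second=5, per_day=1000, clock=time.monotonic):
        self.per_second = per_second
        self.per_day = per_day
        self.clock = clock
        self.recent = deque()  # timestamps of calls made within the last day

    def wait_time(self):
        """Return the number of seconds to pause before the next call is allowed."""
        now = self.clock()
        # Forget calls that are more than a day old.
        while self.recent and now - self.recent[0] >= SECONDS_PER_DAY:
            self.recent.popleft()
        # Daily budget used up: wait until the oldest call ages out.
        if len(self.recent) >= self.per_day:
            return self.recent[0] + SECONDS_PER_DAY - now
        # Per-second budget: compare against the call made per_second calls ago.
        if len(self.recent) >= self.per_second:
            gap = now - self.recent[-self.per_second]
            if gap < 1.0:
                return 1.0 - gap
        return 0.0

    def record(self):
        """Note that a call was just made."""
        self.recent.append(self.clock())
```

In a request loop, one would call time.sleep(budget.wait_time()) before each request and budget.record() right after sending it.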

PLOS
Coverage: Every PLOS article, including all Articles and Front Matter. It does not include Figures or Supplemental Data.
Access method: Bulk download from https://allof.plos.org/allofplos.zip or visit https://api.plos.org/
For more information: PLOS Text and Data Mining
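
Once the bulk file has been downloaded, articles can be read straight from the archive without unpacking it. The sketch below assumes the zip holds one XML file per article, which should be confirmed against the PLOS documentation; the function name iter_article_xml and the local filename in the usage note are hypothetical.

```python
import zipfile


def iter_article_xml(zip_source, limit=None):
    """Yield (filename, raw bytes) for each .xml member of a zip archive.

    zip_source may be a path or a file-like object. The assumption that
    articles are stored as individual .xml members is not confirmed here.
    """
    with zipfile.ZipFile(zip_source) as archive:
        count = 0
        for name in archive.namelist():
            if not name.lower().endswith(".xml"):
                continue  # skip anything that is not an XML file
            if limit is not None and count >= limit:
                break
            yield name, archive.read(name)
            count += 1
```

For example, iterating over iter_article_xml("allofplos.zip", limit=10) would give the first ten XML members, which can then be passed to an XML parser of choice. Because the archive is large, reading members one at a time keeps memory use low.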