Text Mining Collections
PMC includes several large subsets or collections of articles where files for text mining and other purposes are made available under Creative Commons or similar licenses that generally allow more liberal redistribution and reuse than a traditional copyrighted work. License terms may vary by collection or even within a collection. Researchers interested in text mining can download complete collections or retrieve select articles based on a given topic or search. Please note that not all of the articles in PMC are available for this type of bulk download and reuse. (See PMC Copyright Notice for more information.)
To download a collection in PMC for text mining, you must use the designated services (usually the PMC FTP service).
Open Access (OA) Subset
The PMC Open Access Subset contains more than 2.75 million full-text articles. It is the largest collection of articles available via PMC for text mining and other types of reuse.
Articles in the Open Access Subset are made available for download under a Creative Commons or similar license that generally allows more liberal redistribution and reuse than a traditional copyrighted work. OA Subset article downloads make the full text (XML, PDF, and .txt), images and supplementary materials available.
The OA Subset is divided into two collections:
- A Commercial Use Collection, which includes only OA Subset articles that carry a machine-readable “CC BY” or “CC0” license.
- A Non-Commercial Use Collection, which includes OA Subset articles whose license restricts reuse to non-commercial applications or whose license terms are not available in a machine-readable Creative Commons format.
To access the complete OA Subset, you should download both of these Collections.
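The split between the two collections can be sketched as a small rule. This is only an illustration of the description above, not PMC's actual assignment logic; the function name `oa_collection_for`, the label normalization, and the use of `None` for "no machine-readable license" are all assumptions made for the example.

```python
from typing import Optional

# License labels described above as qualifying for the Commercial Use Collection.
COMMERCIAL_LICENSES = {"CC BY", "CC0"}


def oa_collection_for(license_label: Optional[str]) -> str:
    """Return the OA Subset collection an article would fall into.

    `license_label` is a hypothetical machine-readable license label;
    None stands for "no machine-readable Creative Commons license".
    """
    if license_label is None:
        return "non-commercial"
    # Normalize case, hyphens, and spacing so "cc-by" matches "CC BY".
    normalized = " ".join(license_label.upper().replace("-", " ").split())
    if normalized in COMMERCIAL_LICENSES:
        return "commercial"
    return "non-commercial"


if __name__ == "__main__":
    for label in ("CC BY", "CC0", "CC BY-NC", None):
        print(label, "->", oa_collection_for(label))
```

Because anything that is not exactly "CC BY" or "CC0" falls through to the non-commercial branch, the sketch errs on the restrictive side for unknown or missing license labels.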
Search
Find all Open Access Subset articles in:
- PMC with this search filter: open access[filter]
- PubMed with this search filter: pubmed pmc open access[filter]
Learn about additional search filters that restrict results to certain license types.
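For scripted searches, the filters above can be combined with a topic and sent to the NCBI E-utilities ESearch endpoint. The sketch below only builds the request URL and does not contact NCBI; the helper name `oa_search_url`, the default `retmax`, and the choice of JSON output are assumptions for illustration, and the E-utilities usage guidelines should be consulted before issuing requests in bulk.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities ESearch service.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Open Access Subset filters listed above, keyed by database.
OA_FILTERS = {
    "pmc": "open access[filter]",
    "pubmed": "pubmed pmc open access[filter]",
}


def oa_search_url(topic: str, db: str = "pmc", retmax: int = 20) -> str:
    """Build an ESearch URL restricting `topic` to Open Access Subset articles."""
    if db not in OA_FILTERS:
        raise ValueError(f"unsupported database: {db!r}")
    term = f"({topic}) AND {OA_FILTERS[db]}"
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(oa_search_url("zika virus"))
    print(oa_search_url("zika virus", db="pubmed"))
```

Wrapping the topic in parentheses keeps a multi-word or Boolean topic from interacting with the `AND` that attaches the filter.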
Download
The PMC OA Subset articles are available for download via the FTP service, PMC OAI-PMH, the OA Web Service, and BioC API.
Author Manuscript Collection
The Author Manuscript Collection contains the full text (XML and .txt) of author manuscripts that have been made available in PMC in compliance with the NIH Public Access Policy or similar policies of other funders. The collection encompasses all NIH manuscripts posted to PMC since July 2008.
The Author Manuscript Collection contains the full text of more than 650,000 author manuscripts for text mining.
This Collection is distinct from the OA Subset and subject to different terms of use, i.e., these files are available for text mining and may also be used consistent with the principles of applicable copyright law.
Search
Find all Author Manuscripts in:
- PMC with this search filter: author manuscript[filter]
- PubMed with this search filter: author manuscript[filter]
To find only author manuscripts that are included in the Author Manuscript Collection, limit your search by publication date by appending: AND ("2008/07/01"[PubDate] : "3000/12/31"[PubDate])
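The filter and the date limit can be assembled programmatically when queries are generated in a script. The following is a minimal sketch; the function name `author_manuscript_query` and its optional `topic` argument are hypothetical, while the filter text and the default date range are taken from the search instructions above.

```python
from datetime import date
from typing import Optional


def author_manuscript_query(
    topic: Optional[str] = None,
    start: date = date(2008, 7, 1),
    end: date = date(3000, 12, 31),
) -> str:
    """Compose a search string limited to the Author Manuscript Collection."""
    parts = []
    if topic:
        # Parenthesize the topic so its own Boolean operators stay grouped.
        parts.append(f"({topic})")
    parts.append("author manuscript[filter]")
    # Publication date range in the YYYY/MM/DD form used by the search syntax.
    parts.append(f'("{start:%Y/%m/%d}"[PubDate] : "{end:%Y/%m/%d}"[PubDate])')
    return " AND ".join(parts)


if __name__ == "__main__":
    print(author_manuscript_query())
    print(author_manuscript_query("diabetes"))
```

The far-future end date simply leaves the range open-ended, matching the limit shown in the search instructions.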
Download
The Author Manuscript Collection is available for download via the FTP service and BioC API.
Historical OCR Collection
The Historical OCR Collection, spanning more than two centuries of biomedical research, includes content from a subset of journals that participated in NLM's Back Issue Digitization Project (2014-) and Journal Backfiles Digitization Project (2004-2010).
The Historical OCR Collection contains full-text files for more than 130,000 articles, published in the 18th, 19th, and 20th centuries.
Search
Find all Historical OCR articles in:
- PMC with this search filter: historicalocr[filter]
Please note that not all articles returned by this filter are available for text mining.
Download
With the publisher's permission (where applicable), the OCR text files from the journals included in these digitization projects are available for download via the FTP service. The Collection is organized by journal title.