Text Mining Collections

PMC includes several large subsets or collections of articles where files for text mining and other purposes are made available under Creative Commons or similar licenses that generally allow more liberal redistribution and reuse than a traditional copyrighted work. License terms may vary by collection or even within a collection. Researchers interested in text mining can download complete collections or retrieve select articles based on a given topic or search. Please note that not all of the articles in PMC are available for this type of bulk download and reuse. (See PMC Copyright Notice for more information.)

To download a collection in PMC for text mining, you must use the designated services (usually the PMC FTP service).

Open Access (OA) Subset

The PMC Open Access Subset contains more than 2.75 million full-text articles. It is the largest collection of articles available via PMC for text mining and other types of reuse.

Articles in the Open Access Subset are made available for download under a Creative Commons or similar license that generally allows more liberal redistribution and reuse than a traditional copyrighted work. Each OA Subset article download includes the full text (XML, PDF, and .txt), images, and supplementary materials.

Within the OA Subset, there are two collections:

  • A Commercial Use Collection that includes only OA Subset articles that have a machine-readable “CC BY” or “CC0” license.
  • A Non-Commercial Use Collection that includes only OA Subset articles whose license restricts reuse to non-commercial applications or whose license terms are not available in a machine-readable Creative Commons format.

To access the complete OA Subset, you should download both of these Collections.
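
The split between the two collections can be sketched as a small lookup. This is a minimal illustration based only on the rules stated above, not PMC's actual sorting logic; the `collection_for` function and the normalized license labels it accepts (for example "CC BY-NC") are hypothetical.

```python
from enum import Enum


class Collection(Enum):
    COMMERCIAL = "Commercial Use Collection"
    NON_COMMERCIAL = "Non-Commercial Use Collection"


# Licenses that this page names as qualifying for the Commercial Use Collection.
COMMERCIAL_LICENSES = {"CC BY", "CC0"}


def collection_for(license_label):
    """Return the collection an OA Subset article would fall into.

    `license_label` is a hypothetical normalized label such as "CC BY-NC".
    Pass None when no machine-readable license is available.
    """
    if license_label is None:
        # No machine-readable license terms: non-commercial by the rule above.
        return Collection.NON_COMMERCIAL
    # Collapse whitespace and case so "cc  by" and "CC BY" compare equal.
    normalized = " ".join(license_label.upper().split())
    if normalized in COMMERCIAL_LICENSES:
        return Collection.COMMERCIAL
    return Collection.NON_COMMERCIAL
```

Any label other than "CC BY" or "CC0" falls through to the Non-Commercial Use Collection, which mirrors the way the second bullet is worded as the catch-all case.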


Find all Open Access Subset articles in:

Learn about additional search filters that restrict results to certain license types.


The PMC OA Subset articles are available for download via the FTP service, PMC OAI-PMH, the OA Web Service, and BioC API.
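
As a rough illustration of the OA Web Service route, the sketch below builds a request URL for one article and extracts download links from a response. The endpoint URL, the `id` parameter, and the `record`/`link` response layout are assumptions recalled from the service's public documentation and should be verified there; the sample response, its identifier, and its hosts are made up, and no network call is made.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Assumed endpoint of the OA Web Service; confirm against the PMC documentation.
OA_SERVICE_URL = "https://www.ncbi.nlm.nih.gov/pmc/utils/oa/oa.fcgi"


def oa_request_url(pmcid):
    """Build a lookup URL for a single article (the `id` parameter is assumed)."""
    return f"{OA_SERVICE_URL}?{urlencode({'id': pmcid})}"


def parse_oa_links(xml_text):
    """Map each record id to its license and a {format: href} dict of links."""
    root = ET.fromstring(xml_text)
    result = {}
    for record in root.iter("record"):
        links = {link.get("format"): link.get("href") for link in record.iter("link")}
        result[record.get("id")] = {"license": record.get("license"), "links": links}
    return result


# Made-up response in the assumed layout; the identifier and hosts are placeholders.
SAMPLE_RESPONSE = """<OA>
  <records>
    <record id="PMC0000000" license="CC BY">
      <link format="tgz" href="ftp://example.invalid/PMC0000000.tar.gz"/>
      <link format="pdf" href="ftp://example.invalid/PMC0000000.pdf"/>
    </record>
  </records>
</OA>"""

if __name__ == "__main__":
    print(oa_request_url("PMC0000000"))
    print(parse_oa_links(SAMPLE_RESPONSE))
```

For complete collections, the FTP service remains the designated bulk route; a per-article lookup like this is better suited to retrieving select articles.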

Author Manuscript Collection

The Author Manuscript Collection contains the full text (XML and .txt) of author manuscripts that have been made available in PMC in compliance with the NIH Public Access Policy or similar policies of other funders. The collection encompasses all NIH manuscripts posted to PMC since July 2008.

The Author Manuscript Collection contains the full text of more than 650,000 author manuscripts for text mining.

This Collection is distinct from the OA Subset and is subject to different terms of use: the files are available for text mining and may also be used in ways consistent with the principles of applicable copyright law.


Find all Author Manuscripts in:

Limit your search by publication date to find only author manuscripts that are included in the Author Manuscript Collection: AND ("2008/07/01"[PubDate] : "3000/12/31"[PubDate])
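
A minimal sketch of how the date clause above could be appended to a search programmatically. The base author-manuscript filter is not given on this page, so it is passed in as a parameter. The use of the E-utilities `esearch` endpoint with `db=pmc` is an assumption for illustration, and the function names are hypothetical.

```python
from urllib.parse import urlencode

# Date clause quoted from this page: July 2008 onward.
DATE_CLAUSE = '("2008/07/01"[PubDate] : "3000/12/31"[PubDate])'

# Assumed search endpoint (NCBI E-utilities ESearch).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def with_collection_dates(base_query):
    """Restrict `base_query` to the collection's publication date range."""
    return f"({base_query}) AND {DATE_CLAUSE}"


def esearch_url(term, db="pmc"):
    """Build a URL-encoded ESearch request for `term` (no request is sent)."""
    return f"{ESEARCH_URL}?{urlencode({'db': db, 'term': term})}"


if __name__ == "__main__":
    # "BASE_FILTER" is a placeholder for the author manuscript filter.
    term = with_collection_dates("BASE_FILTER")
    print(term)
    print(esearch_url(term))
```

Wrapping the base query in parentheses keeps the `AND` from binding to only part of a compound search.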


The Author Manuscript Collection is available for download via the FTP service and BioC API.

Historical OCR Collection

The Historical OCR Collection, spanning more than two centuries of biomedical research, includes content from a subset of journals that participated in NLM's Back Issue Digitization Project (2014-) and Journal Backfiles Digitization Project (2004-2010).

The Historical OCR Collection contains full-text files for more than 130,000 articles, published in the 18th, 19th, and 20th centuries.


Find all Historical OCR articles in:

Please note that not all articles in this filter are available for text mining.


With the publisher's permission (where applicable), the OCR text files from the journals included in these digitization projects are available for download via the FTP service. The Collection is organized by journal title.

Last updated: Tue, 17 March 2020