Diving into the haystack to make more hay?

Diving into the haystack to make hay is one of the most inefficient activities imaginable (as well as a figurative absurdity). Doing science inevitably entails discovery, but the process has historically been far more difficult without effective tools to support it. With the quickening pace of scholarly communications, the vast volume of information available is overwhelming. The key puzzle piece might be as hard to find as a needle, and moreover, no scholar has the time to dive into the hay-verse to make more hay.

With recent advances in data science and information management, research discovery has become one of the most pressing and fastest growing areas of interest. Reference managers (ReadCube, Mendeley, Zotero) and dedicated services have been experimenting with novel ways to deliver relevant articles to scholars (PubChase, Sparrho, Eigenfactor Recommends). While we have by no means exhausted the ways to innovate and make them mainstream, we have yet to turn our attention to scholarly objects beyond the article. The utility of finding the full research narrative is a given. But the potential value of discovering and accessing scholarly outputs that are created before final results are communicated, or never integrated into an article at all, is almost untapped. Until now.

Scholarly Recommendations

Last year, our partnership with figshare began with the hosting and display of Supporting Information files on the article, accommodating the broad range of file types found in this mixing pot. Both the Supporting Information file viewer and figshare’s PLOS figure and data portal increase the accessibility of the data and article component content associated with our articles. The latter makes PLOS figures, tables, and SI files searchable based upon key terms of interest.

Beyond the article: Today, we continue to build on the figshare offering with the launch of research recommendations, a service which delivers relevant outputs associated with PLOS articles and beyond. This begins to fill a critical need for tools that address the full breadth of research content. Rather than being limited by the article as a container, we can now present a far broader universe of scholarly objects: figures, datasets, media, code, software, filesets, scripts, etc.

figshare recommendations widget

 Hansen J, Kharecha P, Sato M, Masson-Delmotte V, Ackerman F, et al. (2013) Assessing “Dangerous Climate Change”: Required Reduction of Carbon Emissions to Protect Young People, Future Generations and Nature. PLoS ONE 8(12): e81648. doi: 10.1371/journal.pone.0081648

While papers tell the story of the final research narrative, data – the building blocks of science – are especially critical to the progress of research. They underlie the results expressed in a narrative that is published in papers. The most rigorous and expansive path of discovery includes not only related articles, but arguably even more fundamentally, data and a host of research outputs that lead up to the paper or may even be independent of an article. PLOS’ data availability policy was the foundational step, ensuring that data is publicly accessible. Delivering strong recommendations to surface relevant research now adds even more value for the scholarly community.

Beyond the publisher: the recommendations delivered by figshare extend beyond research outputs attached to PLOS publications. In fact, they are retrieved from the entire figshare corpus of over 1.5 million objects. We want to enrich the discovery experience for users with the full breadth of OA research outputs, regardless of whether they have been published as part of a research paper. Not all scholarly outputs fit in an article, but they may well be instrumental to others’ research.

Right at your fingertips

The recommendations are displayed for every PLOS article on the Related Content tab. To select the most closely related ones, figshare uses Latent Semantic Analysis across the entire PLOS corpus to build a “semantic” matrix, which is then used to retrieve a list of the most closely related entries for each article. Five recommendations are displayed, with the option to load more. The type of file is denoted by icons or thumbnails when available, with a preview of the object on hover. The full view of the file is available by clicking on the thumbnail. The content and all its metadata are available on figshare via the highlighted title. Keyword tags are also displayed and can be used to find other associated content of that kind.
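To make the idea of a “semantic” matrix more concrete, here is a minimal sketch of Latent Semantic Analysis in plain Python. It is not figshare's implementation: the toy corpus, the function names, the raw term counts (production systems usually apply TF-IDF weighting), and the power-iteration routine standing in for a proper SVD are all assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus standing in for article and object descriptions.
TOY_CORPUS = [
    "climate carbon emissions warming",
    "carbon emissions reduction policy",
    "primate skeleton fossil eocene",
    "fossil skeleton morphology primate",
]


def tokenize(text):
    """Lowercase and split on whitespace, keeping purely alphabetic tokens."""
    return [w for w in text.lower().split() if w.isalpha()]


def gram_matrix(docs):
    """Return G = A^T A, where A is the term-by-document count matrix."""
    counts = [Counter(tokenize(d)) for d in docs]
    n = len(docs)
    return [[float(sum(counts[i][w] * counts[j][w] for w in counts[i]))
             for j in range(n)] for i in range(n)]


def top_eigenpairs(matrix, k, iterations=300):
    """Top-k eigenpairs of a symmetric PSD matrix via power iteration with deflation."""
    n = len(matrix)
    m = [row[:] for row in matrix]
    pairs = []
    for _ in range(k):
        start = [float(i + 1) for i in range(n)]  # fixed start keeps runs deterministic
        norm = math.sqrt(sum(x * x for x in start))
        v = [x / norm for x in start]
        for _ in range(iterations):
            w = [sum(m[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue for the unit vector v.
        lam = sum(v[i] * m[i][j] * v[j] for i in range(n) for j in range(n))
        pairs.append((lam, v))
        # Deflate so the next pass finds the next-largest component.
        for i in range(n):
            for j in range(n):
                m[i][j] -= lam * v[i] * v[j]
    return pairs


def latent_coordinates(docs, k=2):
    """Coordinates of each document in the k-dimensional latent space."""
    pairs = top_eigenpairs(gram_matrix(docs), k)
    return [[math.sqrt(max(lam, 0.0)) * vec[j] for lam, vec in pairs]
            for j in range(len(docs))]


def cosine(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def recommend(docs, query_index, k=2, top_n=5):
    """Rank the other documents by cosine similarity to the query document."""
    coords = latent_coordinates(docs, k)
    scored = [(cosine(coords[query_index], coords[j]), j)
              for j in range(len(docs)) if j != query_index]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [j for _, j in scored[:top_n]]
```

Calling `recommend(TOY_CORPUS, 0)` ranks the second climate-related entry first, because the two share a latent dimension even though their wording only partly overlaps. The `top_n=5` default simply mirrors the five recommendations shown in the widget.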

figshare recommendations widget2

Franzen JL, Gingerich PD, Habersetzer J, Hurum JH, von Koenigswald W, et al. (2009) Complete Primate Skeleton from the Middle Eocene of Messel in Germany: Morphology and Paleobiology. PLoS ONE 4(5): e5723. doi: 10.1371/journal.pone.0005723

Mark Hahnel, founder of figshare, said “PLOS has continuously demonstrated their desire to advance academic publishing and we’re always very happy to play a part in their innovations. The latest developments will ultimately make figshare content more discoverable and benefits our user base as well as PLOS readers and authors.”

With the figshare recommendations, it is our aim to advance the process of discovery and accelerate the research lifecycle itself. Please check them out on the Related Content tab of every PLOS article, dig into the offerings, and see where your research goes. We welcome your thoughts and reflections. Are these useful and relevant? Would you like them delivered through additional channels? Feel free to comment here or contact @jenniferlin for PLOS information. figshare is also available via email, Twitter, Facebook or Google+.

Cross-posted on figshare blog.

Category: Tech | Leave a comment

A Step-by-Step Approach to Content Management

lemurLoginIllustration

Image credit: Michael Morris

Today at PLOS, we celebrate an important milestone: We launched the first iteration of the new PLOS CMS (codename: Lemur). This first installment is a homepage editor for six of the PLOS journals (Biology, Medicine, Pathogens, Computational Biology, Genetics, and NTDs). Our homepage editor is a browser application that facilitates curating and preparing content to feature on the journal homepages. It lets our journal staff queue up items to feature, write blurbs about each item in a WYSIWYG editor, select images, and perform basic image manipulations. It has a drag-and-drop interface to easily reorder featured content, and as you’d expect, it includes previewing and publishing controls.

Our launch timing, not so coincidentally, corresponds to the launch of our updated homepage design for the aforementioned six journals. The new, more sophisticated homepage design called for sophisticated curation controls, so it was an easy call to make the homepage editor the first functional area for the new PLOS CMS to tackle. The old process for homepage updates was manual and restrictive. But we’re restricted no longer—journal staff now have all the controls they need at their fingertips, without having to be HTML experts.

PLOS’s new CMS is an internal tool… for today. But that could evolve over time, as the scientific publishing community moves to a brave new world that’s “beyond the journal”. I’ll dig deeper into the idea of meeting an evolving landscape by answering the question you’re probably asking yourself right now:

Why are we building our own CMS?

There are several categories of answers to this question, presented here in no particular order:

  • Flexibility: We could have gone with an off-the-shelf CMS. But it’s likely that we would have actually required two different off-the-shelf content management systems: one to manage our web content, and one to manage our article corpus. These two content types generally utilize two very different kinds of CMS. The idea of adopting, and adapting, two different CMSes (or selecting just one type of CMS and heavily customizing it to work with both content types) was not very appealing. This is a particularly unappealing solution when viewed through the lens of our ultimate goal, which is to blur the lines between all of our different content and media types, and interact with and present any combination of them seamlessly, in ways we can’t necessarily predict today. That’s why we have opted to separate the functions of content management, and build a curation layer for pulling the content together in the ways we need.
  • Innovation: We don’t want to miss opportunities to innovate on content delivery. We suspect that innovation would be slowed if we become locked into monolithic systems that are difficult to customize.
  • Technical considerations: Separation of content management functions from a technical perspective makes a lot of sense. You hear of a variety of publishing operations opting to roll their own for the same reasons, including the New York Times. This separation should also make us more agile in the sense of being able to respond relatively quickly to emerging needs, new ideas for innovation, or other interesting technological developments we want to try out.
  • Longer term community ideas: We start today by building Lemur with specific curation tasks in mind. But it’s easy to imagine a future in which anyone could curate PLOS (and other) content, using tools we provide. Imagine societies or educators using our content and curating it to their own needs, which may or may not be tied to the traditional model of the journal. Many examples of how this could evolve spring to mind!

Lean makes it better

We’ve been using Lean methodology to develop the content management system, component by component. We’ve been testing our wireframes throughout the design process. We’ve co-located the entire team in a conference room—since February, mind you! We’re reaping the benefits (and enjoying the process) of pair programming. Close collaboration among developers, UX, and product owner noticeably improves both our product and our velocity, while collective code ownership ensures maintainability. And here’s the part that requires lots of discipline: We release a minimum viable product, so we can gather observations of how it is used and see how it could be improved, rather than anticipate the entire product in a vacuum. As we improve the individual features with the findings from our observations, we will also incorporate that learning in the approaches we take with the rest of the system.

 


Delving into subject areas with PLOS Cloud Explorer

As discussed in a previous post in the PLOS Tech blog, PLOS uses a sophisticated approach to classify research articles according to what they are about. Using machine-aided indexing, articles are associated with subject areas from a thesaurus containing over ten thousand terms. You can now explore an interactive visualization of the entire thesaurus, which uses article data from PLOS journals to show how different fields of research are interrelated, and how that has changed over time. Check it out: PLOS Cloud Explorer.

We made this web app while we were students together in a course on working with open data at the UC Berkeley School of Information last spring. We were interested in doing something with open data pertaining to scholarly literature, which would enable both researchers and curious members of the general public to explore trends in research and interactions between research topics. Naturally, we looked to PLOS as a source for open data about scientific research. Because PLOS publishes open access journals, its articles and metadata are all licensed under Creative Commons Attribution. PLOS has an open search API as well, which provides access to full article data and metadata, including sets of subject area terms for each article, which specify the position of each term within the polyhierarchy of the thesaurus. We wanted to build a tool that would allow users to navigate across fields and reach real articles by harnessing this rich, faceted representation of research areas that is bound to PLOS article data. When we asked about the thesaurus, Rachel Drysdale kindly provided us with a full copy; it’s now also available on GitHub.
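As a rough illustration of working with subject terms from the search API, the sketch below builds a query URL and reads a term's position in the polyhierarchy from slash-delimited subject paths. The endpoint, the Solr-style parameters (`q`, `fl`, `wt`, `rows`), the `subject` field name, and the path format are assumptions to check against the API documentation; the function names and the example paths are hypothetical. No network request is made.

```python
from urllib.parse import urlencode

# Assumed endpoint and parameter names; verify against the PLOS API docs.
SEARCH_ENDPOINT = "http://api.plos.org/search"


def build_subject_query(term, rows=10):
    """Build a search URL for articles indexed with a given subject term."""
    params = {
        "q": f'subject:"{term}"',
        "fl": "id,publication_date,subject",
        "wt": "json",
        "rows": rows,
    }
    return SEARCH_ENDPOINT + "?" + urlencode(params)


def split_subject_path(path):
    """Turn '/Field A/Topic 1/Term X' into ['Field A', 'Topic 1', 'Term X']."""
    return [part for part in path.split("/") if part]


def parents_of(term, subject_paths):
    """Collect every immediate parent of `term`.

    In a polyhierarchy the same term can sit under more than one parent,
    so the result is a set rather than a single value.
    """
    parents = set()
    for path in subject_paths:
        parts = split_subject_path(path)
        for i, part in enumerate(parts):
            if part == term and i > 0:
                parents.add(parts[i - 1])
    return parents
```

For example, given the placeholder paths `"/Field A/Term X"` and `"/Field B/Topic 1/Term X"`, `parents_of("Term X", ...)` returns both `Field A` and `Topic 1`, which is the property that distinguishes a polyhierarchy from a plain tree.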

PLOS-chem-rxns

The fabulous complexity of PLOS’s classification of research articles hasn’t really been surfaced on the PLOS website. Although PLOS ONE has a subject area browser as part of its search interface, we found this difficult to navigate as part of an exploratory search, and started thinking of ways to add context to this kind of experience. We decided to create an interactive tree, using D3.js, that illuminates the larger structure of the relationships between research areas. As you browse the tree, graphs in the dashboard show how many articles have been published each year within the current field, and which other major disciplines those articles are also associated with. The word cloud shows which specific subject terms (the leaves of the tree) are most prevalent among articles in the selected field, and clicking on a word in the cloud takes you directly to a query of that term on the PLOS website. Early on in the making of this tool, we were inspired by a word cloud example of a specific query, and built PLOS Cloud Explorer around this notion of using a dynamic word cloud, filtered on interactive charts that provide context, to reach real documents of interest.
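The aggregation behind the dashboard can be sketched in a few lines of Python. This is not the PLOS Cloud Explorer source (that is on GitHub); the record layout, the placeholder field names, and the `dashboard` function are assumptions chosen to show the three counts described above: articles per year, co-occurring major disciplines, and leaf-term frequencies for the word cloud.

```python
from collections import Counter

# Hypothetical article records: a publication year plus slash-delimited subject paths.
ARTICLES = [
    {"year": 2012, "subjects": ["/Field A/Topic 1/Term X", "/Field B/Term Y"]},
    {"year": 2013, "subjects": ["/Field A/Topic 1/Term X"]},
    {"year": 2013, "subjects": ["/Field A/Topic 2/Term Z", "/Field C/Term W"]},
    {"year": 2014, "subjects": ["/Field B/Term Y"]},
]


def split_path(path):
    return [part for part in path.split("/") if part]


def dashboard(articles, field):
    """Summarize the articles that fall anywhere under `field`.

    Returns three Counters: articles per year, leaf-term frequencies
    (word cloud weights), and top-level disciplines of the selected articles.
    Each article contributes at most once to any given key.
    """
    per_year = Counter()
    leaf_terms = Counter()
    disciplines = Counter()
    for article in articles:
        paths = [p for p in map(split_path, article["subjects"]) if p]
        in_field = [p for p in paths if field in p]
        if not in_field:
            continue
        per_year[article["year"]] += 1
        for leaf in {p[-1] for p in in_field}:
            leaf_terms[leaf] += 1
        for top in {p[0] for p in paths}:
            disciplines[top] += 1
    return per_year, leaf_terms, disciplines
```

Selecting `"Field A"` in this toy data yields three articles, two of which also carry a term from another top-level discipline; that overlap is what the histogram in the dashboard surfaces.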

PLOS Cloud Explorer reveals the interconnectedness of research areas that are represented and developed in PLOS journals. The word cloud and the histogram visualizations show that many fields of study are highly interconnected: PLOS articles tend to be associated with interdisciplinary research, such as combining Medicine and Physical Sciences. You can also observe and explore trends in the number of articles over time for a given field (using the time series graphs), and also trends in the collaborations among research areas (using the histogram and word cloud). We hope you enjoy exploring!

What you see in PLOS Cloud Explorer is based on data about all 126,718 articles published in PLOS journals up until July 21, 2014, and represents a snapshot of the PLOS Thesaurus in its current state of evolution. You can find our source code and documentation on GitHub.

About the authors: Anna Swigart and Colin Gerber are graduate students in the UC Berkeley School of Information. Akos Kokai is a graduate student in the Department of Environmental Science, Policy, and Management at UC Berkeley.


Making Metrics Count – ALM Article Feature Series

In this age when we are all obsessed by counting, should we be celebrating yet more sets of metrics? A quip often attributed to Albert Einstein runs: “Not everything that can be counted counts, and not everything that counts can be counted.” While a well-worn sentiment, it does bear some thought. At PLOS, we believe we should celebrate: not journal-level metrics, but the diverse metrics of individual articles and the stories associated with them.

The ALM Article Feature series is an ongoing, regularly published set of posts highlighting articles that have caught the eye of the editorial teams across the PLOS journals. We examine their notable metrics and tell some of the stories behind the articles. We don’t have any fixed criteria for articles in this series, but rather have asked the journal teams to highlight articles that had meaningful metrics for their journal. As you’d expect, the selection will be an eclectic mix. This series will not only highlight individual articles but will also celebrate what is, in the end, a core editorial function of journals: curating content that matters to their, and we hope wider, audiences.

Currently, the ALM Article Feature Series includes the following posts:

  1. From One to One Million Article Views on PLOS Medicine’s Why Most Published Research Findings are False by John Ioannidis
  2. You Just Read my Mind… on PLOS Biology’s Reconstructing Speech from Human Auditory Cortex by Brian Pasley, et al.
  3. “Low T” and Prescription Testosterone: Public Viewing of the Science Does Matter on PLOS ONE’s Increased Risk of Non-fatal Myocardial Infarction following Testosterone Therapy Prescription in Men by William Finkle, et al.
  4. Reflections on feces and its synonyms on PLOS NTD’s An In-Depth Analysis of a Piece of Shit: Distribution of Schistosoma mansoni and Hookworm Eggs in Human Stool by Stefanie J. Krauth, et al.
  5. How Much of Your Genome is Functional? on PLOS Genetics’ 8.2% of the Human Genome Is Constrained: Variation in Rates of Turnover across Functional Element Classes in the Human Lineage by Chris Rands, et al.

We will continue to update this list with the latest additions. Also, you can follow it on twitter at #celebratingalms to discover the newest posts, join the community’s conversation about the articles chosen, and tell us the PLOS articles you want highlighted next. With article-level metrics, we look forward to sharing the breadth of fascinating ways in which PLOS science has impacted scholarly research and the broader world beyond.


R Markdown for Scholarly Communication?

Open question: how much longer will researchers be limited to submitting manuscripts in formats like Word or LaTeX to STM journals?

I caught a glimpse of a potential alternative at a recent Software Carpentry Bootcamp hosted by UC Davis.

Software Carpentry is a volunteer organization that teaches basic computing skills to researchers. Instruction typically takes the form of 2 to 3 day hands-on workshops aimed at helping researchers, typically graduate students, work more effectively with data.

The event was a blast. Topics vary from bootcamp to bootcamp; the session I attended covered the following:

  • Basic UNIX Shell Functions
  • Data Munging
  • Basic Version Control using Git
  • Data Sharing with GitHub
  • Intro to Programming in R for data manipulation and visualization
  • Markdown and R Markdown for creating text documents that run R code
  • Intro to ggplot2, a plotting system for R that makes creating beautiful plots easy

Thanks in large part to Software Carpentry’s patient and capable volunteers, after two full days I was able to write short scripts, post them to GitHub, and create simple visualizations using sample data.

GDP per capita from 1950 through 2010

After the first day of instruction (and over drinks) a discussion popped up over the relative merits of R Markdown and IPython Notebook – two potential alternatives to Word and LaTeX. What makes the formats compelling is their ability to leverage R or Python code to create dynamic documents.

Both R Markdown documents and IPython Notebooks can easily be converted to HTML, PDF and even Word – formats that scholarly publishers are generally familiar with. While these conversions make it easy for researchers to produce traditional research outputs, I wonder if there is a better way for publishers to leverage these formats. It also makes me curious to hear from the research community directly. How can publishers leverage formats like R Markdown and IPython Notebook to facilitate enhanced scholarly communication? Any and all suggestions are welcome.


Why I am a product manager at PLOS: Linking up value across the research process

 Why am I a product manager at PLOS?

I am a product manager [1] at PLOS because of the thrilling opportunity to re-imagine and create better conditions for researchers to do and share science. Those far more insightful and eloquent have enumerated the existing system’s thousand points of failure to support the research enterprise.  What follows is a set of thoughts that originate from traditions borrowed outside of the everyday practice of science and, more broadly, scholarly communications.

While the research enterprise is comprised of vastly diverse activities, the overall set can be generalized at the highest level with a simple rubric (Four C’s):

Create: do something that inspires or catalyzes a new idea
Capture: externalize and preserve the idea
Communicate: make the artifact extensible and disseminate it, making it accessible for others
Credit: integrate work into existing credit systems

Science easily fits into this model. A researcher will design a project and conduct experiments (Create), analyze and synthesize the results (Capture).  She will disseminate the narrative by publishing a research article (Communicate) and seek credit for work done (Credit).  In the traditional rendering of the research cycle, you move from one phase to another in an orderly and straightforward fashion.  The sequence seems to make sense.  (What is there to “communicate” before something is “captured,” if not scientific misconduct?)

But does the process of scientific discovery actually work this way?

Not really. Researchers continually discuss results so as to get feedback that is folded into their next set of experiments or analyses. They do it informally through lab meetings, departmental gatherings, digital platforms (e.g., Mendeley, Twitter, blogs), and any number of unplanned “water cooler” encounters. They also do it formally through conference presentations, seminar talks, research articles, etc. These recurring encounters are an intrinsic part of the process whereby scientific ideas are propagated, confirmed or refuted, and built upon by the community. Science is very much a complex, social enterprise in this sense. But this means that the four phases described above are more accurately depicted as overlapping modes of doing science. Here, we no longer have a simple sequence of discrete moments. Rather, we have a dynamic set of live events that may start at an identifiable origin, but then proceed to extend, fork into multiple branches, intertwine, double over, etc.

The model in which research moves by sequence through the research lifecycle seems to me a paltry simplification of the larger process.  It might get frenetic and messy with multiple projects underway with different collaborator networks, each at different points of development, moving at different speeds.  But with so many simultaneous movements at play inside and outside the lab, no wonder doing science is so exciting!

I will elaborate on the first three C’s in subsequent posts, but focus here on the last one. The fourth mode, Credit, stands apart from the others by its very nature and plays a more significant role in forming the environment where research acts play out. By and large, it is the mode single-handedly identified with the production of value. The activity is embedded within formal institutions such as evaluation committees at institutions and funding organizations. Informal indicators may attest to a research output as a product of one’s work and thus accrue favor amongst colleagues. But credit is formalized by virtue of the outcomes available once credit is assigned. [2] In the strictest sense, it counts insofar as it advances one’s reputation within the established systems that endow benefit to those awarded. Credit forms the basis of an incentive structure, which then shapes the activities of those tied to it and, in this sense, is considered the least malleable. It is considered the intended (explicit or implicit) terminus for every work unit started (in relation to the expected outcome of the work, not the personal motivations which drive the researcher to do it). The reward itself is presumed to account for the efforts entailed. In the existing incentive structure for scholars, this still means publication in a high-impact journal or of a highly cited article.

Not surprisingly, this way of thinking plays a large role in shaping and reinforcing the linear view of how science works. If there is no credit until the condition of citation is satisfied, we have no other narrative possible than the straight progression of the discrete phases depicted. If the only work product that really counts is a research paper, published 5 years after the project commenced, which must wait another few years before citations accumulate, then we have a seriously protracted time lag before any evidence of contribution is formally recognized. In this environment, why would we be surprised to discover that the entire practice of research has been reduced to a simple, linear process? And this notion merely perpetuates the misconception that Credit, supposedly at the end of the single-link chain, can sufficiently reflect (and vindicate) the work which preceded. Science is too fertile an enterprise for this to suffice. Researchers deserve better. Fortunately, the prevailing notions of value and credit which underlie the incentive structure are beginning to change.

My sense – one shared by my colleagues – is that value is created throughout the lifecycle, not just when Credit is awarded. If we share in a way that can be tracked, and establish measurements for such outputs, we capture value across actual practices, not just recorded activities. If we capture value at each stage along the way, we can formally assess and recognize far more work products, not just surfaced pieces contrived to fit the final narrative. If we think of the Four C’s as separate strands with possible outcomes which might be identified, compared, and measured, we could glimpse a far richer view of how scientific ideas impact each other, extending far beyond the anemic one recognized today. If we take a broader view of the production of value and create formal mechanisms for its distribution and allocation, we create an economy that supports the myriad ways in which researchers are doing science and engaging with others’ work. And we create a holistic environment more conducive to the advancement of research and far more supportive of all involved in the enterprise.

I am a product manager because I think all this is possible with social change and structural realignments (of technology and policy) across the research ecosystem.  I am a product manager because it is possible to transform how we access, track, share, discuss, discover, interrogate, and evaluate research findings — nothing short of how we do science.

 

Footnotes:
1. I respectfully put the oft-asked question of “what a product manager does” aside for another occasion. If hows come also before whys, we would never get started for all the contradictory biographical minutiae to satisfy both the oppositional poles of narrative and truth.
2. Here, the point pertains to role, not person. A colleague may also be a formal evaluator of one’s work.


ALM, the Research of Research – Recent Developments

Article-Level Metrics (ALM) capture a broad spectrum of activity on research articles, offering a window into how researchers engage with scientific findings. We are beginning to understand what these data mean as ALM matures, the momentum continues to build, and the broader scholarly community joins the conversation. One important aspect is scholarly research on the metrics themselves. The June 23 altmetrics14 conference, part of the ACM Web Science Conference, is an important venue to present and discuss this work. PLOS has participated in three projects to be presented (abstracts).

Brainstorming community needs for standards & best practices related to altmetrics

Todd Carpenter and Nettie Lagace from the National Information Standards Organization (NISO), together with one of the authors of this post (MF), will present work on Brainstorming community needs for standards & best practices related to altmetrics. This work was done as part of the first phase of the Sloan-funded NISO Alternative Assessment Metrics Project, and is summarized in a white paper that went up for public comment this Monday. The white paper lays out 25 potential action items for further work by NISO and the community. These action items were grouped in broad areas such as terminology, use cases, data quality, aggregation, and context. The white paper was written by Martin Fenner (who chairs the NISO Alternative Assessment Metrics Project Steering Group), Todd Carpenter and Nettie Lagace, but captures the views expressed by the community via three in-person meetings, 30 personal interviews and many discussions in the NISO Alternative Assessment Metrics Steering Group. The document is open for public comment through July 18, 2014, and we encourage everyone to contribute their thoughts.

How consistent are altmetrics providers?

Zohreh Zahedi and Rodrigo Costas from the Centre for Science and Technology Studies (CWTS) Leiden, together with one of the authors of this post (MF), investigated the consistency of ALM data across different aggregators in Analysis of ALM data across multiple aggregators: How consistent are altmetrics providers? Study of 1000 PLOS ONE publications using the PLOS ALM, Mendeley and Altmetric.com APIs. Building on a similar study by Scott Chamberlain (2013), they found rampant discrepancies between the counts harvested by Altmetric.com, Mendeley, and PLOS across a number of sources, with a focus on Mendeley, Facebook and Twitter. The results pose a serious challenge to data validity that needs to be addressed as soon as possible, and the conference is a good venue to start this discussion. In this light, the authors call for greater transparency of data collection and provenance as well as convergence in the methods of collection.
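A consistency check of this kind can be sketched as follows. This is not the method used in the study; the provider labels, the DOIs, the counts, and both function names are invented purely to show the shape of the comparison: for each article reported by every provider, measure the spread between the highest and lowest count for the same source.

```python
# Invented counts for one source (say, tweets) as reported by three
# hypothetical providers; none of these numbers come from the study.
COUNTS_BY_PROVIDER = {
    "provider_a": {"10.1371/example.0001": 10, "10.1371/example.0002": 5},
    "provider_b": {"10.1371/example.0001": 10, "10.1371/example.0002": 8},
    "provider_c": {"10.1371/example.0001": 10, "10.1371/example.0002": 2,
                   "10.1371/example.0003": 4},
}


def count_spread(counts_by_provider):
    """Map each article seen by every provider to (max count - min count)."""
    providers = sorted(counts_by_provider)
    shared = set.intersection(*(set(counts_by_provider[p]) for p in providers))
    return {
        doi: max(counts_by_provider[p][doi] for p in providers)
        - min(counts_by_provider[p][doi] for p in providers)
        for doi in sorted(shared)
    }


def agreement_rate(spread):
    """Fraction of shared articles on which all providers report the same count."""
    if not spread:
        return 0.0
    return sum(1 for value in spread.values() if value == 0) / len(spread)
```

Articles missing from any one provider are left out of the spread entirely, which is itself a design choice: coverage gaps and count disagreements are different kinds of inconsistency and are probably best reported separately.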

Wikipedia references across PLOS publications

The third project, from the authors of this post, explores Wikipedia references across PLOS publications. Although the importance of Wikipedia within the scholarly community is growing (Wikipedia is among the top 10 referrers to scholarly articles), little is known about the referencing behaviors (i.e., how, what types, etc.). The preliminary view of the data shows that coverage is moderate (on par with science blogs) and international (with only half the mentions of PLOS articles appearing in the English Wikipedia). The pattern of references is distinct from popular social networks and other ALM (including usage and citations). Correlation was found instead between the number of Wikipedia references and the number of active editors for each Wikipedia.

 


Research on the move – new mobile sites for PLOS journals

PLOS is pleased to announce new mobile websites for its suite of influential journals. The mobile journal experience has been optimized for easy browsing on small screens, with a simplified interface that highlights popular and newsworthy content. It features:

  • Prominent article titles and abstracts for quick browsing
  • Condensed article sections to make it easy to get to the right content
  • Flexible display options for search and browse results
  • Streamlined figure views with the option to open and zoom into full resolution figures

PLOS Mobile

As is sometimes necessary, this initial release does not include every bit of functionality you currently see on the full site. For example, while you can read article comments from your phone, posting new comments is not yet supported. We’ve worked hard to include those features researchers are most likely to utilize from a phone, and for content or functionality not yet optimized for phones we’ve provided links to corresponding pages on the full site.

All the articles that PLOS publishes have always been immediately free to access online, distribute and reuse; now they are available to readers wherever they roam. Let us know how you conduct research on the go!


Thesaurus evolution – a case study in “Synthetic biology”

Science does not stand still and neither does the PLOS thesaurus. With more than 10,700 Subject Area terms, we use the thesaurus to index our articles and provide useful links to related papers, enhanced search functions, and, for PLOS ONE (more than 90 articles published every day!), customizable Subject Area-based email alerts and Subject Area landing pages.

Sometimes we decide to renovate a sector of the thesaurus to better reflect the make-up of the PLOS corpus. For example, we’ve long had a Subject Area term for “Synthetic biology,” sitting beneath “Biology and life sciences.” We even have a healthy Synthetic Biology Collection. However, the Subject Area term “Synthetic biology” was being applied to only a handful of articles despite the fact that many more PLOS articles were about synthetic biology and should ideally have been indexed accordingly. Why was this?

Part of the explanation is that ‘synthetic biology’ is not a phrase that is frequently used in natural language. So whereas an article about hypertension may use the word ‘hypertension’ 26 times within the text, an article about synthetic biology might state ‘synthetic biology’ rarely, if at all. This poses a challenge to the Machine Aided Indexing process which assigns Subject Areas to articles based on the frequency of matches in the text.

The way around this is to introduce a level of abstraction to the rulebase that governs the Machine Aided Indexing. The base rules are very literal: “if I see ‘synthetic biology’ in the text I’m going to use the ‘Synthetic biology’ Subject Area term.” But there are additional words and phrases that are diagnostic of synthetic biology topics, such as “biobricks” and “Registry of Standard Biological Parts.” Adding rules for these terms – for example “if I see ‘Registry of Standard Biological Parts’ in the text I’m going to use ‘Synthetic biology’” – increases the frequency of indexing to “Synthetic biology” and thus the retrieval of relevant articles in our searches.
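As a rough illustration of the rulebase idea above, here is a minimal Python sketch. The rule table, the `index_article` function, and the `min_matches` threshold are all hypothetical, invented for this example; they are not the actual PLOS Machine Aided Indexing rules. The sketch only shows how adding diagnostic phrases lets an article be indexed even when the literal phrase never appears.

```python
import re
from collections import Counter

# Hypothetical rulebase: phrase seen in the text -> Subject Area term to assign.
# The first rule is the literal one; the next two are "diagnostic" phrases.
RULES = {
    "synthetic biology": "Synthetic biology",
    "biobricks": "Synthetic biology",
    "registry of standard biological parts": "Synthetic biology",
    "hypertension": "Hypertension",
}


def index_article(text, rules=RULES, min_matches=2):
    """Return Subject Area terms whose rules match at least min_matches times."""
    lowered = text.lower()
    counts = Counter()
    for phrase, term in rules.items():
        # Count whole-phrase matches, ignoring case.
        pattern = r"\b" + re.escape(phrase) + r"\b"
        counts[term] += len(re.findall(pattern, lowered))
    return sorted(term for term, n in counts.items() if n >= min_matches)


# An article that is about synthetic biology but never uses the phrase.
sample = (
    "We assembled the circuit from BioBricks obtained from the "
    "Registry of Standard Biological Parts."
)
literal_only = {"synthetic biology": "Synthetic biology"}

print(index_article(sample, rules=literal_only))  # literal rule alone finds nothing
print(index_article(sample))                      # diagnostic rules assign the term
```

With only the literal rule the sample gets no Subject Area; with the two diagnostic rules added it is indexed to "Synthetic biology".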

A second factor is to do with the hierarchical structure of the thesaurus – an especially important factor given that our search functionality is designed to utilize this hierarchy. For example, a Subject search for “Vascular medicine,” beneath which Hypertension sits, retrieves articles indexed specifically with Hypertension, even if they have not been explicitly tagged with “Vascular medicine.” In earlier versions of the PLOS thesaurus “Synthetic biology” had no narrower terms, and this was doing it no favours with regard to how useful it was for retrieving relevant articles. We therefore reviewed essays about synthetic biology, scope descriptions from relevant institutional and departmental web sites, and proceedings from synthetic biology conferences, all in light of the content of our articles, and introduced new, narrower terms to sit beneath our existing “Synthetic biology” where that made sense. So we went from having the single “Synthetic biology” term to the new structure of 30 terms in one renovation. Here is what we have now:

[Figure: the new “Synthetic biology” sector of the PLOS thesaurus]
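The hierarchy-aware Subject search described above can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the `NARROWER` and `ARTICLE_TERMS` tables, the article ids, and the `expand` and `subject_search` helpers are hypothetical and do not reflect how the PLOS search is actually implemented.

```python
# Hypothetical fragment of a thesaurus: broader term -> narrower terms.
NARROWER = {
    "Vascular medicine": ["Hypertension"],
}

# Hypothetical index: article id -> Subject Area terms explicitly assigned.
ARTICLE_TERMS = {
    "article-1": {"Hypertension"},
    "article-2": {"Vascular medicine"},
    "article-3": {"Synthetic biology"},
}


def expand(term, narrower=NARROWER):
    """Return the term plus every term beneath it in the hierarchy."""
    found = {term}
    stack = [term]
    while stack:
        for child in narrower.get(stack.pop(), []):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def subject_search(term, article_terms=ARTICLE_TERMS):
    """Return ids of articles indexed with the term or any narrower term."""
    wanted = expand(term)
    return sorted(a for a, terms in article_terms.items() if terms & wanted)


# article-1 is tagged only with "Hypertension" but is still retrieved.
print(subject_search("Vascular medicine"))
```

A term with no narrower terms, as “Synthetic biology” once was, expands only to itself, so a search on it can only retrieve articles that were explicitly tagged with it.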

Much of the evolution of the PLOS thesaurus is gradual, as for example when we realised that “puma” can be used as an abbreviation for “p53 upregulated modulator of apoptosis” as well as a kind of big cat, or learned that asteroids can be starfish. Dealing with these indexing missteps requires small-scale changes to specific rules. But sometimes the change needs to be more radical. Our new “Synthetic biology” sector was implemented in Ambra 2.9.12 (released March 26th, 2014). Where previously only a handful of articles were indexed with “Synthetic biology,” now a Subject search across all PLOS journals retrieves over 400 “Synthetic biology” articles – much more fitting for this important and developing field.

For more about the work PLOS is doing with Synthetic biology see “An Invitation to Contribute to the Second Life of the Synthetic Biology Collection.”


Getting to CrossMark

This week, we launched our participation in CrossRef’s CrossMark program. It’s an exciting step for PLOS, and getting there was a learning experience we hope you’ll find interesting.

The Program

CrossMark is a service of CrossRef that is gaining traction among scholarly publishers, with more than 30 publishers to date, and nearly half a million scholarly documents. The purpose of the CrossMark logo appearing on article pages is to give researchers a consistent way to know the status of any article, from any participating publisher. When someone clicks the CrossMark logo from either the online version of the article, or the PDF, they see a popup like this one. It indicates that either the article is up to date, or that updates are available.

[Figure: CrossMark popup]

It’s clear that the CrossMark service is valuable for keeping content current, which assists the integrity and completeness of the scholarly record. It’s also worth highlighting that we’d like our initial CrossMark participation to be the first step toward additional exciting uses in the future. We could extend our CrossMark usage to…

  • support article versioning
  • display FundRef info
  • display info about our peer review process
  • link to related data
  • experiment with threaded publications
  • …and more

The Journey

Getting from “we want to participate in CrossMark” to “the CrossMark logo is live” was a process that took time. Seven months, if you want to know the truth! Don’t let that scare you if you’re a publisher interested in kicking off your own CrossMark participation. The main reason it took us 7 months is that we bundled the CrossMark initiative into a larger corrections handling overhaul, which included a massive data migration effort. Anyone who has been through one of these will tell you the same thing: data migrations are not for the faint of heart. And in retrospect, this bundling of initiatives was a decidedly un-Agile way to go.

So the overall initiative included overhauling our corrections handling process, which meant switching systems for inputting and publishing correction notices. This new process required system development, which in turn required documentation, training, and hands-on practice for a pretty big chunk of our staff. And then there was the data migration effort, which took a long time on its own. (None of this part of the initiative included our CrossMark program implementation.)

Then, we tackled the CrossMark piece, which was fairly straightforward in the scheme of the overall project. We added the CrossMark logo to articles: the CrossMark logo now appears on every PLOS article page on our journal sites, and on the downloadable PDFs for all newly-published articles going forward. And we updated our deposit toolchain to include the CrossMark metadata. But there were a few complications, because of the aforementioned data migration.

First, we chose to create a back-deposit of CrossMark data for our entire corpus. Over ten years of publishing equals somewhere around 110,000 articles, as well as over 3,000 migrated corrections. Naturally, things change over time. How does a person get a grasp of the minor differences between article XML generated over ten years? You can look at a few files from various periods in each year, but that’s just barely scratching the surface. You still have no clear idea of what might actually be different. A metaphorical needle in a gigantic digital haystack. So we wrote some XSL transforms, threw the whole lot at ’em, and temporarily kicked some cans down the road. We figured we’d let CrossRef’s submission results tell us if something was wrong. After sending off 110,000+ XML files (with a slight chuckle) and letting the script run for about twelve hours, we had a pretty decent success rate. After some slight tweaking, the rest were good to go as well.
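The "throw the whole lot at it and let the results tell us what's wrong" approach above can be sketched as a batch loop that records failures instead of stopping at the first one. This is only an illustration under stated assumptions: it uses Python's standard-library `xml.etree.ElementTree` rather than XSL transforms, the element names (`<article>`, `<doi>`, `<deposit>`) are hypothetical and far simpler than real article XML or the CrossRef deposit schema, and the `build_deposit` and `run_batch` helpers and the example DOIs are invented.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def build_deposit(article_xml):
    """Turn one article's XML into a tiny deposit record.

    The element names used here are hypothetical placeholders,
    not the real CrossRef deposit schema.
    """
    root = ET.fromstring(article_xml)  # raises ParseError on malformed XML
    doi = (root.findtext("doi") or "").strip()
    if not doi:
        raise ValueError("missing DOI")
    deposit = ET.Element("deposit")
    ET.SubElement(deposit, "doi").text = doi
    return ET.tostring(deposit, encoding="unicode")


def run_batch(corpus):
    """Process every article; collect failures instead of stopping."""
    deposits, failures = {}, {}
    for name, xml_text in corpus.items():
        try:
            deposits[name] = build_deposit(xml_text)
        except (ET.ParseError, ValueError) as err:
            failures[name] = f"{type(err).__name__}: {err}"
    return deposits, failures


# Made-up miniature corpus standing in for years of slightly different XML.
corpus = {
    "a.xml": "<article><doi>10.9999/example.1</doi></article>",
    "b.xml": "<article><title>No DOI here</title></article>",
    "c.xml": "<article><doi>10.9999/example.3</doi>",  # unclosed tag
}
deposits, failures = run_batch(corpus)
print(f"{len(deposits)} ok, {len(failures)} failed")

# Group failures by error type to see which tweaks are needed.
print(Counter(msg.split(":")[0] for msg in failures.values()))
```

Grouping the failures by type is what makes the "slight tweaking" pass manageable: each group usually points at one variation in the older XML rather than at many unrelated problems.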

Dealing with back-deposits for our migrated corrections was a bit dirtier, and required a little more clean-up. First they had to be re-formatted simply for display on our website in their new form, and then mined for the needed CrossMark deposit information before sending the XML off for deposit (thanks for that .jar file, CrossRef!). The vast majority of the work was accomplished with a small toolset, really: some .jar files (provided by CrossRef) and some XSLT files did most of the heavy lifting, though how you compile and prepare your corpus could vary from ours.

And now a few words about article PDFs for our CrossMark program. As we mentioned, the CrossMark logo appears on PDFs for articles we publish going forward. We chose to back-update the online versions of our articles to include full CrossMark functionality, but we decided not to update the 110,000+ downloadable PDFs for previously-published articles. It was a decision based more on our unique volume situation than on the process of updating the PDFs. The marking and stamping process is simple, once you have it set up. But we decided that the testing and remediation challenges associated with replacing 110,000+ active PDFs were too much to take on at this time. CrossRef leaves it up to the publisher whether to fully update the corpus or to start participating in CrossMark from a given date onward. We took a bit of a hybrid approach because we chose to add CrossMark functionality to all HTML articles, but only to PDFs for newly-published articles.

So there you have it! Overall, getting to CrossMark turned out to be a bit more of a journey than we anticipated, but we have arrived, and we’re glad we took the trip. We hope this post is useful to any of you who may be considering kicking off a CrossMark participation program of your own.
