By Julie West, publications specialist
Office of Research and Economic Development
August 2010
I’m talking about digital scholarship with Andrew Torget, assistant professor in the Department of History. Torget received a $50,000 start-up grant in August from the Office of Digital Humanities, a division of the National Endowment for the Humanities, to develop search models combining text-mining and geospatial mapping that can effectively help scholars research large-scale collections of digitized historical newspapers. The project will use digitized texts from the NEH-funded Chronicling America, the national digital newspaper archive project.
Julie: Andrew, good morning. Your project is titled, Mapping Historical Texts: Combining Text-mining & Geo-visualization to Unlock the Research Potential of Historical Newspapers …
Andrew: (interjects) The title just rolls off the tongue, doesn’t it?
Julie: (laughs) Indeed! So why don’t we start with you giving a summary introduction to your project.
Andrew: The project is trying to deal with a new reality that historians in particular, but humanists in general, are dealing with, which is: “What do you do with too much stuff in the digital age?” It used to be that when someone like myself, a historian, was doing research, you’d go out and literally find everything you could on the subject … because the bottom line was access. That’s been completely flipped on its head with projects like UNT’s Digital Projects Unit and the digitization of early Texas newspapers — where you’ve got 200,000 pages of newspapers from the NEH’s Chronicling America project. And so now the question is: How do you deal with scale on that level? With 200,000 pages, there’s far too much information to review by hand — you simply cannot turn every single page. So the challenge is: How can you deal with this kind of scale to find meaningful patterns? The whole idea of this project is to take a massive amount of historical data and try a couple of different techniques to see what new insights we can get out of it.
Julie: Talk about text-mining. What is it? And how does it differ from geo-visualization?
Andrew: Text mining deals with identifying patterns in language, and that’s a new methodology for history. Rada Mihalcea (UNT associate professor) in Computer Science and Engineering is working with us to develop algorithms specifically designed to pull out meaningful language patterns in the data that’s in historical newspapers. And no one’s ever dealt with this scale of historical information before. So, for example, when you’re writing algorithms for this, we might take a set of historical newspapers from the Civil War era and say – “go find every instance of the word ‘Lincoln’ in these historical texts.” And when you’re done with that, you’d then ask the computer to find all other words that are within 5 words of Lincoln. And then measure (X), and give me those results.
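The windowed search Torget describes — find every ‘Lincoln’, then collect the words within five words of it — can be sketched in a few lines of Python. Everything here (the function name, the toy sentence, the window size) is illustrative, not part of the project’s actual code:

```python
import re
from collections import Counter

def context_words(text, target, window=5):
    """Count the words that appear within `window` words of each occurrence of `target`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            # Count every neighboring token except the target occurrence itself.
            counts.update(t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i)
    return counts

# Toy text standing in for a page of digitized newspaper print.
page = "President Lincoln spoke of union. Critics of Lincoln answered in the press."
print(context_words(page, "lincoln").most_common(3))
```

Ranking the resulting counts is the “measure (X)” step: the most frequent neighbors sketch the constellation of words a name travels with.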
Julie: So you can keep fine-honing the algorithm task itself. You add a variable. You subtract a variable.
Andrew: That’s the idea. It’s a process. You write one, and you may get terrible results. So you go back and fix it. And after doing this so many times, you’ll have lots of results, but hopefully you’ll also have a set of algorithms that will be very attuned to asking questions of these newspapers, which can then be adapted by other folks.
So we’re going to pull out these particular language patterns, but even then there’s going to be a lot of information in there. So we’re also using another technique, a visualization technique, which we call geo-visualization. That basically means digital mapping: we’re going to be mapping the language patterns that come out of these newspapers. We’re going to be mapping ideas, conversations, and movements of concepts as they move across the landscape, as they go from one newspaper to another and get printed across Texas.
Visualization has been around a long time. Technique wise, that’s not a new approach. Part of what we’re doing that’s innovative is to merge those two techniques together in a way that hasn’t been done before. Because with text mining, we’re going to find patterns that are far larger and far more complex than I can sit down and make sense of from just looking at massive collections of words. So instead we’re going to use maps of the data to visualize those results. In that sense, these are two different techniques for making sense of large collections of data, and we’ll put them together to hopefully open up new possibilities for understanding historical sources. That’s the concept — to develop an x-ray machine, essentially, for historians to use and look more deeply into these sources to find what’s useful so we can do better research.
Julie: Are you working with specific models, or specific subjects that will help you refine your research? For example, you’re interested in the South — emancipation, slavery — will you use these subjects to develop the models for your research?
Andrew: In research, everything starts with a question. What are we trying to answer? For this research project, we have partners with Stanford working with us as well, and everybody has their own set of research questions that drive how we put this stuff together. So we’re trying to find out, for example, how did ideas about cotton change over time? And were the newspaper conversations about cotton different in urban and rural places, and if so, what does that mean? Our questions are centered on Texas because that’s the collection that we have here at UNT.
Julie: So someone else may not be interested in Texas, but they can use your methods to find what they need. You’re creating a broader template that future historians can use.
Andrew: They probably won’t be interested in Texas (laughs). The location is the jumping-off point. We have two goals. We’re trying to answer our research questions, and in the process we’re trying to develop a methodology that can be more widely applied. The Bill Lane Center for the Study of the American West at Stanford, for example, wants to know more about the development of the West, post-1900. I’m interested more in 19th-century Texas development. We use these questions as our focus to develop algorithms that can answer them and expose the patterns we’ve discovered.
Julie: How do you expose the patterns?
Andrew: (laughs) That’s the essence of what we’re going to try to figure out. We’re going to chop up all the words in our historical newspapers. We have 200,000-plus pages to deal with. That’s millions of words. Computer science — natural language processing, the field that Rada is so adept in — has been tackling these kinds of issues for a long time, i.e., how do we find meaningful patterns in large collections of information? This is what Google is interested in, so they can show better search results on the web; for us, it’s these newspapers. We’re looking for relationships between words.
So if you’re interested in how Abraham Lincoln was represented in American Civil War newspapers, you would chop up the newspapers and look for every instance of the word ‘Lincoln’ and you’d look at the constellation of words around ‘Lincoln.’ What is the word ‘Lincoln’ most likely to be associated with? What adjectives did they use? How did those change over time? How likely is it that the word ‘Lincoln’ would come up near the word ‘union’ or ‘slavery’ or these sorts of things? And then you’d try to identify what patterns emerge from that, which we don’t know yet.
If Lincoln shows up 90% of the time two words away from the word ‘slavery,’ it gives us a better sense of the way that Lincoln was being portrayed in newspapers. But it doesn’t explain what that means yet. The task of the historian is to make sense of the patterns we find. We’re trying to find associations throughout the newspapers, and then map those.
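A figure like that hypothetical 90% is just a co-occurrence rate: the share of ‘Lincoln’ mentions that have ‘slavery’ within some small window. A minimal Python sketch, with an invented sample string standing in for real newspaper text:

```python
import re

def cooccurrence_rate(text, target, neighbor, window=2):
    """Fraction of `target` occurrences with `neighbor` within `window` words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = [i for i, t in enumerate(tokens) if t == target]
    if not hits:
        return 0.0
    near = sum(
        1 for i in hits
        if neighbor in tokens[max(0, i - window):i + window + 1]
    )
    return near / len(hits)

# Invented sample: one 'lincoln' mention near 'slavery', one not.
sample = "Lincoln spoke today. Lincoln opposes slavery here."
print(cooccurrence_rate(sample, "lincoln", "slavery"))  # → 0.5
```

Computed per newspaper and per year, a rate like this becomes the kind of pattern that can then be placed on a map and watched over time.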
Julie: You can use these tools to refine the way you search, as opposed to vice versa, where you bring a set of words and assumptions to the table. Instead the computer shows you the word patterns …
Andrew: … and you refine from there. There are lots of ways to look at these patterns and the word associations. For example, the frequency of a word pairing can tell me how strong or weak the association is. And all the newspapers have geography embedded within them, too. These are conversations happening across the landscape at the same time. So you have those patterns spread across the landscape and over time. That’s a lot of moving parts. With the geo-mapping, we’re going to be able to see how millions of words moved simultaneously. And I have no idea what we’re going to find at that point. But those are the kinds of patterns we’re going to try to expose: language patterns, and then how these relate geographically.
Julie: The notion of ideas moving over time and space is exciting. It seems like this layer of information could reveal a lot about culture. I would think the field of etymology could make good use of these tools. You could find out how the word ‘wild,’ for example, was used in sentences two hundred years ago compared with today … the results might reveal how people viewed themselves in relation to nature and how the collective cultural experience changes from one era to the next.
Andrew: Yes, exactly. And constructs do change; these things change over time. Historians deal with change over time and space. But it’s been hard to deal with such a large amount of information moving over a large area for a long period of time. It’s so complex. So we usually focus on one thing at a time and then compare 1890 to 1930, for example, and then fill in the gaps in-between. Now we’re trying to use the machine, essentially, to show us what that change is between those two points, and we can be much more precise about what we’re doing and what we’re looking at.
Julie: But it does seem that the result could be unexpected. That what you thought was the target you were tracking becomes this other thing … you discover new sets of data that tell you something else about the subject.
Andrew: I guarantee you that that’s going to happen. And I say this from experience. We had a project that we did at the University of Richmond called Voting America. This project deals with some of these same issues — the mapping side, not the text-mining side. We took every presidential election from 1840 up until 2008 and we mapped all the presidential elections. The research question we wanted to answer was: “How did we get to be ‘red’ and ‘blue’ America?” We talk about politics in very geographic terms. If you know where someone lives you can almost guess their politics. (He points to the computer screen.) This is a baseline map of the way the Electoral College system looks at elections. Blue is Democratic, red is Republican … it’s fascinating to see the map change over time. Everyone is red here … everyone in the 1984 Electoral College map apparently loves Reagan. Everyone in the 1972 electoral map loves Nixon, apparently. What you see in many ways is that the Electoral College maps wash out everything, because 51% is represented as 100% on the maps. All contestation is washed out. And this is what we’ve gotten used to seeing because, of course, the Electoral College system determines the final results. The map we’re used to seeing is a red South and red Midwest versus the blue coasts, more or less. And we wanted to explain that.
I’m mentioning this to illustrate the point you made about the unexpected. Which is … with the early maps we saw regional variations. When we visualized the election results at the county level — which gives much more fine-grained results than at the state level — what we saw in detail were things like the effects of Jim Crow segregation — the legacy of slavery and its aftermath (there was, for example, very little voting in the South). What’s amazing is that after WWII you stop seeing regional variations; they start disappearing, especially with the Civil Rights Movement and the Civil Rights Act of ’64, and the Voting Rights Act of ’65, which knocked down state-sponsored segregation and disenfranchisement. And then, “BOOM.” By the time you get to Nixon, Carter, Reagan … regional variation almost disappears from these maps. In part it’s because no region is nearly as strong or as strident as the Electoral College maps would have you see – so the bluest areas have red shot throughout, and the reddest areas have blue shot throughout; we were amazed that it really is purple America when you dig down to this level with the aid of the maps. This illustrates what you were saying, which is, the unexpected angle of all this stuff is what’s exciting. You come in with this perception, and hopefully it will be blown out of the water when you dig down to these other layers.
Julie: How did your collaboration with Rada come about?
Andrew: Bill Moen (UNT associate professor, Department of Library and Information Sciences) introduced us. And Rada’s name kept coming up through the text-mining work I was doing at the University of Richmond because she’s internationally known for being a leading figure in this field. So we talked about our projects, and we’re both very interested in seeing what text-mining can do for historical sources. And so the collaboration came together very nicely. The relationship with Stanford also came out of Richmond. We put together a grant for a conference there called Visualizing the Past, which was all about, ‘How do we visualize historical patterns?’ The best people from around the world came to Richmond and were experimenting with how we visualize patterns spatially. The Stanford folks came out, and we started talking with them about this stuff. They have a spatial history lab that they’re developing, and they’re very interested in text-mining and also in issues of the West. And UNT is at the forefront with our Portal to Texas History and our Digital Projects Unit, so we started building a research relationship with Stanford.
Julie: Cathy Hartman (UNT assistant dean, Libraries) got this grant to analyze historical Oklahoma City newspapers, with an interest in studying the Native American voice. Is her team interested in what you’re doing?
Andrew: Yes, we work hand in glove with them. They’re going to archive what we produce. Which is no small thing. The archive can then be available to scholars. We couldn’t do what we’re doing without the benefit of what they’re doing, which is digitizing this stuff in the first place. We work in a symbiotic relationship. Them producing millions and millions and millions of words has value because we can do this kind of scholarship now.
Julie: That raises an interesting question. The machine can only do so much. The success of this depends on humans and their ability to digitize and then log the data and keyword correctly — assign the right metadata.
Andrew: Absolutely. There’s no part of this that’s unmediated, which is why it’s scholarship as opposed to just tool-making. There’s that mediation involved.
Julie: But even with metadata, you can tag using every variation of word imaginable — but what happens to the concepts? How do these digitization teams rearticulate the concepts? Like captioning … this at least provides a quick summary of what the article is about. Perhaps, in the case of newspapers, the headlines are what provide this summary information.
Andrew: That is the challenge. That’s a great point. There’s an article by a guy named Dan Cohen called “The Raw and the Cooked” about digital history that deals with this issue … about what level do we go to with metadata work and things like that. What Cathy and her lab have decided is that they can either digitize a little bit very deeply and tag it richly with metadata, or they can digitize an enormous amount with virtually no tagging beyond the very basic stuff. That’s the direction they decided to go: to digitize as much as they possibly can, but without tagging of any great depth.
The reason the lab over there is able to churn things out is they’ve managed to perfect so much automation of the process. In my early days, we did everything very deeply by hand to the nth degree as far as mark-ups go with text documents. But what we need for a project like this and to be able to ask these new questions is scale, more than anything else. And so, to answer your question … there is no deep tagging.
Julie: Still … what happens to the concepts? There’s a difference between captioning and tagging. With captioning, you’re reading the article and providing a quick conceptual summary of the contents in one sentence that the individual words themselves could never yield; of course this still reveals a bias. So is captioning an interesting thing?
Andrew: That’s something that people like Rada are very interested in trying to figure out … can we automate summarization? Rada’s worked on things like book summarization, for example. And people like Google are interested because they have the Google Books project, with 12 million volumes. Nobody would argue that an algorithm can produce a better summary than a human could. But when you’re dealing with 10,000 items that no human can possibly go through, we need to figure out how to make the information more useful and do it more efficiently.
Rada is also working on things that show what the most relevant word in a collection may be, and that’s a harder thing to measure. Words like “the” and “uh” and “and” – they’re going to come up a lot. But are they relevant, as opposed to a word that comes up once? A word that comes up only one time at the end of a paragraph may have the most punch. Sometimes less is more in these situations. How do you measure those things? And that’s something computer scientists are dealing with and we, as historians, have to deal with, too. We have an opportunity to lay some new paths. It’s a little scary, ‘cause there’s no model on which to rely, but that’s the exciting side of this kind of research.
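One standard computer-science answer to the relevance problem Torget raises is inverse document frequency: down-weight words like “the” that appear on every page, and boost words concentrated in only a few. A toy sketch in Python, where the sample “pages” are invented for illustration:

```python
import math
import re
from collections import Counter

def tfidf(pages):
    """Score each word on each page by term frequency x inverse document frequency."""
    docs = [Counter(re.findall(r"[a-z']+", p.lower())) for p in pages]
    n = len(docs)
    df = Counter()  # document frequency: how many pages contain each word
    for d in docs:
        df.update(d.keys())
    return [
        {w: tf * math.log(n / df[w]) for w, tf in d.items()}
        for d in docs
    ]

pages = [
    "the cotton crop and the railroad",
    "the election and the union",
    "the cotton market and the port",
]
scores = tfidf(pages)
# 'the' appears on every page, so its idf is log(3/3) = 0 — zero weight —
# while a page-specific word like 'railroad' scores highest.
```

This is only one weighting scheme; the point is that raw frequency alone cannot separate a ubiquitous function word from a rare, high-punch one, which is exactly the measurement problem described above.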
Julie: Edward Ayers, historian, your former teacher and now President of the University of Richmond, speaks about how digital scholarship can help give you perspectives on history from any number of angles — there’s not just one definitive text with one point of view. I love that you can flip the lens and look at the same situation from the slave girl’s point of view, or the wife of the plantation owner’s point of view. Depending on which filter you use, you get a different version of history.
Andrew: It’s gone from 2 to 3-D. Most documents give you what you see. But the ability to actually flip things around and see these multiple sides is a new thing.
Julie: It’s a boundary-dissolving tool. You realize there’s much more going on than that one interpretation, which can be very political.
Andrew: It also dissolves the boundaries between the public and the academic world. What we make available is far more accessible and gets consumed by far more people because it is digital and online. The average academic history book sells about 500 copies. That’s not very much. And hopefully libraries and institutions will purchase these books so that more people can read them. While we were working on the Voting America project at the University of Richmond, we got the attention of Google. Google came calling. And they said, hey, we would like to put your maps on Google Earth and Google Maps and get them out there. And we worked with them in a crazy six-week period, translating all our stuff to put online. They released layers in Google Earth and Google Maps of all our political data.
So this is what Google put out with us. (He turns to the computer to show the Google results.) We did this research in less than a year. We went from nothing to reaching millions of people in a short amount of time. That was the big takeaway. By having this online, it speeds up the academic conversation, the reach, the discussion in a way that books can’t do. I write books. I love books. I’ll continue to write books. But this whole wall (gestures to books on office shelves) is not going to reach that many people.
Julie: Have other people expounded on this research, do you know?
Andrew: I would love to see the research influence and inform something, but I haven’t seen it yet. In the academic world we move at a glacial pace, so maybe someone’s writing a dissertation on it. It’s hard to measure how it might be influencing others.
Julie: This idea that we’re not a red and blue America, we’re a purple America ... I think of those questionnaires that ask for your ethnicity — are you Caucasian, Asian, African American, etc. You might answer one way, only to find out from a DNA test that these boundaries are blurred — that you’re really part Cherokee and part Caucasian.
Andrew: Exactly. But we still talk about politics in regional terms; we’re sticking with what was true before the Civil Rights movement. And the Electoral College system helps us pretend that this other way is, in fact, the reality. The myth is easier than the explanation. The new information usually doesn’t sell very well because it’s more complicated. And people generally prefer not to deal with more complication.
Julie: It is interesting — this realization that the old paradigm continues to drive contemporary thinking. Based on this, the reflection of who we are is really 50 years old.
Andrew: It is. In the Voting America project we discovered that in national politics our language and our metaphors are from the early ’60s. But the reality has changed very much. The implication, I think, is that it’s kind of dangerous to be describing things in a way that’s completely inaccurate. We make decisions based on those assumptions. I was amazed, because we walked into this project with that assumption. We walked in thinking we were going to explain how we got to that point. But as it turns out, we’re not even at that point. We’re somewhere else entirely. And that blew our minds. And every time we get to go out and talk about this stuff – it’s exciting.
Julie: It seems like it would still be a hard sell, though. That even if you have the visual data right in front of you, and the evidence is overwhelming in this other camp, people are still going to hold on to what they know.
Andrew: As a historian, I have a superstitious appreciation for facts. And so I like to revel in those as much as I possibly can. And all I can do is use those to help best answer the questions and try to show, whatever I’m trying to answer, where the evidence pushes me. And that’s what all this is about. Being as transparent as possible, and what comes of that, comes of that.
Julie: Well, thank you, Andrew!
Andrew: It was a fun conversation. Thank you.