COMPLEXITY: Physics of Life

Lauren Klein on Data Feminism (Part 2): Tracing Linguistic Innovation

Episode Notes

Where does cultural innovation come from? Histories often simplify the complex, shared work of creation into tales of Great Men and their visionary genius — but ideas have precedents, and moments, and it takes two different kinds of person to have and to hype them. The popularity of “influencers” past and present obscures the collaborative social processes by which ideas are born and spread. What can new tools for the study of historical literature tell us about how languages evolve…and what might a formal understanding of innovation change about the ways we work together?

Welcome to COMPLEXITY, the official podcast of the Santa Fe Institute. I’m your host, Michael Garfield, and every other week we’ll bring you with us for far-ranging conversations with our worldwide network of rigorous researchers developing new frameworks to explain the deepest mysteries of the universe.

This week we conclude our two-part conversation with Emory University researcher Lauren Klein, co-author (with Catherine D'Ignazio) of the MIT Press volume Data Feminism. We talk about tracing change in language use with topic modeling, the role of randomness in Data Feminism, and what this work ultimately does and does not say about the hidden seams of power in society…

Subscribe to Complexity wherever you listen to podcasts — and if you value our work, please rate and review us at Apple Podcasts and/or consider making a donation at santafe.edu/give.

You can find numerous other ways to engage with us — including books, job openings, and open online courses — at santafe.edu/engage.

Thank you for listening!

Join our Facebook discussion group to meet like minds and talk about each episode.

Podcast theme music by Mitch Mignano.

Follow us on social media:
Twitter • YouTube • Facebook • Instagram • LinkedIn

Related Reading & Listening:

Data Feminism by Catherine D'Ignazio & Lauren Klein

“Dimensions of Scale: Invisible Labor, Editorial Work, and the Future of Quantitative Literary Studies” by Lauren Klein

“Abolitionist Networks: Modeling Language Change in Nineteenth-Century Activist Newspapers” by Sandeep Soni, Lauren Klein, Jacob Eisenstein

Our Twitter thread on Lauren’s SFI Seminar (with video link)

“Disentangling ecological and taphonomic signals in ancient food webs” by Jack O. Shaw, Emily Coco, Kate Wootton, Dries Daems, Andrew Gillreath-Brown, Anshuman Swain, Jennifer A. Dunne

More resources in the show notes for Part 1: Surfacing Invisible Labor.

Episode Transcription

Transcript by http://podscribe.ai (machine draft) + Aaron Leventman (human edit)

Lauren Klein (0s): Anyone who works with data knows, data is created by people. Data are created by people, if you want to be pedantic about it. And because they are created by people, they necessarily involve choices about what to count, what not to count, what to classify, what not to classify, what to include, what not to include. And then there's also a whole separate set of issues about the impact of the data. What is the capture of this data? Who does it impact? Who does it potentially harm? What is the analysis of this data?

What is the impact? Who does it potentially harm? You need to be thinking about human impacts, always.

Michael Garfield (60s): Where does cultural innovation come from? Histories often simplify the complex, shared work of creation into tales of great men and their visionary genius, but ideas have precedents and moments, and it takes two different kinds of person to have and to hype them. The popularity of influencers past and present obscures the collaborative social processes by which ideas are born and spread. What can new tools for the study of historical literature tell us about how languages evolve, and what might a formal understanding of innovation change about the ways we work together? Welcome to Complexity, the official podcast of the Santa Fe Institute. I'm your host, Michael Garfield, and every other week we'll bring you with us for far-ranging conversations with our worldwide network of rigorous researchers developing new frameworks to explain the deepest mysteries of the universe. This week, we conclude our two-part conversation with Emory University researcher Lauren Klein, co-author, with Catherine D'Ignazio, of the MIT Press volume Data Feminism.

We talk about tracing change in language use with topic modeling, the role of randomness in Data Feminism, and what this work ultimately does, and does not, say about the hidden seams of power in society. Subscribe to Complexity wherever you listen to podcasts. And if you value our work, please rate and review us at Apple Podcasts and/or consider making a donation at santafe.edu/give. You can find numerous other ways to engage with us, including books, job openings, and open online courses, at santafe.edu/engage.

Thank you for listening. I remember when you spoke for your SFI seminar, I asked if you were applying this kind of modeling approach to contemporary corpora, trying to trace otherwise invisible contributions of, like, the lower-decks, uncelebrated employees in modern workforces. And similarly, I'm reminded of work that Jack O. Shaw, who just joined us as a fellow, has done with our VP of Science, Jennifer Dunne, on trying to use food web reconstruction to identify gaps in the fossil record.
 

Clearly something isn't fossilizing here, because we can see that there's an enormous metabolic contribution coming from this lacuna in the food web. Could you talk a little bit about this? I think we've already touched on some of the stuff that you explore in this paper, but I'd love to hear you unpack it a little for us.

Lauren Klein (3m 40s): So this paper, it's good that it comes out of this conversation about slow academic scholarship, because it is one of the end results of a very long collaboration between myself and Jacob Eisenstein. We both arrived at Georgia Tech the same year, in 2013, and he's now at Google, but it took almost that entire time to identify the corpus, prepare the corpus, clean the corpus, clean it again, and figure out what methods would yield things that were meaningful. But more conceptually speaking, this is another project that tries to get at some of these themes again, to really bring together a lot of the work that I do, which has to do with unacknowledged labor, with the paths by which change, and political change in particular, takes place, and then who was rewarded or recognized for that change. So what we actually do is we go back to this newspaper corpus, the same one that I used for that earlier paper, but we look at language at the level of individual words, and we try to find who, or which newspaper, was responsible for innovating a change in meaning: a new meaning, but more likely, as we found, a change in the meaning of a particular word.

So, to give an example from the paper: the word justice, at the beginning of the corpus, in around 1820, really has this narrow legal definition. If you look at the words that are most similar to the word justice, you get discussions of laws and rights and the political and legal system. But by midway through the corpus, and certainly by the end, there's been this inflection where justice has taken on these more capacious connotations, like citizenship, like rights, like freedom.
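To make the kind of shift Klein describes concrete, one way to eyeball it is to train separate word embeddings on an early and a late slice of a corpus and compare each slice's nearest neighbors for a word like "justice." This is only an illustrative sketch with invented toy sentences and gensim's Word2Vec, not the authors' actual pipeline or data:

```python
# Illustrative sketch: compare nearest neighbors of a word in two time slices.
# The toy sentences below are invented; real work would use full newspaper text.
from gensim.models import Word2Vec

early_sentences = [
    ["justice", "law", "court", "judge", "sentence"],
    ["the", "court", "administered", "justice", "under", "law"],
] * 50
late_sentences = [
    ["justice", "freedom", "citizenship", "rights", "emancipation"],
    ["the", "cause", "of", "justice", "and", "freedom", "for", "all"],
] * 50

# Train one small model per time slice.
early = Word2Vec(early_sentences, vector_size=50, window=3, min_count=1, seed=1)
late = Word2Vec(late_sentences, vector_size=50, window=3, min_count=1, seed=1)

# Nearest neighbors hint at how the word's usage context has shifted.
print("early neighbors:", [w for w, _ in early.wv.most_similar("justice", topn=5)])
print("late neighbors: ", [w for w, _ in late.wv.most_similar("justice", topn=5)])
```

With real newspaper text, it is this kind of neighbor list that would show drift from narrowly legal vocabulary toward words like citizenship and freedom.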

This is really interesting, and it's compelling because it tracks what I was talking about before: a lot of work has been done to show how the abolitionist movement wasn't only about achieving legal emancipation for people who had been enslaved, but was also redefining what it meant to live. And so this idea of justice: what does justice mean? It doesn't just mean, okay, do you get this sentence for your crime or the other sentence, do you get a disinterested judge or not. It doesn't just mean that. Justice means, you know, everything that we mean now when we say social justice.

And this change started to happen in the 19th century. So the first question was: did this change happen? Did it happen with any degree of statistical significance? That was the first thing that we set out to establish. And we developed a process that looked at word embeddings, actually, and how word embeddings changed over time, and also how they were associated with individual newspapers. And then we did a ton of error checking on this. So we ran a bunch of permutation tests.
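As a rough idea of what "permutation test" can mean in this setting, here is a minimal sketch using plain NumPy and cosine distance between averaged context vectors. This is an assumption about the general shape of such a test, not the specific procedure in the Soni, Klein, and Eisenstein paper; all of the vectors are synthetic.

```python
# Sketch of a permutation test for semantic change, using numpy only.
# We ask: is the shift between a word's early and late average vector larger
# than we would expect if the early/late labels were assigned at random?
import numpy as np

rng = np.random.default_rng(0)

def cosine_distance(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical per-document context vectors for one word, split by period.
early_contexts = rng.normal(loc=0.0, scale=1.0, size=(40, 50))
late_contexts = rng.normal(loc=0.5, scale=1.0, size=(40, 50))  # shifted usage

# Observed shift: distance between the word's average early and late vectors.
observed = cosine_distance(early_contexts.mean(axis=0), late_contexts.mean(axis=0))

# Null distribution: shuffle the early/late labels and recompute the shift.
pooled = np.vstack([early_contexts, late_contexts])
n_early = len(early_contexts)
null = []
for _ in range(1000):
    rng.shuffle(pooled)  # permute rows, destroying the period labels
    null.append(cosine_distance(pooled[:n_early].mean(axis=0),
                                pooled[n_early:].mean(axis=0)))

p_value = (np.sum(np.array(null) >= observed) + 1) / (len(null) + 1)
print(f"observed shift = {observed:.3f}, permutation p = {p_value:.4f}")
```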

We came up with a whole bunch of synthetic data. And so, in the end, for the words that we said definitely changed in our corpus with significance, we believe that now. So first we found what these words were, and some of them were these really ideologically charged words, like justice, like freedom; not, interestingly, abolition; but I think rights was another word that acquired these more conceptual valences. But then, beyond that, we wanted to find out who was doing the changing. If you subscribe to this belief that words change by our enunciating them, whether it's by speaking or by using them in print, that they change because of how people use them...

...then we wanted a way to figure out who was leading these changes, who would introduce a changed meaning, and then, if it was widely adopted by the general public, or in our case by this particular corpus, who was the next to pick it up. And so we actually built on a measure that Sandeep Soni developed. He actually just got his PhD from Georgia Tech, and he's moving on to Berkeley. He developed it for another paper, in a totally unrelated context, about innovative language and citation counts. He developed a method that showed that people who use innovative language in papers actually have higher citation counts.

But what he was able to do is look at the overall embedding for any particular word, and then look at the embedding for that word that was associated with a particular newspaper, and say, okay, if at a certain point the specific embedding for that newspaper takes over as the one that predicts how the other newspapers will use that word, then that's a sign that this newspaper is leading the change for this particular word.
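Very roughly, the intuition is that a newspaper "leads" on a word if its current usage looks like everyone else's future usage. The sketch below scores that with simple cosine similarity over made-up per-newspaper, per-period vectors; the published method is more sophisticated (it works with predictive models of usage), so treat this as a cartoon of the idea, not the actual measure.

```python
# Cartoon of a lead/follow score: which paper's usage of a word at time t
# best resembles the other papers' usage at time t + 1? Vectors are invented.
import numpy as np

rng = np.random.default_rng(1)
newspapers = ["Liberator", "Provincial Freeman", "Anti-Slavery Standard"]
T, D = 6, 50  # number of time steps, embedding dimension

# word_vectors[paper][t] = (made-up) embedding of one word in that paper at time t.
word_vectors = {p: rng.normal(size=(T, D)) for p in newspapers}

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

lead_score = {}
for leader in newspapers:
    scores = []
    for t in range(T - 1):
        for other in newspapers:
            if other == leader:
                continue
            # Does the leader's current usage anticipate the other's next usage?
            scores.append(cosine(word_vectors[leader][t],
                                 word_vectors[other][t + 1]))
    lead_score[leader] = float(np.mean(scores))

for paper, score in sorted(lead_score.items(), key=lambda kv: -kv[1]):
    print(f"{paper:<25} lead score {score:.3f}")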

And he was able to set it up so that we looked at every single word, every single newspaper, every single time period in the corpus. And then we pulled out what we talk about in the paper as the leaders in these semantic changes, and then the fast followers of these changes. And I guess the one last thing that I'll say about that is that it actually did show us some interesting things, and it's related to the conversation we were having earlier. So Lydia Maria Child, again, the editor of this fairly mainstream abolitionist newspaper, was brought on board to be more moderate, to temper the movement, to bring in more people who might've been a little bit hesitant to join the cause because they perceived it to be too radical.

Her newspaper is called the National Anti-Slavery Standard, and her tenure as editor was actually characterized by this: she was the editor who was quickest to adopt the largest number of words. So this is an indicator that she had her eye on the discourse, but she never innovated. As soon as she saw that there was this interesting new development, she would jump on it. And so that was a really interesting finding. And interestingly, the newspaper that innovated in the largest number of words was The Liberator, the more radical newspaper edited by William Lloyd Garrison, which he and she together recognized as too radical to appeal to most people.

So they let him innovate and she was fast to follow. And so we found, in this quantitative analysis, confirmation of the major dynamic that has been remarked upon in the qualitative research. But then the other really interesting thing: we found Mary Ann Shadd, who I was talking about before, this Black woman editor who was known for, or at least felt that she was being unfairly maligned for, her radical views. Lo and behold, not only did she innovate ideologically, but she also innovated at the level of language. So her newspaper had many more new word usages than other newspapers of its ilk.

And then the other really interesting finding, interesting in the sense of being kind of complicated and potentially bad, is that we added in some non-abolitionist newspapers in order to get some outside context. So we added in some general-audience newspapers and some women's suffrage newspapers, because the women's suffrage movement, as I think most people know, ran alongside the abolitionist movement. At some times, when it was expedient, they dovetailed in their aims. And at other times, as we know now from the historical record, the white women intent on their own political franchise were actually quite racist and were very clearly willing to put their own political enfranchisement ahead of Black people's actual freedom.

We found that, in spite of the racism that really came to characterize the women's suffrage movement, those papers actually innovated in a lot of words in this abolitionist discourse. And I'm still sitting with that, trying to figure out what it means. Well, I think we know what it means, which is that there's a disconnect between what people say and what they do. And there are some interesting broad theories in sociolinguistics, and broad findings, that have shown that women tend to innovate linguistically more than men do.

This has been proved longitudinally through oral interviews, hand annotation of data, and things like this. It also may have something to do with the range of what are called women's conversations versus the narrowness of what men feel compelled to speak about. I mean, there are weird broad strokes that you could paint; you could get a broad-strokes explanation for this. But I actually find that pretty unsatisfying. To me, it really shows, like I said before, this disconnect between the people who know what the politically expedient or politically correct position is to voice...

...the people who write about it in the newspapers, and then the gap between that position as voiced and how they actually follow through with their actions. This is the grand challenge of political change: so many more people are willing to say, for example, Black Lives Matter, but then, what are you going to do about it? This is, I think, what academia and the country have been grappling with over the past year. The first step is to recognize that we have tremendous racial inequality in our country and the world. But as a white person, as a white scholar, as a white academic, what are you willing to give up so that we can rebalance this? That's harder to do.

And so, anyway, I feel like I actually would like to spend some more time digesting those results, because I think they're pretty interesting.

Michael Garfield (12m 52s): There's a lot there. Well, one thing that I just want to attend to, and maybe we don't need to linger on this, but it's fascinating: working in social media, I am just woefully disappointed with the measures by which we determine influence in society generally. Somebody is an influencer because they have a million followers, but like you just said, a lot of those people are actually just popularizers; they're not really innovating anything.

And so it seems like we need to draw on research like this in order to better understand the points of leverage. It's easier to notice that the second person on the dance floor is the one that gets everybody onto the dance floor, that kind of thing. Right now the incentive structure is such that if you are one of these folks that's listening carefully, that's casting a wide net and then popularizing a hashtag, you're the one who gets noticed. Luckily, we're capable of tracing things now to identify who actually created the hashtag.

But that person, historically, is not rewarded. And even now, it's banal, but the people that are doing the innovating are not necessarily the people that are being approached by marketing agencies, because they're not the ones with the enormous audience. And that touches back to comments that you made a moment ago about the incentive structure in academic research: how does somebody doing truly innovative research get funded for it? You need people that are capable of seeing things in the way that you're talking about them. But actually, one of the things I'd like to explore a little bit more with you is something kind of curious that you touched on in the results of this paper, about how justice goes from this concrete meaning to a much more abstract meaning, but in other places in the corpus the word equality goes in the opposite direction.

The word freedom goes in the opposite direction. And it makes me wonder if there's a pattern that we can observe here that is related to patterns in evolutionary dynamics, about the pressure to move from a generalist to a specialist or back. Under what contexts is a messy, broad, interoperable strategy better, and in which cases do you want something that's really precise and attuned to a particular context?

And I'm curious how you understand these two different trajectories in the work, and what you think it means about the evolution of language under certain discursive constraints, and so on.

Lauren Klein (15m 45s): That's a really interesting observation, and to be honest, I hadn't really thought about that before. I mean, one can see how certain external pressures might make a word need to be used in a more precise way than it had been in the past. And I should say that one of the things that we were pretty attentive to, and I'd say this is mostly the influence of Jacob, was really trying not to make claims that extended beyond the corpus. So we actually never used the word influence in the paper, because we didn't want anyone to impute any sort of causal relationship to anything that we were implying.

We just wanted to say: within this corpus, this is what's going on. Because there are too many exogenous influences, too many things that could actually be at the root of some of the changes that we're seeing. All we really wanted to speak to was this corpus, which honestly has a lot of the big names in the newspaper space at this time, but not all of them. And that's what we wanted to do. What I'll say about how language works, and this isn't really an answer to your question, but it is another interesting finding that prompted one of the directions of this paper, is that there were a couple of steps in the research process that corresponded to points where, in our own analysis, we thought we would be done and would have enough to say, but there kind of wasn't enough there.

Or we found a surprising result that we wanted to dig a little bit deeper into. And one of those results had to do with the fact that at some point in the process, we had the embeddings, we came up with the words, we had a ranking. Because of the metric that Sandeep developed with respect to leading and following, we could rank the words according to the strength of the change: this word was one that very, very clearly changed to a high degree, and this one changed to a somewhat less significant degree. So we ranked them, and when we looked at that, we actually hand-read through the top 2,000 words that had exhibited these changes.

And not very many of them were these sort of ideologically pregnant words, or these keywords, as you might call them in the Raymond Williams approach to culture. Most of them were fairly common words: fought and struggled and hope and wish. They were adjectives and verbs. And yet they remained in this list of highly significant word changes once passed through all the different error checking that we did.

And so there has to be something here. And what this pointed to, the fact that the vast majority of these words were words at the level of discourse rather than explicitly ideological keywords, was that what was changing in these newspapers was the way that people talked about political change. And in some ways, and this gets at the idea of invisible labor that I was talking about in this other paper, you can't drill down beyond the level of discourse. You have to stop there.

That's the base; that's the level that you can get at. So you say: I think what we're capturing is how this discussion took place, and we're just going to have to leave it there and trust that it did change, and that certain actors did lead in introducing changes in how these concepts were discussed. And this is actually, in the paper, what led us to take this aggregate view, where we decided, instead of trying to drill down into the meanings of these specific words, which I think resist any specific or indexical relation to some sort of abolitionist concept, just to say: okay, if this is at the level of discourse, then maybe we need to look at the level of discourse and do some aggregating.

This was actually what led us to come up with some of the network structures that we end up using, and that we present in the paper, which led to the conclusions I was talking about earlier, about how certain newspapers, like The Liberator or the Provincial Freeman, were these real leaders in terms of language.
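One way to picture that aggregation step is to roll per-word "A led, B followed" events up into a directed network and then count how often each paper leads versus follows. A minimal sketch with networkx and an invented event list, not the paper's actual network construction:

```python
# Sketch: aggregate per-word lead/follow events into a directed network,
# then rank papers by how often they lead vs. follow. Events are invented.
import networkx as nx

lead_events = [
    ("Liberator", "Anti-Slavery Standard", "justice"),
    ("Liberator", "Anti-Slavery Standard", "freedom"),
    ("Provincial Freeman", "Liberator", "rights"),
    ("Provincial Freeman", "Anti-Slavery Standard", "hope"),
]

G = nx.DiGraph()
for leader, follower, word in lead_events:
    if G.has_edge(leader, follower):
        G[leader][follower]["words"].append(word)
    else:
        G.add_edge(leader, follower, words=[word])

# Aggregate view: how often does each paper lead vs. follow?
for paper in G.nodes:
    led = sum(len(d["words"]) for _, _, d in G.out_edges(paper, data=True))
    followed = sum(len(d["words"]) for _, _, d in G.in_edges(paper, data=True))
    print(f"{paper}: led on {led} words, followed on {followed}")
```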

Michael Garfield (19m 42s): So, two more questions that I hope we have time to discuss. One of them is about the role of randomization in this work. You mentioned, in the first paper we discussed, that the sampling process relies on random selection, and therefore the model yields a slightly different set of topics each time the code is run; this aspect of topic modeling inference is important to acknowledge. And then in this paper, you explain that this is necessary for a number of reasons, and that, to touch back on the comments that we'd made about honesty in journalism...

...and so on, this is a way of trying to help wash out bias, both in the data and in the approach that you're taking to it. So I'd love to hear you expand a little bit on that.

Lauren Klein (20m 36s): Randomization enters the process in two very different places in those two different papers. So, how topic modeling works is that you begin with a random allocation of words into topics, and then, through an iterative process, those topics get refined. The idea is that ultimately they converge on stable topics, but you always need to begin with a random allocation. That's just how LDA topic modeling works.
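A small demonstration of that point, assuming gensim's LdaModel on a toy corpus: two runs with different random seeds start from different random allocations but settle on recognizably similar topics, even though the exact word rankings can differ. The corpus and parameters below are illustrative, not from the study.

```python
# Sketch: run LDA twice with different random seeds on the same toy corpus
# and compare the top words per topic across runs.
from gensim import corpora
from gensim.models import LdaModel

# Toy corpus with two obvious themes (abolitionist discourse vs. cooking).
texts = [
    ["abolition", "emancipation", "freedom", "slavery"],
    ["flour", "sugar", "recipe", "bake"],
    ["abolition", "liberty", "freedom", "rights"],
    ["recipe", "butter", "flour", "oven"],
] * 25

dictionary = corpora.Dictionary(texts)
bow = [dictionary.doc2bow(t) for t in texts]

# Two runs, two different random initializations.
for seed in (1, 2):
    lda = LdaModel(bow, num_topics=2, id2word=dictionary,
                   random_state=seed, passes=10)
    print(f"seed {seed}:")
    for topic_id in range(2):
        top_words = [w for w, _ in lda.show_topic(topic_id, topn=4)]
        print("  topic", topic_id, top_words)
```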

So, two things. One is that while the topics begin with a random allocation of words, it is also true, and anyone who's worked with topic modeling knows this, that because topic modeling as an algorithm makes sense, they do ultimately converge on similar topics. So if you run a topic model over a corpus a lot of times, for instance this abolitionist newspaper corpus, which I've probably run hundreds of topic models on...

...you always do sort of end up with a cooking one and an abolition one and a war one; they're all roughly the same. But what this also means, and it's a caution primarily to humanists, who have a tendency to close-read the meaning of individual words, is that you can't necessarily say, oh, because this specific word ends up as a most significant word in this specific topic, that word has unique significance within the topic. Really, what I was trying to say in that paper is that we need to take the topic as the smallest unit that we can analyze.

Maybe it's even below that, actually, similar to this question of discourse: below that, the specific words that comprise the topic are slightly unstable underneath. In this paper, we used randomization as a caution against spurious correlations. It could have just been that one newspaper published a lot, in a very short span, on a specific topic, not a topic-model topic, just an issue, and therefore there was an inflection, a change in the word, but really it was the result not of general word usage, just of this very specific issue that happened to involve the use of the specific word in the specific newspaper.

So we wanted to guard against that. We also have the general problem that these newspapers themselves were very sporadic or uneven in their publication history. We had some newspapers that published across the entire span of our corpus, from the 1820s to 1865, where we capped it. We had others which popped in and out for a year or two. We had some which published weekly, some monthly, some which did huge issues of 20 pages, some which only had a bi-fold or broadside kind of print. So we really wanted to make sure that we weren't capturing anything that wasn't actually related to the change in word usage of the specific words.

And so, what we essentially did, and I'm trying to think of the specific places where this came into play: one of the things that we did was create totally synthetic data. I want to make sure that I'm getting this right, but I think we took the same words and we assigned them to different newspapers, randomized newspapers at random time periods. And the idea was that if you ran the same model, you would not get the same measure of this word change, because the source of the words had been randomized.

Instead, you would see no correlation, or you would see no change, because it was random data. And then we ran through this, I believe, something like a hundred times, and we asked, okay, what words retained their significant change when compared to the synthetic data, among all of these different permutations of the data that we ran? And so that was a way that we were trying to guard against false conclusions from the data. And I guess it all leads to the end, which is: don't draw false conclusions from the data because you don't understand how the process works. But it was two very different deployments of randomization.

One at the very beginning versus one at the end, to try to guard against this.
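A hedged sketch of that second use of randomization, the synthetic-data control: shuffle which period each document belongs to, recompute the change statistic under each shuffle, and keep only words whose real score clears the shuffled distribution. The change_score function below is a deliberately crude, frequency-based stand-in for the real embedding-based measure, and all the documents are invented.

```python
# Sketch of a synthetic-data control for spurious change.
import numpy as np

rng = np.random.default_rng(42)

def change_score(docs_by_period, word="justice"):
    """Stand-in statistic: change in a word's relative frequency between
    the first and last period. The real measure works on embeddings."""
    def rate(docs):
        total = sum(len(d) for d in docs)
        return sum(d.count(word) for d in docs) / max(total, 1)
    return abs(rate(docs_by_period[-1]) - rate(docs_by_period[0]))

def split_by_period(docs, labels):
    return [[d for d, p in zip(docs, labels) if p == 0],
            [d for d, p in zip(docs, labels) if p == 1]]

# Toy documents (token lists) with their true period labels (0 = early, 1 = late).
docs = ([["justice", "law", "court"]] * 30 +
        [["justice", "freedom", "rights", "justice"]] * 30)
periods = [0] * 30 + [1] * 30

observed = change_score(split_by_period(docs, periods))

# Synthetic control: reassign documents to periods at random, many times.
shuffled_scores = [
    change_score(split_by_period(docs, rng.permutation(periods)))
    for _ in range(200)
]

threshold = np.percentile(shuffled_scores, 95)
verdict = "keep" if observed > threshold else "discard"
print(f"observed {observed:.4f} vs shuffled 95th pct {threshold:.4f} -> {verdict}")
```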

Michael Garfield (24m 44s): I'm kind of reminded of the conversation I had with Peter Dodds of the University of Vermont on the show, where he talks about looking at Twitter, and how the approaches he took to Twitter gave a much deeper understanding of the way that different terms are taken up into parlance, and the life cycle that they have in society, compared to Google Ngrams, which is only looking at the frequencies of words in a given subset of all books published. Ngrams is not doing the kind of work that you're doing in tracing the networks of adoption, which is something that you can do on Twitter.

But then, that's actually not to the point of randomness. That's just to the point of why you must randomize, so that you're not conflating the fact that a word suddenly appears in all of these books, or disappears, with it actually mattering to people; people may not be reading those books, and they may not actually be talking about it.

Lauren Klein (25m 47s): Benjamin has done some great work on this, showing how these artifacts arise. Because Google Ngrams comes from major university libraries, he talks about how you can actually find some interesting textual artifacts that show up as spikes in that standard Google Ngram chart, and they actually just have to do with either major acquisition decisions on the part of a single library or cataloging decisions. I know he looks at, for instance, the zip code of Cambridge, Massachusetts, 02138, and there's this tremendous spike that you can find when you do the Google Ngrams graph, and he's like, no, actually, this is just when Harvard digitized all of their books before a certain date and dumped them into Google Ngrams.

And because it says "property of" or "acquired by Harvard University Libraries" or whatever, that's where it shows up. He actually does amazing work in general with these data artifacts, or artifacts in the data, that show you a lot about the process of data creation, as opposed to what you actually thought your data was capturing.

Michael Garfield (26m 48s): So from there, I'd like to land this conversation with a question about looking forward, and how a data feminist approach grants some insight or offers some strategy in terms of our own writing and our own content creation broadly: how we can reframe the process of how we produce the media we make now, and how we archive it and preserve it for future analysis, and how we might be able to build on or apply your research to come up with a more equitable database for future historians and for people that are looking to use a more accurate read of history to provide better results in terms of resource allocation, economic policy, et cetera. How do we leave a better record for the future so that they can make better decisions?

Lauren Klein (27m 56s): It's such a good question. And I actually think it's a hopeful one to end on, because I think there are some pretty clear solutions here. This is something that Catherine and I say in Data Feminism, but I think anyone who works with data knows: data is created by people. Data are created by people, if you want to be pedantic about it. And because they are created by people, they necessarily involve choices about what to count, what not to count, what to classify, what not to classify, what to include, what not to include.

And then there's also a whole separate set of issues about the impact of the data. What is the capture of this data? Who does it impact? Who does it potentially harm? What is the analysis of this data? What is the impact? Who does it potentially harm? You need to be thinking about human impacts, always. And I think when you start there, and again, this is with an eye to ultimately doing not just more valid research, but more ethical research and more enduring research, there actually have been some really interesting allied projects that have come out of the CS space, and also from the data journalism space...

...and also just from the open data movement more generally, which have to do with documenting the context surrounding a dataset, so that it can be passed on, if it should be passed on, in more thorough ways than simply recording some basic metadata about the dataset and then depositing it in some sort of open institutional repository. So I'm thinking here of this great paper, Timnit Gebru was the lead author, but it's a multi-author paper, called “Datasheets for Datasets.”

And all this is, is a series of very thought-out questions about data provenance, data collection, potential impacts, potential ethical issues, just making sure that even if you yourself might not have thought to ask a given question about a dataset, you are asking these questions before you proceed at any phase of any data-related project, whether it's collection, analysis, communication, compiling, or future sharing of the dataset, and so on. Heather Krause has a great, a little more simplified, approach called the data biography, which just involves asking who, what, when, where, why, and how about your dataset, the way that journalists know how to ground-truth other sorts of evidence that they have. Bob Gradeck is someone involved in civic data.

And he has this idea of the data user guide. But all of these approaches are roughly doing the same thing, which is to try to give more context around the data and to build more awareness: on the one hand, to try to make the data more useful in the future, but also to be aware of the potential harms of these data if they get into the wrong hands, or if they're used or abused by some sort of corporation or other institution that doesn't have the best interests of the people, or the issue that the data is documenting, in mind.
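For readers who want to try this in practice, here is a minimal sketch of recording a "data biography" alongside a dataset as a small structured file. The field names loosely follow the who/what/when/where/why/how questions mentioned above; they are illustrative, not a schema from Krause, Gebru et al., or Gradeck, and the example values are stand-ins.

```python
# Minimal sketch of a "data biography" record stored next to a dataset.
from dataclasses import dataclass, asdict
import json

@dataclass
class DataBiography:
    """Illustrative record; field names are not an official schema."""
    who_collected: str
    what_is_counted: str
    when_collected: str
    where_collected: str
    why_collected: str
    how_collected: str
    known_gaps: str
    potential_harms: str

bio = DataBiography(
    who_collected="Project team and student research assistants",
    what_is_counted="Articles from nineteenth-century activist newspapers",
    when_collected="Compiled 2013-2020; originals published 1820-1865",
    where_collected="Library microfilm and existing digital archives",
    why_collected="To study language change across the abolitionist press",
    how_collected="Digitization plus manual cleaning; publication schedules are uneven",
    known_gaps="Short-lived papers and small broadsides are underrepresented",
    potential_harms="Mistaking gaps in the archive for absence of activity",
)

# Store the biography next to the dataset so the context travels with the data.
with open("data_biography.json", "w") as f:
    json.dump(asdict(bio), f, indent=2)
```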

Michael Garfield (30m 57s): Well, before we close out on this, due to the nature of this conversation, I would just like to ask: are there any contributors or agents to this work, any questions we have failed to ask, any people we have failed to credit, or maybe not even human people, but, you know, agencies and influences? Who should we not leave invisible before we sign off?

Lauren Klein (31m 28s): That's really a great question. I appreciate your asking it. I think that I've named all of the co-authors on all of the papers that we've talked about, but I will say that a lot of this work, most of this work, has come up through the Digital Humanities Lab that I run, now at Emory, but for a lot of years at Georgia Tech. And that involved a lot of undergraduate students passing through for semesters and years and durations of their undergraduate careers, experimenting with this data, helping me think through issues, and just generally working towards these kinds of publications that you see. And it's tough, because again, this is sort of invisible labor; it's not directly visible...

...in these final papers that were published, and yet the work could not have happened without them. And actually, I'll just say: watch the skies. I'm working on a project, hopefully it will be out really soon, on the history of data visualization. And one of the figures that I talk about in this project is W.E.B. Du Bois. He, I think somewhat famously at this point, involved a lot of his students in the production of his data visualizations. And so one of the things that I'm trying to do is think of a visual way that we can testify to these students' contributions, even if they aren't named in the output itself.

So yeah, maybe I'll end there and just say, thanks for this great conversation. I really appreciate it. And I appreciate the space to talk about labor in all its variations.
 

Michael Garfield (32m 52s): This has been a treat. Thank you so much for taking the time. Thank you for listening. Complexity is produced by the Santa Fe Institute, a nonprofit hub for complex systems science located in the high desert of New Mexico. For more information, including transcripts, research links, and educational resources, or to support our science and communication efforts, visit santafe.edu/podcast.