Flashback: Viral Texts, Deduplication, and the Star-Spangled Banner

This project is a little harder to describe. For one thing, I’m still technically part of the Viral Texts Project, but my focus has shifted to the Virality of Racial Terror (link forthcoming)–a different part of the work and, in a lot of ways, a different project. I’ll report on that work in future posts. What’s important to understand now, though, is that it wasn’t my first role with Viral Texts.

In my first role, I started what was kind of an aimless endeavor in the beginning as I tried to find a way into Viral Texts. I was a baby PhD student, still completing coursework and teaching at the same time, and Viral Texts, as you may know, was/is a longstanding DH project that already had many members and subprojects within it. So, frankly, I struggled to figure out what the hell I was doing.

But over time, some of the work came into focus. I didn’t produce exhibits or articles, but I made strides in understanding the core computational elements of Viral Texts. In particular, I got a good handle on Passim, the text reuse library developed by David A. Smith. There are plenty of resources describing how to deploy Passim, as well as publications on its algorithmic structure. If you want to know more, I suggest reading those, but suffice it to say that Passim allows users to identify reused or duplicate text in corpora. Some of its core logics can also be deployed without using the library itself.

Anyway, once I understood how to use Passim and, separately, how to construct corpora of newspaper text by scraping the Chronicling America archive, I suddenly found myself capable of doing real Viral Texts work. I began to explore deduplication–the removal of reused texts from corpora, a common preprocessing step in NLP–and discovered how impactful effective deduplication could be, not just for frequency analysis but also for embedding analysis and language model training. I tested various deduplication methods, including Passim itself and, on its own, one of Passim’s core logics: “shingling” texts into overlapping 5-grams.
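For readers who haven’t met shingling before, here is a minimal sketch of the general idea in Python. It is not Passim’s implementation and not the code I actually ran: the word-level tokenization, the Jaccard comparison, the 0.5 threshold, and the function names (`shingles`, `jaccard`, `deduplicate`) are all assumptions made for illustration.

```python
import re


def shingles(text, n=5):
    """Return the set of overlapping word n-grams ("shingles") in a text.

    Word-level tokens are an assumption of this sketch; texts shorter
    than n words yield an empty set.
    """
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two shingle sets (0.0 if either is empty)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def deduplicate(docs, threshold=0.5, n=5):
    """Keep the first document from each group of near-duplicates.

    A document is dropped when its shingle set overlaps an already-kept
    document at or above the (hypothetical) threshold. Documents too short
    to produce any shingles never match anything and are always kept.
    """
    kept = []  # list of (text, shingle_set) pairs
    for text in docs:
        current = shingles(text, n)
        if any(jaccard(current, previous) >= threshold for _, previous in kept):
            continue
        kept.append((text, current))
    return [text for text, _ in kept]
```

Comparing every document against every kept document is fine for a toy example but far too slow for a newspaper archive, which is part of why a purpose-built tool like Passim is worth learning.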

To demonstrate how important proper deduplication could be, I began scraping the Chronicling America archive for a corpus of references to the Star-Spangled Banner. The idea was to use this corpus to compare deduplication methods: one pass would remove general references to the Star-Spangled Banner while keeping full-text reprints of the song, and another would do the reverse. I then ran some embedding analysis, which showed varied results based on what had (or hadn’t) been deduplicated. The work was also interesting because it revealed a textual history of the Star-Spangled Banner. For example, the song seems to have been reprinted in full more frequently during the American Civil War:

This fact is made more intriguing when you consider that the Star-Spangled Banner wasn’t the National Anthem at the time. It didn’t become the National Anthem until 1931. Yet it was always a song frequently published and performed in the United States, quite often in July at Independence Day celebrations:

How it was referenced in the newspapers–and how the phrase “star-spangled banner” became a common synonym for the American flag–was revealing an intriguing story about how the song became the National Anthem. That dimension of the project, along with its potential technical interventions, was beginning to feel promising! And I was having fun and building confidence with NLP methods and newspaper corpora.

But then, I got the call from my PhD advisor, Dr. Cordell, who was hoping to bring me onto the Virality of Racial Terror Project (link forthcoming).

This was good news! But it meant that this little dedup/Star-Spangled Banner project would be sidelined. Perhaps the banner will yet wave, but in any case, I thought it’d be good to archive the work via this post. Should I revitalize the effort? Since 2023, Viral Texts has made strides in making its data available, which would certainly help this project, so yes, the answer is yes. But we’ll have to wait for a little lull in my academic and creative work. As future blog posts will show, I’ve got a few other things on my plate in 2026/2027.
