
How I’m Backing Up Chronicling America’s OCR Text, Just in Case

I’m going to get straight to the point: this post is about how I’m backing up Chronicling America, the Library of Congress’s database of digitized newspapers. Why? Because I have concerns about its maintenance and accessibility given the Trump administration’s dangerous policies toward our cultural heritage institutions. I count on Chronicling America as a research assistant. I’m also planning to utilize Chronicling America for my dissertation and numerous other research projects in the years and decades ahead. Who knows what will happen, but I’m backing up Chronicling America’s OCR text because, quite frankly, I don’t have deep trust in the current administration’s ability to see its value and the necessity of maintaining it.

My concerns aren’t just a hunch or general anxiety, either. The Trump administration has already shut down the National Digital Newspaper Program (NDNP). This program awards NEH funding to state and local institutions in exchange for the digitization of their newspaper archives. These digitized newspapers are then added to Chronicling America. In other words, Trump’s NEH grant cancellations have already stunted the growth of the database. At the same time, he fired Carla Hayden, the Librarian of Congress, and replaced her with one of his loyal toadies, a person with absolutely no experience in librarianship. The full consequences of this maneuver are not immediately clear to me, but it doesn’t inspire hope for the Library of Congress. If there are continued shakeups, threats, or destabilizations, I fear for all of the resources at the Library of Congress, including Chronicling America.

So, if you, like me, rely on Chronicling America for your work and you don’t want to leave the fate of your work to the whims of the current administration, then maybe this post will help. I’ll break down how I’m backing up the database (just the OCR-generated, full-text files to start[1]) with the goal of making the whole process replicable for others. I won’t lie to you, though: it’s a computationally and financially expensive project. It relies on some Python know-how and the appropriate hardware. It takes several days with a constant internet connection. It’s no small feat: we’re talking about downloading over 20 million pages of newspaper text onto a local device. But we’re living in strange times that call for drastic measures. I’m not taking this data for granted. I’m sharing my process in case others feel the same way and want or need to create their own local backup of the Chronicling America OCR text.

What You’ll Need

Before you start downloading the OCR-text files from Chronicling America, you’ll need a few things. Those things are:

  • A storage device with at least 3TB of space

In my case, I’m working with a Samsung Portable SSD with 8TB of storage. I would suggest far more than the minimum 3TB of storage. That extra space becomes critical when you want to search and interact with the database on your device (more on that in a future post). But if all you can afford is a 3TB SSD, it’s technically enough to hold all the necessary files. If you’re following along with my process, however, your interactions with those files may be limited without more disk space.

  • A computer you can leave running with constant internet connection for 3+ days

The faster the internet, the better. For me, the whole runtime was about 80 hours. I was working with good fiber internet and a 2023 MacBook Pro. If you have slower internet and/or an older device, it will take longer.

  • A Python environment to run the code

I’m providing all the necessary code in this post. You can also refer to my GitHub repository. If you need help getting anything to run, contact me. I can’t promise I’ll be able to help, but I’ll try my best.
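Before committing to a multi-day download, it may be worth checking that the drive you plan to use actually has the free space. Here’s a minimal sketch using only Python’s standard library; the '/' path is a placeholder, so swap in your own SSD’s mount point:

```python
import shutil

def has_enough_space(path, required_tb=3):
    # shutil.disk_usage reports total/used/free bytes for the filesystem containing path
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_tb * 1024**4  # counting 1 TB as 1024^4 bytes (TiB), the stricter measure

# Example: check the filesystem at '/' (replace with your SSD's mount point)
print(has_enough_space('/'))
```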

Things to Consider

The first resource to know and understand is Chronicling America’s batch file directory. These batches are the files of digitized newspapers submitted to Chronicling America by various institutions. They’re provided as tarballs (compressed archive files). When you download and uncompress one of these tarballs, you’ll find a directory with newspapers organized by Chronicling America’s SN codes (the unique identifiers for each newspaper), then years, months, dates, editions, and sequences (newspaper page numbers).
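To make that organization concrete, here’s a small sketch that parses a path following the structure described above. The example path and file name are hypothetical (exact layouts can vary by batch), but it shows how the pieces line up:

```python
# A hypothetical path inside one of these tarballs, following the organization
# described above: <sn code>/<year>/<month>/<day>/<edition>/<sequence>/<file>
example = 'sn86069873/1865/04/15/ed-1/seq-1/ocr.txt'

def parse_ocr_path(path):
    # Split the path into the labeled parts Chronicling America uses
    sn_code, year, month, day, edition, sequence, filename = path.split('/')
    return {'sn_code': sn_code, 'year': int(year), 'month': int(month),
            'day': int(day), 'edition': edition, 'sequence': sequence,
            'file': filename}

print(parse_ocr_path(example))
```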

The batch file directory contains all these tarballs up to those submitted to the Library of Congress by March 2024. There have been subsequent additions to Chronicling America since then, but their updates to the batch file directory seem to end there. In my process, I’m only working with these batch files. If you need more recent additions to Chronicling America (anything added after March of 2024), you’ll need to download them some other way. I’d look to their API documentation as a starting point in that regard, but please know my process only archives Chronicling America OCR up to March 2024.

To run the code, you’ll also need to use ocr_batches.csv, a reference file I created. I won’t go over this csv file in great detail here[2]. Just know that it’s the csv file you’ll need to run my code. It also contains a ‘contents’ column that tells you what newspapers and years are contained in the given tarball. This means you can use ocr_batches.csv to navigate the database once it’s fully downloaded to your local device.
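As a sketch of how that ‘contents’ column can be put to work, here’s a filter over a made-up two-row stand-in for ocr_batches.csv. The column names match the real file, but the batch names and contents values below are invented, and I’m using the standard-library csv module so it runs without any dependencies:

```python
import csv
import io

# A made-up two-row stand-in for ocr_batches.csv (same columns, invented values)
sample = io.StringIO(
    'file_name,contents\n'
    'batch_one.tar.bz2,"sn86069873 (1865), sn84026994 (1870-1872)"\n'
    'batch_two.tar.bz2,"sn83030214 (1900)"\n'
)

def batches_containing(rows, sn_code):
    # Return the tarball file names whose 'contents' column mentions the given SN code
    return [row['file_name'] for row in rows if sn_code in row['contents']]

rows = list(csv.DictReader(sample))
print(batches_containing(rows, 'sn86069873'))  # → ['batch_one.tar.bz2']
```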

Downloading Batches

In a nutshell, my process uses well-documented Python libraries and standard modules (requests, pandas, tqdm, os, time) to loop through the ‘file_name’ column in ocr_batches.csv, appending each file name to a base URL to download each batch of Chronicling America OCR. It also implements delays so you don’t run into Chronicling America’s rate limits. This is the main reason why it takes 3+ days to run.

Assuming you have everything you need (listed above) and the necessary Python libraries in your environment, then the following code should run for you[3]. The only thing you’ll need to change is the output_directory. You need to set it to your external hard drive or SSD (see comments in code below).

Python
import requests
import pandas as pd
import time
from tqdm.notebook import tqdm # This is best if you're working in a Jupyter notebook. Otherwise, use 'from tqdm import tqdm'
import os

ocr_batches = pd.read_csv('https://raw.githubusercontent.com/MatthewKollmer/chron_am_backup/refs/heads/main/ocr_batches.csv')
base_url = 'https://chroniclingamerica.loc.gov/data/ocr/'
# be sure to set the output directory to your own external hard drive/SSD with at least 3TB of space
output_directory = 'CHANGE/TO/DIRECTORY/PATH/OF/YOUR/CHOOSING'

def pull_tarbiz_files(file_name):
    url = f'{base_url}{file_name}'
    output_path = os.path.join(output_directory, file_name)

    if os.path.exists(output_path):
        progress_bar.update(1)
        return

    try:
        with requests.get(url, stream=True) as response:
            if response.status_code == 200:
                # Write in chunks so multi-gigabyte tarballs aren't held in memory all at once
                with open(output_path, 'wb') as file:
                    for chunk in response.iter_content(chunk_size=1024 * 1024):
                        file.write(chunk)

                progress_bar.update(1)
                time.sleep(60) # This adds a minute in between each download to respect Chron Am's rate limits. You could try to shorten this timespan to speed up the whole process, but I was still getting 429 errors with time.sleep(30), so I just bumped it up to a minute, which kept things running.

            elif response.status_code == 429:
                print('Received 429 error. Sorry Chron Am! Waiting 1 hour before retrying.')
                time.sleep(3600) # I don't know if Chron Am bans IP addresses, but just in case, better to back off for an hour if you somehow get a 429 error!
                pull_tarbiz_files(file_name) # Then actually retry the same file once the hour is up

            else:
                print(f'Something went wrong downloading {file_name}: {response.status_code}')
                time.sleep(5)

    except Exception as e:
        print(f'Exception issue with {file_name}: {e}')
        time.sleep(5)

Once you’re ready to let it run, you can then loop through ocr_batches, as I do in the last snippet of code below. The progress bar will tell you how far along you are. It should also be fine if you need to pause the code. The first conditional statement in the pull_tarbiz_files() function (where it says “if os.path.exists(output_path):”) checks to see if the given tarball already exists in the directory and moves on if it does, meaning you won’t have to start over if you need to pause things, or if your connection breaks.

Python
# Heads up: this code runs for a long, long time. On my device, it took about 80 hours. It's okay to stop and start it again–the first conditional statement in the pull_tarbiz_files() function (where it reads: "if os.path.exists(output_path):...") skips over files that have already been downloaded, so it's fine if you want to run the code for a while, stop running it, and then start again later. You'll just pick up where you left off.
progress_bar = tqdm(total=len(ocr_batches), unit='file', desc='Batches Downloaded', mininterval=1.0)

for _, row in ocr_batches.iterrows():
    pull_tarbiz_files(row['file_name'])

progress_bar.close()

Navigating the Batches

Once you’ve downloaded all the batches, you’ll probably need to keep them compressed (unless you downloaded them to an especially massive storage device). I don’t know exactly how much disk space it would take to unpack them all, but it’s a lot–probably 5x their compressed size of 3TB. You should therefore unpack them only when needed, which means you’ll need to navigate their contents before opening them.

To do that, you can use my reference files: ocr_batches.csv and newspapers.csv. ocr_batches.csv has a row for every tarball file. It also has a ‘contents’ column with the SN codes and years of the newspapers contained in the given tarball. newspapers.csv, on the other hand, has a row for every unique newspaper and a column listing any tarballs containing that newspaper. If you want more info on these reference files, check out this notebook describing how I put them together. They should let you navigate the database without having to unpack the tarballs.

Also, just FYI: I’m working on some automated ways to do that. They involve the ‘tarfile’ module in Python, which lets you read and extract things without fully unpacking the file. Once I have my own processes figured out, I’ll post again and provide the code. But for now, rest assured: you have a local backup of Chronicling America’s OCR text and two reference files documenting what’s in your local backup.
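In the meantime, if you want to peek at a single page without unpacking anything, the tarfile module mentioned above can stream one member out of a compressed batch. Here’s a minimal sketch; both paths in the usage comment are placeholders, not real file names:

```python
import tarfile

def read_member(tarball_path, member_path):
    # Open the tarball and stream out just one file, leaving everything else compressed
    with tarfile.open(tarball_path, 'r:*') as tar:  # 'r:*' auto-detects the compression
        extracted = tar.extractfile(member_path)
        return extracted.read().decode('utf-8')

# Example usage (both paths are placeholders):
# text = read_member('/Volumes/my_ssd/some_batch.tar.bz2', 'sn86069873/1865/04/15/ed-1/seq-1/ocr.txt')
```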

That’s all for now. I don’t want to wait any longer to disseminate this post in case others need or want to do the same thing. I’ll just close by saying that I sincerely hope I never have to use my local backup of Chronicling America’s OCR text. I hope the Library of Congress manages to weather this storm. But if not, I at least have a version of one of its best resources to carry forward.

Footnotes

  1. I’m focusing on the OCR-generated, full-text files for two reasons: 1) they are immediately important to my work in cultural analytics and text-mining, and 2) I lack the funds to acquire the appropriate hardware for downloading the image files. Sadly, the OCR-text files do not represent the full scope and value of Chronicling America. Ideally, I’d gather the PDF files as well, but according to my rough estimates, doing so would require something like 160 TB of storage space, thousands more in hardware, and several months of runtime. Or at least that’s what it would take with my limited abilities as a programmer. But in any case, the OCR-text files are more manageable and pertinent to my work, so that’s the scope of this project for now.
  2. If you’re interested, see this Python notebook for how I constructed the ocr_batches.csv.
  3. Just full disclosure: I often collaborate with LLMs when I code. I use them to troubleshoot errors, learn about libraries or modules I’m not familiar with, and fill in boilerplate code.
