The Rare Books Feed: Breakdown of a Content Algorithm

Introduction

About two months ago, I created the Rare Books feed on BlueSky. Ever since, I’ve been meaning to release the specifics of the feed (that is, to publish my algorithm), but I didn’t have an outlet for those sorts of side projects, curiosities, or longform updates. Hence this post, the first on my blog. I’m about two decades late to the blogosphere, but that’s alright. I’ll try to make use of the format more often.

My motivation for the Rare Books feed was simple: I wanted a place where I could casually scroll through rare book and book collecting content. Why rare books? Well, I’m interested in them as a collector and bookseller. I’m always on the hunt for first editions or signed copies. I like to discuss book collecting as a pastime. I try to keep up with the trade by reviewing bookseller and auction catalogues and by reading news from major organizations like the Antiquarian Booksellers’ Association of America (ABAA) or the Independent Online Booksellers Association (IOBA). I’m also interested in how book collecting is portrayed in popular media. I like to hear from bibliophiles at various levels of collecting, whether they’re amassing a vast library of notable high-spots or focusing on a particular niche or subject matter that may not be monetarily valuable but certainly holds intellectual or nostalgic promise.

Social media and online forums provide some resources for finding related content. There are Facebook groups and subreddits and so forth, but honestly, they were never very satisfying. So, I looked to BlueSky because it has a “custom feed” feature. That means BlueSky lets you enact your own content algorithms on its platform. If that seems strange, it’s because it is. I don’t know of any other social media platform that allows you so much control over the content you consume. I love BlueSky for that. I spend more time there than on any other platform these days. Yet I don’t compulsively check it like I used to do with other social media. Why? I don’t know if I can articulate it in detail, but essentially, I think BlueSky is just conducive to a healthy relationship with social media. It’s also not owned by a weird billionaire… But that’s beside the point. Long story short: BlueSky is rad and you should make an account.

To create the Rare Books feed, I used SkyFeed. It’s a third-party application that allows you to build custom feeds. I don’t want this post to become a tutorial on SkyFeed so I won’t get into all the specifics about how it works. It’s very user-friendly, though. You don’t need to know how to code to use SkyFeed. Some familiarity with regular expressions is helpful, but that’s the most technical aspect of the app. If you’re interested, you can learn more on BlueSky’s blog. They explain more about it and highlight its developer, Redsolver, who has done a great job maintaining it through various server expansions and influxes of users on BlueSky.

Breakdown of the Algorithm

Okay, so let’s get into the algorithm. It follows six straightforward steps. I’ll try to break them down as clearly and concisely as possible. By the end of reading, you should have a better sense of why you see what you see when you scroll through the Rare Books feed.

Step 1: Inputting Content

I needed to input content to give my algorithm something to work with. On BlueSky, the content includes posts, reposts, replies, etc. In the SkyFeed app, you can select various sources for content–things like individual users’ posts, content on existing feeds, tagged posts, and so on.

I made it so the Rare Books feed is compiled from the entire network on BlueSky over a five-day period. This means the feed reviews every post from the past five days in search of rare book content. I chose the entire network because I wanted to cast a wide net. If there are any conversations about rare books on BlueSky, I want to be able to find them. Using the whole network as an input gives me an opportunity to do so.

I chose a five-day window because I’ve found that reviewing the past five days provides enough content to scroll for a long time, but not so much that it makes loading the feed painfully slow. I’ll address this challenge in more detail below but suffice it to say that SkyFeed works best when you keep its computational workload as manageable as possible.
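For readers who think in code, here is a minimal sketch of what a five-day input window amounts to. SkyFeed handles this through its own interface, so everything below (the `within_window` helper, the sample posts, the fixed `now` timestamp) is a hypothetical illustration rather than SkyFeed’s actual implementation.

```python
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(days=5)  # the five-day window described above


def within_window(created_at: datetime, now: datetime,
                  window: timedelta = WINDOW) -> bool:
    """Return True if the post was created within `window` before `now`."""
    return now - window <= created_at <= now


# Hypothetical sample data: (post text, creation time)
now = datetime(2024, 10, 15, 12, 0, tzinfo=timezone.utc)
posts = [
    ("Just catalogued a first edition", now - timedelta(days=1)),
    ("Old post about a book fair", now - timedelta(days=9)),
]

# Keep only the posts that fall inside the window.
recent = [text for text, created in posts if within_window(created, now)]
print(recent)
```

A wider window would let more posts through, which is the trade-off between volume and load time described above.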

Step 2: Removing Replies

I removed all replies from the input. This decision did not come lightly. After several weeks of assessing the quality of the feed, I noticed that a lot of its content included replies with offhand use of book collecting jargon, but not necessarily posts directly discussing rare books or book collecting. Of course, this omission sacrifices those replies that are in fact relevant, but I also didn’t want whole conversations about other stuff appearing on the feed just because one person replied with something like, “I remember when he signed copies of his book at an event in town,” or “I bet you could learn more by visiting the special collections.” It wasn’t like the feed was ruined by including replies, but after I omitted them, I found the content was, on the whole, more relevant.
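As a rough illustration, removing replies is a simple filter once each post records whether it answers another post. The `Post` class and its `reply_parent` field below are hypothetical stand-ins, not the data model SkyFeed actually uses.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Post:
    text: str
    # Hypothetical field: ID of the post being replied to, or None for a top-level post.
    reply_parent: Optional[str] = None


def remove_replies(posts: List[Post]) -> List[Post]:
    """Keep only top-level posts (those that are not replies)."""
    return [p for p in posts if p.reply_parent is None]


sample = [
    Post("New arrival: a signed first edition in the original dust jacket."),
    Post("I remember when he signed copies of his book at an event in town.",
         reply_parent="post-123"),
]
top_level = remove_replies(sample)
print([p.text for p in top_level])
```

Note that the reply here contains jargon ("signed copies") but is dropped anyway, which is exactly the trade-off described above.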

Step 3: Regular Expressions

I created a list of regular expressions that capture words or phrases commonly used among booksellers and collectors. This step was the most technical part of the algorithm. It’s also been the most work–work that, in my opinion, will continue for as long as the feed exists. Put simply, language changes and social media platforms are dynamic spaces where the userbase evolves over time. That means there’s no accounting for the precise words that people will use to discuss rare books in the future–and that’s why my list of regular expressions will need to be updated and maintained.

But to get things rolling, I considered what words or phrases most often appear in book collecting circles. To be honest, I haven’t been very scientific about my list. I’m just going off vibes and experience. I would say my experience with the jargon of the rare book world is satisfactory, though. I spend an inordinate amount of time discussing rare books, reading book collecting forums, reading catalogue descriptions, and writing the catalogues for Evening Land Books. I also wrote this free online glossary with contextualized definitions for over 300 book terms. But that being said, I’m sure there are more accurate ways of identifying the most quintessential rare book lingo. There’s a host of text-mining and LLM possibilities in that regard. I just haven’t bothered to try those things for this little side project (yet).

Instead, I took my list of words or phrases that I believe signal rare book content and I added them into my algorithm. As I did this, I reviewed what sorts of content they were finding. Through dozens of iterations, I honed the current list (see Table 1). Over time, I’ll continue to add, replace, or remove words or phrases. It’s certainly laborious to do this kind of maintenance, but I think it’s worth it. It means I’ll continue to have control over the content I consume on BlueSky.

1st edition […] antiquarian	book seminar	huntington library
1st edition […] book	bookauction	ilab […] book
1st edition […] copy	bookhistory	incunab
1st edition […] hardcover	center for the […] book	inscribed 1st edition
1st edition […] novel	club edition […] book	inscribed first edition
1st edition […] rare	club edition […] rare	john carter brown library
abaa […] book	collectible […] book	manuscript […] volume
abebook	color engraving	manuscript library
antiquarian […] 1st edition	cover […] first edition	marble […] endpaper
antiquarian […] book	dust cover […] edition	marble […] volume
antiquarian […] bookseller	dust cover […] rare	modern 1st edition
antiquarian […] edition	dust jacket […] book	modern first edition
antiquarian […] first edition	dust jacket […] edition	morgan library
antiquarian […] rare	dust jacket […] rare	newberry library
aquatint	early edition […] rare	original […] dust jacket
biblio.com	early printing […] rare	photogravure
bibliographical society	engraving […] book	rare […] bookseller
book […] 1st edition	ephemera […] book	rare […] bookshop
book […] 1st edition	ephemera […] rare	rare […] bookstore
book […] antiquarian	extant cop	rare […] lithograph
book […] bibliographical	fine press	rare […] manuscript
book […] club edition	finebooksmagazine	rare […] woodcut
book […] early edition	first american edition	rare book
book […] early printing	first british edition	rarebook
book […] ephemera	first edition […] book	second printing […] rare
book […] first edition	first edition […] copy	signed […] 1st edition
book […] first edition	first edition […] hardback	signed […] first edition
book […] first printing	first edition […] hardcover	signed […] lithograph
book […] original binding	first edition […] novel	signed first edition
book […] woodcut	first edition […] rare	special collections
book auction	first english edition	true 1st edition
book catalogue	first printing […] book	true first edition
book collect […] rare	first printing […] rare	woodcut […] book
book collecting	folio society
book expert	grolier club
book histor	houghton library

TABLE 1: Words or Phrases that Appear in the Rare Books Feed

With the list in place, my next two challenges were translating the words or phrases into efficient regular expressions and culling the list of regular expressions so it didn’t exhaust SkyFeed’s computational resources. I should note here, too, that these challenges will continue as part of the maintenance of the feed. But on the former challenge, I had numerous options for capturing words or phrases with regular expressions. And just so we’re clear, regular expressions are sequences of ordinary characters and metacharacters that describe patterns in text. For example, if you wanted to find any content that mentions “ephemera” and then “first edition” in the same post, you could use the regular expression “ephemera.*first edition”. The “.” stands for any single character and the “*” means “zero or more of the preceding”, so “.*” essentially translates to “any number of characters between”. In other words, “ephemera.*first edition” tells the algorithm to find any posts where 1) the word “ephemera” appears, 2) followed by any number of characters (including none), and 3) followed by the phrase “first edition”. Order matters: a post that mentions “first edition” before “ephemera” would not match, which is why some combinations in Table 1 appear in both orders. That’s just one example of a regular expression. If you’d like to learn more about them or practice using them, I’d try this website. It allows you to test regular expressions and see exactly what patterns they represent.
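To make the “ephemera.*first edition” example concrete, here is a small sketch using Python’s built-in `re` module. The sample posts are invented, and the case-insensitive flag is an assumption; SkyFeed’s regex engine may behave differently in its details.

```python
import re

# Case-insensitive matching is an assumption made for this sketch.
pattern = re.compile(r"ephemera.*first edition", re.IGNORECASE)

# Hypothetical sample posts
posts = [
    "New ephemera in stock, plus a signed first edition of the novel.",
    "A first edition arrived today, along with some ephemera.",
    "Sorting ephemera all afternoon.",
]

# search() looks for the pattern anywhere in the text.
matches = [bool(pattern.search(p)) for p in posts]
print(matches)  # [True, False, False]
```

The second post mentions both terms but in the opposite order, so it does not match. Catching it would take a second pattern such as “first edition.*ephemera”.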

So far, I’ve come up with 31 regular expressions (see Table 2) that capture all the words and phrases in my list. Some are just keywords or phrases like “rare book” or “fine press” or “incunab”. Others are far more complicated with several combinations of words or phrases regarded in multiple orders at once. These regular expressions have proven effective, but again, they require continual maintenance. Sometimes, I’ll be scrolling the Rare Books feed and I’ll notice some content that I don’t think really belongs. When that happens, I return to SkyFeed and tweak the regular expression that is causing this irrelevant content to appear.

Finally, I should note that these regular expressions don’t account for every word or phrase I’d like to capture. That’s because of the second challenge at this step in the algorithm: culling my list so it doesn’t exhaust SkyFeed’s computational resources. In other words, I can’t use too many regular expressions or the feed will take forever to load. I’ve found that SkyFeed seems to be able to handle about 30 regular expressions. But that’s still pushing it… And it’s a shame! I know there are a lot more words, phrases, and combinations I could use to identify rare book content, but it’s a balancing act between thoroughness and efficiency.

([\W]signed|inscribed) (1st|first) edition

([\W]signed|inscribed) (lithograph|ephemera)

(bibliographical|[\W]folio) society

(grolier|caxton) club

(modern first|modern 1st)[\W]edition

(rare|[\W]abe)book

aquatint[^a-z]

biblio[.]com

bookauction

bookhistory

col(o|ou)r engraving

dust jacket.*(book|edition)

extant cop

fine press

finebooksmagazine

first (american|english|british) edition

incunab

manuscript.*volume

marble.*(endpaper|volume)

original dust (jacket|cover)

photogravure

rare book

special collections

true (1st|first) edition

TABLE 2: Regular Expressions in the Rare Books Feed
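To see how a few of the patterns in Table 2 behave, here is a short sketch that runs them through Python’s `re` module. The sample sentences are invented and the case-insensitive flag is an assumption; SkyFeed’s engine may differ, so treat this as an illustration of the pattern syntax only.

```python
import re

FLAGS = re.IGNORECASE  # assumption: matching ignores case

signed = re.compile(r"([\W]signed|inscribed) (1st|first) edition", FLAGS)
colour = re.compile(r"col(o|ou)r engraving", FLAGS)
aquatint = re.compile(r"aquatint[^a-z]", FLAGS)

# [\W] requires a non-word character (such as a space) right before "signed",
# so words that merely end in "signed" are not picked up.
print(bool(signed.search("A signed first edition of the novel")))   # True
print(bool(signed.search("A newly designed first edition cover")))  # False
print(bool(signed.search("Inscribed 1st edition, near fine")))      # True

# col(o|ou)r covers both American and British spellings.
print(bool(colour.search("a color engraving")))   # True
print(bool(colour.search("a colour engraving")))  # True

# [^a-z] requires a following character that is not a lowercase letter.
print(bool(aquatint.search("A lovely aquatint, hand-coloured")))  # True
print(bool(aquatint.search("aquatints")))                         # False
```

One quirk worth knowing: in Python, `[\W]signed` needs some character before “signed”, so a text that begins with the word “signed” would not match that branch. Whether SkyFeed behaves the same way is not something this sketch can confirm.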

Step 4: Inverted Regular Expressions

Next, I created a list of inverted regular expressions (see Table 3) that–it may surprise you–remove certain posts. This step is a bit controversial because it implies some distinctions between closely associated areas of collecting. It also implies some definitions of what constitutes a “rare book”. While I don’t usually subscribe to such hard rules or designations, I still don’t think I can part with my inverted regular expressions. They help to ensure that the Rare Books feed is capturing what I want it to. Otherwise, it would be subsumed by other things.

Those “other things” mainly fall into five categories: 1) comic books, 2) vinyl records, 3) trading cards, 4) Dungeons & Dragons collectibles, and 5) independent authors promoting their books.

Here’s how I determined these things: as I reviewed the content captured by my regular expressions in Table 2, I noticed a lot of overlap between the jargon of the rare book world and the jargon of these other categories. If it had been just the occasional comic book post or indie author promoting their work, I wouldn’t have felt the need to remove them. But these subcommunities produce a lot of content on BlueSky, and that’s good. However, they were producing so much that it was drowning out the other content on my feed. I therefore had to remove content falling into these categories. I decided to do this by using inverted regular expressions: rather than including the content that matches them, the inverted ones filter it out.

I suppose there’s one other category I filtered out, too. It’s miscellaneous stuff. For example, the electronic music duo Autechre released an album called “Incunabula”. You’d be surprised how often it gets mentioned on BlueSky. There’s also an account that posts top trending words on BlueSky and it consistently highlights words from Table 1. So, a few times a week, there would be a “top trending” word cloud on the Rare Books feed, which wasn’t relevant. This is why some seemingly random phrases appear in Table 3, things like “autechre” or “trending words”. They’re just anecdotal instances of overlapping vocabulary that cannot be categorized as overlap with other bookish subcommunities or collecting interests. Yet they appeared on the Rare Books feed often enough for me to notice, so I filtered them out.

(roleplaying|role-playing|role playing)

[^a-z]I published the \w\w\w\w\w edition

[^a-z]I wrote the \w\w\w\w\w edition

autechre

collect[\w]ble.*card

comic book

comicbook

(pre-order|preorder)

sourcebook

strictly prohibited

trending words

vinyl

TABLE 3: Inverted Regular Expressions in the Rare Books Feed
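The idea of an inverted regular expression can be sketched as a second pass that drops any post matching an exclusion pattern. The sketch below uses a handful of the Table 3 patterns with Python’s `re` module. The sample posts, the `is_excluded` helper, and the case-insensitive flag are assumptions for illustration, not a description of SkyFeed’s internals.

```python
import re

# A subset of the inverted patterns from Table 3.
EXCLUDE_PATTERNS = [
    r"comic book",
    r"comicbook",
    r"vinyl",
    r"autechre",
    r"collect[\w]ble.*card",
    r"(pre-order|preorder)",
]
exclude_res = [re.compile(p, re.IGNORECASE) for p in EXCLUDE_PATTERNS]


def is_excluded(text: str) -> bool:
    """Return True if any exclusion pattern matches the text."""
    return any(rx.search(text) for rx in exclude_res)


# Hypothetical posts that already matched an inclusion pattern.
candidates = [
    "Rare book fair this weekend in Boston.",
    "Rare comic book, first printing, for sale.",
    "Listening to Incunabula by Autechre on vinyl.",
    "Collectible trading card displayed next to a rare book.",
]

kept = [text for text in candidates if not is_excluded(text)]
print(kept)  # ['Rare book fair this weekend in Boston.']
```

A single matching exclusion pattern is enough to remove a post, even when the post also contains genuine rare book vocabulary.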

Step 5: Input List of Rare Book Posters

Alright, I guess I’ll pause here and note that the Rare Books feed is technically functional with steps 1 through 4. It sifts through everything on BlueSky over a five-day period, identifies content that uses the rare book jargon I’ve defined in my regular expressions, then filters out irrelevant content. For a while, I felt these steps were enough. They were providing a fair amount of content to keep me satisfied each day as I scrolled through the feed.

But then I began to notice some recurring accounts on the feed. They were individuals, booksellers, and institutions that regularly posted rare book content. Naturally, I began to follow them, and in turn, I noticed that some of their rare book posts were not appearing on the feed. This was because my list of regular expressions was limited for the sake of computational efficiency. The consequence was that it was missing posts that were relevant to rare books and book collecting.

While it’s not a perfect solution, I decided to add this step to the algorithm: the step of inputting all posts from accounts I have deemed Rare Book Posters. These are accounts I’ve vetted that almost always, if not always, post rare book and book collecting content. My list of Rare Book Posters can be viewed here. As I write this in October of 2024, the list includes about twenty accounts. They are all booksellers, rare book institutions, or rare book librarians and archivists. Everything they post appears on the feed whether it includes rare book language or not.

Of course, this doesn’t solve the limitations of my regular expressions, but it does make the feed more robust without adding too much computational stress to the system. As I continue to enjoy the feed–and as the BlueSky userbase continues to grow–I’ll vet more accounts and add them to the list. It’s just another way of capturing a larger slice of the relevant content.
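In code terms, this step turns the feed into a union of two sources: posts that match the rare book vocabulary, and everything from the vetted accounts. Here is a minimal sketch under those assumptions. The account handles, the `Post` class, and the abbreviated inclusion pattern are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Post:
    author: str
    text: str


# Hypothetical handles standing in for the vetted Rare Book Posters list.
RARE_BOOK_POSTERS = {"bookseller.example", "library.example"}

# An abbreviated stand-in for the full list of inclusion patterns.
INCLUDE = re.compile(r"rare book|fine press|incunab", re.IGNORECASE)


def belongs_on_feed(post: Post) -> bool:
    """A post qualifies if its author is vetted OR its text matches a pattern."""
    return post.author in RARE_BOOK_POSTERS or bool(INCLUDE.search(post.text))


posts = [
    Post("bookseller.example", "New catalogue goes live on Friday."),
    Post("stranger.example", "Found a rare book at the flea market!"),
    Post("stranger.example", "What a lovely sunset."),
]
feed = [p for p in posts if belongs_on_feed(p)]
print([p.text for p in feed])
```

The first post contains none of the vocabulary but still appears because of who posted it. That is how the list catches what the patterns miss at little extra cost.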

Step 6: Sort Output

Last but not least, I needed to sort the content the algorithm captured. In SkyFeed, you can sort feeds in several ways (by like count, reply count, repost count, randomly, and more). I chose to sort the Rare Books feed by Creation Date–or, in other words, newest posts first. I may change this option eventually, but I think it works best for feeds checked daily. By sorting by Creation Date, I don’t ever have to scroll past the stuff I’ve already seen.
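Sorting by Creation Date amounts to ordering posts by timestamp, newest first. A tiny sketch with invented posts and a hypothetical `newest_first` helper:

```python
from datetime import datetime, timezone

# Hypothetical posts: (text, creation time)
posts = [
    ("Fine press broadside, just in",
     datetime(2024, 10, 12, 9, 30, tzinfo=timezone.utc)),
    ("Rare book fair opens today",
     datetime(2024, 10, 14, 8, 0, tzinfo=timezone.utc)),
    ("Incunable leaf on display",
     datetime(2024, 10, 13, 17, 45, tzinfo=timezone.utc)),
]


def newest_first(items):
    """Return the posts sorted by creation time, most recent first."""
    return sorted(items, key=lambda item: item[1], reverse=True)


feed = [text for text, _ in newest_first(posts)]
print(feed)
```

Swapping the sort key for a like or repost count would give the other orderings mentioned above.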

Final Thoughts

That’s it! Six easy steps. Just to recap more concisely, the six steps to the algorithm are:

1. Inputting everything on BlueSky over the last five days
2. Removing replies
3. Capturing relevant content with regular expressions
4. Filtering out irrelevant content with inverted regular expressions
5. Inputting posts from especially bookish accounts
6. Sorting the output

This project was more involved than I had imagined when I first started, but I wouldn’t say it’s work. It’s just become part of my social media usage. Once I had established the basic steps of the algorithm, it wasn’t hard to tweak things to try to carve out better slices of rare book content on BlueSky. Overall, too, this project has been quite liberating. Back when I spent more time on Meta’s platforms, I often felt like I was being inundated with hot garbage. I’ve never felt that way with BlueSky, and I’ve especially never felt that way with the Rare Books feed because I get to choose the filters, parameters, inputs, and so forth. It’s a nice feeling (autonomy), one not typically felt on social media.

The custom feed feature is not perfect, though. The guardrails that keep a feed from overloading the app are difficult to work around. As it currently stands, the Rare Books feed takes about six seconds to load. I know that’s not great. But it’s the compromise I’m willing to make in order to use the number of regular expressions I think are necessary and to sift through the entire BlueSky network for as much relevant content as possible. I also believe the server challenges will be addressed in time. If more users start making custom feeds and BlueSky continues to grow, I bet there’ll be more investment in SkyFeed or other custom feed features.

Now let me conclude by directly addressing some potential audiences who may have bothered to read this far.

Firstly, to my fellow bibliophiles: please join BlueSky and post rare book content. Invite booksellers, rare book institutions, and collectors. If you primarily post rare book content, I’ll add you to my list of vetted Rare Book Posters. It would be great to see the rare book community grow on BlueSky.

Secondly, to my fellow academics: may I kindly suggest you toy around with custom feeds on BlueSky and consider assigning the activity to undergraduate students. I think our students would benefit tremendously from at least some exposure to custom feeds. I’ve observed that students desire more knowledge about social media and its impact(s) on their lives. By building custom feeds, they would be given the opportunity to learn more about content algorithms and some fundamental differences between platforms. They may also discover the value (as I have) of more autonomy online. I know I’ll be incorporating the exercise into classes I teach in the future. I’m confident my students will respond positively.