Building Activity Feeds with MongoDB
by Andrew Kandels
MongoDB is an excellent data store for capturing and querying user activity feeds. Its schema-less design allows you to group similar, but not identical, content -- often exactly what feeds are made of. I recently built an activity feed system on a large, high-traffic site and wanted to share some of the challenges and solutions I found.
I run a multi-tenant environment with many sites that share the same database, segmented by a property named siteId. Each site has one or more followers (i.e., user logins), with a many-to-many relationship between sites and followers. The followers have their own feeds, which mash up the feeds of the sites they follow.
Denormalize and Duplicate
I faced a dilemma: I had approached my schema like I would in a relational database. I created a single feed collection and used complex queries to retrieve each feed based on the requester (follower or site). This introduced a large amount of complexity and made the system efficient for writes but slow for reads -- exactly the opposite of what's ideal for this type of system.
Embrace the denormalization of MongoDB and architect your feeds so that each type of feed has its own collection. The word type here is loosely defined by each unique or complex query you would otherwise run against a single shared collection. In my case, this meant one collection for followers and another for sites.
This means every time a site activity happens, I have to write it to the site feed as well as the feed of every follower of that site. This means duplicate storage; but, activity documents are very light and I'm happy to trade storage for performance. I have to send out email notifications anyway and I do this in a background worker, so it was easy enough to marry this effort to inserting the new activity feed documents.
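The fan-out write can be sketched in Python. This is a minimal stand-in using plain dictionaries and lists in place of MongoDB collections; the collection names, field names, and follower registry are illustrative, not taken from the original system:

```python
from datetime import datetime, timezone

# In-memory stand-ins for the two feed collections.
feed_site = []      # one activity document per site event
feed_follower = []  # one duplicate document per follower of that site

# Hypothetical follower registry: siteId -> list of follower ids.
followers_by_site = {1: [101, 102, 103]}

def record_activity(site_id, verb, subject):
    """Write the activity to the site feed, then duplicate it into
    the personal feed of every follower of that site."""
    doc = {
        "siteId": site_id,
        "verb": verb,
        "subject": subject,
        "created": datetime.now(timezone.utc),
    }
    feed_site.append(doc)
    # Denormalize: one copy per follower. Storage is traded for fast,
    # simple reads -- each feed is later queried by a single id.
    for follower_id in followers_by_site.get(site_id, []):
        feed_follower.append(dict(doc, followerId=follower_id))

record_activity(1, "posted", "Hello world")
print(len(feed_site), len(feed_follower))  # one site doc, three follower docs
```

In production the loop body would be an insert into each collection, run from the same background worker that sends the email notifications.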
Benefits of this approach:
- If you join or leave a site, your profile feed only changes going forward -- what I'd expect.
- The most common queries are fast, simple (by id) and easy to index.
- Less complexity. The work is done once: at the time of content creation.
- Followers can have activities unique to only their feeds.
- Followers can hide items in their personal feeds just by removing those documents.
Most activity feeds are capped either by time or by number of items. MongoDB provides two solutions for this out of the box: capped collections and time-to-live indexes. Both are applied at a collection level.
The first option I considered was capped collections. I could assign a fixed size to the collection, and once it reached that size, old documents would automatically drop off as new ones were added. This wasn't really a solution for me, as that fixed size would need to keep changing to match our growth and isn't very predictable. I concluded this works better for logging collections.
The second option was to timeout old activities using a time-to-live index. While this proved to be an excellent option (and also works quite nicely in multi-tenant environments with segmented feeds in each collection), we unfortunately had a business rule that required us to keep the last 1,000 items. We didn't want sites with large activity feeds to seem barren just because they hadn't been visited in several months.
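For reference, both built-in options are one-liners to set up. This is a configuration sketch using pymongo; the database, collection, size, and expiry values are illustrative, and neither option ended up in my system:

```python
from pymongo import MongoClient

db = MongoClient()["mydb"]  # connection details are illustrative

# Option 1: a capped collection -- fixed size in bytes; the oldest
# documents are overwritten once the cap is reached. Good for logs,
# but the size has to be guessed up front.
db.create_collection("activity_log", capped=True, size=100 * 1024 * 1024)

# Option 2: a TTL index -- each document expires a fixed time after its
# "created" timestamp. Works per-document, so it plays well with
# multi-tenant collections, but can leave a quiet site's feed barren.
db.feed_site.create_index("created", expireAfterSeconds=90 * 24 * 3600)
```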
So in short, I needed to keep the last 1,000 feed items for each unique site and follower (segmented by their respective ids) in the two collections. This required an application-driven solution I call capped and segmented activity feeds, described next.
Capped and Segmented Activity Feeds
My requirements: I have a collection called site representing each client, and a collection named feed_site representing that site's activity feed. I need to keep the last 1,000 documents in feed_site for each unique site id.
First, I created a property called numFeed in my site document, which is incremented each time a new document is inserted into feed_site. Thankfully, MongoDB supports atomic increments (the $inc operator), so I won't run into trouble with race conditions.
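Here is a pure-Python stand-in sketching why the atomicity matters: the lock below plays the role that $inc plays server-side in MongoDB, so concurrent writers never lose an update. The class and method names are illustrative:

```python
import threading

class SiteCounter:
    """Stand-in for the $inc-based numFeed counter on the site document.
    The lock provides the atomicity MongoDB's $inc gives server-side."""
    def __init__(self):
        self._lock = threading.Lock()
        self.num_feed = 0

    def next_num(self):
        # Increment and read back in one atomic step, like an
        # increment-and-return-document update in MongoDB.
        with self._lock:
            self.num_feed += 1
            return self.num_feed

counter = SiteCounter()

# Many concurrent writers, no lost updates.
threads = [threading.Thread(target=counter.next_num) for _ in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter.num_feed)  # 100
```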
Second, upon inserting a new feed_site document, I retrieve the newly incremented value of numFeed. Then I run a modulus of that number against 1,000 -- my limit.
Third, the primary key (the _id property) of my feed_site document becomes the site id, an underscore, and the result of my modulus operation. For site id 1, that means ids ranging from 1_0 to 1_999, leaving only 1,000 possible unique values for MongoDB's built-in unique _id property. Because of this, you must use upserts when adding new feed_site documents.
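Put together, the scheme looks like this. A minimal pure-Python simulation, with a dict standing in for the feed_site collection and a cap of 5 instead of 1,000 so the wrap-around is visible; all names are illustrative:

```python
CAP = 5  # stands in for the 1,000-item limit

num_feed = {}   # the site document's numFeed counter, keyed by site id
feed_site = {}  # _id -> activity document (stands in for the collection)

def add_activity(site_id, payload):
    # Step 1: atomically increment the site's counter (MongoDB: $inc).
    num_feed[site_id] = num_feed.get(site_id, 0) + 1
    n = num_feed[site_id]
    # Step 2: modulus of the counter against the cap.
    slot = n % CAP
    # Step 3: segmented _id; the upsert (a plain dict assignment here)
    # replaces the oldest document once the counter wraps around.
    _id = f"{site_id}_{slot}"
    feed_site[_id] = {"_id": _id, "siteId": site_id, "seq": n, "data": payload}

for i in range(7):  # 7 inserts against a cap of 5
    add_activity(1, f"activity {i}")

print(sorted(feed_site))  # never more than 5 ids for site 1
print(num_feed[1])        # 7 -- the counter doubles as a total count
```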
Querying a site's feed is simple: query by site id and sort by the timestamp embedded in each activity. Everything else takes care of itself. Once the counter wraps around (on my 1,001st activity, the modulus comes back to 1), the upsert replaces the oldest document (id 1_1).
There are other benefits to this approach as well:
- You're not locked into 1,000 as a cap. Change it at any time within your code.
- Works great with sharding: use _id as your shard key. It's already set to the site id plus the modulus, which adds entropy (helping with hot-chunk problems in the balancer), and your queries can reuse its index.
- Great for pagination, as the numFeed property removes the need for count queries.
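That last point can be sketched briefly: since numFeed already records how many activities a site has ever had, the total (up to the cap) and the page count fall out of arithmetic with no count query against the feed collection. The page size and function name are illustrative:

```python
PAGE_SIZE = 20
CAP = 1000

def page_info(num_feed):
    """Derive pagination data from the site document's numFeed counter
    alone -- no count() against feed_site required."""
    total = min(num_feed, CAP)      # the feed never holds more than CAP items
    pages = -(-total // PAGE_SIZE)  # ceiling division
    return total, pages

print(page_info(37))    # a young site: (37, 2)
print(page_info(4321))  # an old site capped at 1,000 items: (1000, 50)
```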