Realtime Reading Lists With Jekyll Hooks

Note

This content is now outdated, as I’ve migrated my site over to Astro. I hope you can still find it helpful. Enjoy 🙂

Preface

You may have noticed my fancy books page, which contains an up-to-date list of what I’m reading and what I’ve finished. That page is powered by Goodreads and updates on a daily basis. Initially, I was updating the page manually with a local build of my site and a push to GitHub, but I wanted to automate it. I already log my books on Goodreads, so why shouldn’t that be the end of it?

Goodreads profiles can be public (and my profile is). With a bit of digging, you can find RSS feeds for each bookshelf, i.e. Reading, Read, To-Read. Once I had the links, I just needed to loop through the feed items and generate some Jekyll data and I’d be set, right?

Reality check

In reality, the task is a little more complex. Here were the most prominent challenges:

Since I was building my site locally and pushing to GitHub Pages, I would have to rebuild locally if I wanted the lists to update, and that’s no fun. This was resolved through GitHub Actions, which I’ll discuss shortly.
I initially went the route of using a Generator to build YAML files with the RSS items. That was problematic because of the order in which Jekyll builds a website: Generators run after Jekyll has already read the site’s files, data files included, so YAML written by a Generator isn’t picked up until the next build. In practice it meant I had to build my site twice locally for changes to propagate (once to change the files and a second time to actually see them rendered in the browser). Additionally, when using Generators with a GitHub action to build the site, I was unable to overwrite the necessary files in the master branch, so my solution was completely non-functional with a remote build!
There is remarkably little documentation/discussion on how or when to use a Generator, Hook, or other custom plugin. I ultimately found this blog helpful, along with the (sparse) Jekyll documentation on hooks and plugins.

After my initial crack at Generator -> GitHub Action -> Reading List failed, I went back to the drawing board.

The Solution

Until now, this post has been relatively boring with little useful information, so I’ll try to make it worth your time.

My plugin needs to:

Run at build.
Fetch and parse items from my Goodreads feed.
Inject that into my site (somehow) so I can loop through them in a list with Liquid.
Have the ability to run via a GitHub action, i.e. be a truly automated solution.

Fetching RSS items

This was surprisingly simple, considering I had no Ruby knowledge beforehand. I used Open URI to fetch the URL, the RSS library to parse the feeds, and Nokogiri for some HTML parsing, though that wasn’t strictly necessary.

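require 'rss'
require 'open-uri'
require 'nokogiri'

# Placeholders: replace with the RSS URLs of your own shelves.
reading_url = '[your-url-for-reading-books]'
read_url = '[your-url-for-read-books]'

# Fetch a Goodreads shelf feed and return one hash per book.
def get_rss_items(input_url)
    item_list = []
    URI.open(input_url) do |rss|
        feed = RSS::Parser.parse(rss)

        feed.items.each do |item|
            # The description is an HTML fragment, so parse it to find the image and link.
            parsed = Nokogiri::HTML.parse(item.description)

            # String#[] with a capture index returns nil (instead of raising) when a field is missing.
            author = item.description[/author:\s*((\w|\s|\.|\/)*)/, 1]
            # Parsed here but not used below; the payload takes its date from pubDate.
            date_read = item.description[/read\s*at:\s*((\d|\/)*)/, 1]
            img_src = parsed.at_xpath('//img/@src')&.value
            url = parsed.at_xpath('//a/@href')&.value
            # Last path segment of the item link, without the query string.
            guid = item.link.split('/')[-1].split('?')[0]

            payload = {
                'title' => item.title,
                'link' => url,
                'img_src' => img_src,
                'date_read' => item.pubDate&.to_date,
                'author' => author,
                'guid' => guid
            }
            puts item.title
            item_list << payload
        end
    end
    item_list
end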

We start by opening the URL as rss, then passing that variable to our parser. A simple Ruby loop says that, for each item in the feed, we want to parse the description and other attributes (using Nokogiri and some regex) and assign them to a payload. We then print the title to the logs and append the payload to our list. The function then returns that list. Now all we need to do is get that into our site!
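One thing to watch with regex extraction like this: `String#match` returns `nil` when a field is missing, and calling `[1]` on `nil` raises, which would stop the whole build. Below is a minimal sketch of a more forgiving version. The `extract_field` helper, the sample description, and the sample link are all made up for illustration; real Goodreads feed items may be laid out differently.

```ruby
# Hypothetical helper: return the first capture group, stripped,
# or nil when the pattern does not match (instead of raising on nil[1]).
def extract_field(text, pattern)
  match = text.match(pattern)
  match && match[1].strip
end

# Made-up description, only roughly in the shape a feed item might have.
sample = "author: Ursula K. Le Guin<br/> read at: 2021/03/14<br/>"

author    = extract_field(sample, /author:\s*([\w\s.\/]*)/)
date_read = extract_field(sample, /read\s*at:\s*([\d\/]*)/)
missing   = extract_field(sample, /rating:\s*(\d+)/) # field not present

# The id is the last path segment of the link, minus any query string.
link = "https://www.goodreads.com/review/show/12345?utm_medium=api" # made-up link
guid = link.split('/')[-1].split('?')[0]
```

With this shape, a missing field simply becomes `nil` in the payload, and the Liquid template can decide whether to show it.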

Using GitHub Actions to build the site on a schedule

After a bit of Googling, I was able to find this action that automates Jekyll deploys. Setting it up is pretty simple.

Create a YAML file in your GitHub pages repository
Configure the YAML file to your preferred settings using the action docs. Note: this changes how your GitHub Pages site is served. Rather than serving straight from the master branch, you build the site, output it to a separate branch (gh-pages is the default), and then serve the page from that branch. It requires messing around with some settings, but I found the guide from the action’s author to be sufficient.
Run the action on a schedule!
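As a rough sketch of the scheduling piece, a workflow file along these lines triggers a build every night. The file name, workflow name, cron time, and job name are assumptions, and the deploy step is left as a placeholder comment because it depends on the action you choose and its docs.

```yaml
# .github/workflows/build.yml (hypothetical file name)
name: Scheduled Jekyll build

on:
  push:
    branches: [master]
  schedule:
    # Every day at 06:00; GitHub evaluates cron schedules in UTC.
    - cron: '0 6 * * *'
  workflow_dispatch: # allows a manual run from the Actions tab

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Add the Jekyll build-and-deploy action here, configured per its docs.
```

Scheduled runs can start a little later than the cron time suggests, which is fine for a reading list.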

Now that we have the action firing, the site is being built remotely! Cool! But the result is no different from before. Not cool!

Loading to site data pre-render

We need to get our Ruby function running at the point just before the site is built. That way, we can create a variable accessible to the site/page/doc to iterate through. Digging through the Jekyll docs, I found the Hooks page. From the syntax, it appears to do just what we want:

Jekyll::Hooks.register :site, :pre_render do |site|
  # code to call just before Jekyll renders the site
end

Using the site and pre_render arguments, we should be able to generate content prior to rendering and make that available to the entire site!
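To see the mechanism in isolation before adding any feed logic, here is a minimal sketch that injects a hard-coded list. The `sample_books` helper, its contents, and the `sample` data key are made up for illustration; the `defined?` guard is only there so the file can also be loaded outside of a Jekyll build.

```ruby
# Hypothetical stand-in for the real feed data.
def sample_books
  [
    { 'title' => 'Example Book', 'author' => 'Example Author' }
  ]
end

# Register the hook only when Jekyll is loaded, i.e. during a site build.
if defined?(Jekyll::Hooks)
  Jekyll::Hooks.register :site, :pre_render do |site|
    # Anything placed in site.data here is available to Liquid as site.data.sample
    site.data['sample'] = sample_books
  end
end
```

If a page can loop over `site.data.sample` after a build, the hook is wired up correctly and the static list can be swapped for real data.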

Putting our function together with the Hook registration:

require 'rss'
require 'open-uri'
require 'nokogiri'

# Fetch a Goodreads shelf feed and return one hash per book.
def get_rss_items(input_url)
    item_list = []
    URI.open(input_url) do |rss|
        feed = RSS::Parser.parse(rss)

        feed.items.each do |item|
            # The description is an HTML fragment, so parse it to find the image and link.
            parsed = Nokogiri::HTML.parse(item.description)

            # String#[] with a capture index returns nil (instead of raising) when a field is missing.
            author = item.description[/author:\s*((\w|\s|\.|\/)*)/, 1]
            # Parsed here but not used below; the payload takes its date from pubDate.
            date_read = item.description[/read\s*at:\s*((\d|\/)*)/, 1]
            img_src = parsed.at_xpath('//img/@src')&.value
            url = parsed.at_xpath('//a/@href')&.value
            # Last path segment of the item link, without the query string.
            guid = item.link.split('/')[-1].split('?')[0]

            payload = {
                'title' => item.title,
                'link' => url,
                'img_src' => img_src,
                'date_read' => item.pubDate&.to_date,
                'author' => author,
                'guid' => guid
            }
            puts item.title
            item_list << payload
        end
    end
    item_list
end

# Runs once per build, just before rendering, so the lists are ready for Liquid.
Jekyll::Hooks.register :site, :pre_render do |site|
    # Placeholders: replace with the RSS URLs of your own shelves.
    reading_url = '[your-url-for-reading-books]'
    read_url = '[your-url-for-read-books]'

    site.data['reading'] = get_rss_items(reading_url)
    site.data['read'] = get_rss_items(read_url)
end

Boom! Now we’re storing our lists in site.data['reading'] and site.data['read'], respectively. Saving this to a Ruby file and popping it in the _plugins directory makes it run on every build. Now that the data is accessible to the site and the pages therein, Liquid makes it simple to iterate through it. Since things are always changing, you can check out my current Goodreads script here.

<div>
{% for book in site.data['reading'] %}
    <a href="{{ book.link }}">
      <div>
        <h5>{{ book.title }}</h5>
        <p>{{ book.author }}</p>
        <!-- <small>{{ book.date_read }}</small> -->
      </div>
    </a>
{% endfor %}
</div>

## Read

<div>
{% for book in site.data['read'] %}
    <a href="{{ book.link }}">
      <div>
        <h5>{{ book.title }}</h5>
        <p>{{ book.author }}</p>
        <!-- <small>{{ book.date_read }}</small> -->
      </div>
    </a>
{% endfor %}
</div>

And voila! The books page arrives at its present form. Marking a book as read or reading on Goodreads will flow through to the site nightly and 100% automatically!
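Because the build now runs unattended, it may be worth making sure a feed outage cannot break it. The sketch below is one option and not part of the plugin above: `with_fallback` is a hypothetical helper that returns an empty list when the wrapped block raises, and the failure here is simulated.

```ruby
# Hypothetical helper: run the block, but return `fallback` if it raises,
# printing the error to stderr so the failure still shows up in the build log.
def with_fallback(fallback = [])
  yield
rescue StandardError => e
  warn "Goodreads fetch failed: #{e.class}: #{e.message}"
  fallback
end

# Usage inside the hook would look like:
#   site.data['reading'] = with_fallback { get_rss_items(reading_url) }

ok     = with_fallback { [{ 'title' => 'Example Book' }] }
failed = with_fallback { raise IOError, 'simulated network failure' }
```

The trade-off is that a failed fetch publishes an empty list rather than failing loudly, so whether this is the right call depends on which outcome you would rather see on the live page.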

I’d love to hear if you found this helpful or implemented your own solution. Don’t hesitate to reach out via one of the links on-site!