
How to build a Hacker News Frontpage scraper with just 1 line of Ruby code

Today I stumbled across How to build a Hacker News Frontpage scraper with just 7 lines of R code and felt the need to respond in a superior language. So in this post I will walk you through creating an HN scraper using a single line of Ruby code.

The basics of scraping

A basic scraper works in three stages: retrieving the webpage, parsing the content, and extracting and organizing the data.
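
Sketched in Ruby, that pipeline looks roughly like this (a minimal outline of what we'll build; the selector is a placeholder we'll work out later):

require 'net/http'
require 'nokogiri'

# Stage 1: retrieve the webpage as a raw HTML string
html = Net::HTTP.get(URI('https://news.ycombinator.com'))

# Stage 2: parse the HTML into a searchable document tree
doc = Nokogiri(html)

# Stage 3: extract and organize the data (placeholder selector)
data = doc.css('some-selector').map(&:text)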

Retrieving the webpage

Let's get started with retrieving the webpage. For this we will use Ruby's Net::HTTP.get method, which means we have to require the net/http library. We will use the method in combination with the URI class so we can specify the https protocol and avoid being redirected.

require 'net/http'
p Net::HTTP.get(URI('https://news.ycombinator.com'))

I use the p method to print the method's return value. If we run this program, we will see the retrieved HTML printed to the terminal.

Parsing the webpage

Manually extracting the required elements from the raw HTML output is a pain, so we will use an HTML parser to do it for us. We will use Nokogiri, which can be installed via gem install nokogiri. Nokogiri has very good documentation but also a cool community-maintained cheatsheet.

To parse the HTML, simply wrap the response in Nokogiri(), and you will see the parsed document:

require 'net/http'
require 'nokogiri'
p Nokogiri(Net::HTTP.get(URI('https://news.ycombinator.com')))

Extracting our data

Now that we have a tree of HTML elements we have to search for the exact elements that we want to extract data from. Luckily Nokogiri supports CSS selectors, which we can use to efficiently select the HTML elements we need. Let's take a look at the website to figure out what we need to select.

If you head to https://news.ycombinator.com and press F12, most browsers will open a Developer Tools window. At the top left of that window you will find a handy element picker. Use it to inspect individual elements on the page.

The content we want to scrape is the title, URL, points, age and author.

What we learn from the developer tools is that a post is split across two separate table rows. We can use the first row to get the title and URL.

The second row contains the points, age and author.
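
Simplified, the two rows for a single post look roughly like this (a trimmed-down sketch; the values and some attributes are illustrative):

<tr class="athing">
  <td class="title">
    <span class="titleline">
      <a href="https://example.com/post">Post title</a>
      <span class="sitebit comhead">(<a href="from?site=example.com">example.com</a>)</span>
    </span>
  </td>
</tr>
<tr>
  <td class="subtext">
    <span class="subline">
      <span class="score">256 points</span> by
      <a class="hnuser" href="user?id=someuser">someuser</a>
      <span class="age" title="2023-07-17T10:00:00">2 hours ago</span>
    </span>
  </td>
</tr>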

Title and URL

Let's first focus on the title and URL.

We can get both the title and the URL from the a element itself, which sits inside a span with the class titleline. However, if we were to use the CSS selector .titleline a, it would also select the second a element in the span (the from?site= one). So we will use .titleline > a to select only a elements whose direct parent has the titleline class.
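
We can verify the difference by counting matches (rough numbers, since self-posts like Ask HN have no site link):

require 'net/http'
require 'nokogiri'
doc = Nokogiri(Net::HTTP.get(URI('https://news.ycombinator.com')))
p doc.css('.titleline a').length   # up to 60: the title links plus the from?site= links
p doc.css('.titleline > a').length # exactly 30: only the direct children, the title links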

Let's get that into our code.

require 'net/http'
require 'nokogiri'
puts Nokogiri(Net::HTTP.get(URI('https://news.ycombinator.com'))).css('.titleline > a')

I am using puts here because it prints the elements as HTML tags.[1] The output will be a nice list of a tags.

Let's separate the href attribute and the text itself.

require 'net/http'
require 'nokogiri'
Nokogiri(Net::HTTP.get(URI('https://news.ycombinator.com'))).css('.titleline > a').each {|a|
    p a.text
    p a.attr('href')
    puts
}

For each a tag, we use p to print the text and then the value of the href attribute. Then we use puts to print a newline.

Points, author and age

Now let's focus on the points, age and author. They all have the span.subline element as their parent, so we will select that first, and then use that to extract the rest.

require 'net/http'
require 'nokogiri'
Nokogiri(Net::HTTP.get(URI('https://news.ycombinator.com'))).css('.subline').each {|l|
    p l.css('.score').first.text
    p l.css('.hnuser').first.text
    p l.css('.age').first.text
    puts
}

We grab all the span.subline elements, and for each of those we select the .score, .hnuser and .age spans. We grab the first element found and print the text.

However, the points are currently printed as a string including the word points. We can remove the ' points' suffix using the chomp method and convert the rest to an integer using to_i, like l.css('.score').first.text.chomp(' points').to_i.
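
A quick sanity check of that chain on plain strings:

p "256 points".chomp(' points')      # => "256"
p "256 points".chomp(' points').to_i # => 256
p "1 point".chomp(' points').to_i    # => 1 (to_i stops at the first non-digit anyway)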

The age is currently a human-readable string relative to the current time, which isn't that useful. Instead of taking the text content of span.age, we can grab its title attribute, which contains a timestamp of when the post was made: .css('.age').first.attr('title').
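
If you want an actual Time object, the standard time library can parse that value (assuming the attribute holds an ISO 8601 timestamp like the one below):

require 'time'
p Time.parse('2023-07-17T10:00:00') # interpreted in your local time zone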

Pairing the title and subline elements

So the elements we need are split between the span.titleline and the span.subline. They do not share a convenient common ancestor, because each sits in a separate table row (tr).

Fortunately Nokogiri can search for multiple selectors at once. Like this:

require 'net/http'
require 'nokogiri'
Nokogiri(Net::HTTP.get(URI('https://news.ycombinator.com'))).css('.titleline > a', '.subline')

Unfortunately, this will first find all the title links, and then all the .sublines. It will not return the elements in the order they appear on the page.

Ideally, we want a list of .titleline/.subline pairs. Instead we have a list of 30 title links followed by 30 .sublines.

Let's convert the array to the structure we want. When facing a programming challenge, it's often useful to make the problem more abstract. Let's imagine an array [1,2,3,4,5,6]. We want to convert this to pairs: [[1,4],[2,5],[3,6]]. We can first split the array down the middle using each_slice, which creates an enumerator that we can turn back into an array using to_a. The split array will look like this:

# [1,2,3,4,5,6].each_slice(3).to_a
[
  [1,2,3],
  [4,5,6]
]

Now we can use transpose to convert the rows to columns and columns to rows.

# [1,2,3,4,5,6].each_slice(3).to_a.transpose
[
  [1,4],
  [2,5],
  [3,6]
]

We started with 2 rows and 3 columns; now we have 3 rows and 2 columns. Let's apply our new knowledge to Nokogiri's output.

require 'net/http'
require 'nokogiri'
Nokogiri(Net::HTTP.get(URI('https://news.ycombinator.com'))).css('.titleline > a', '.subline').each_slice(30).to_a.transpose.each { |a,l|
    puts a
    puts l
    puts
}

First we split and transpose the output into pairs. Then we iterate over the pairs, destructuring each one into the variables a and l. When we print these with puts we can see that they're the elements we expect.
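
The destructuring works because Ruby automatically spreads an array argument across multiple block parameters:

[[1,4],[2,5],[3,6]].each { |x,y|
    p [x, y] # x is the first element of each pair, y the second
}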

Now let's use our previous code to print the variables we aim to extract:

require 'net/http'
require 'nokogiri'
Nokogiri(Net::HTTP.get(URI('https://news.ycombinator.com'))).css('.titleline > a', '.subline').each_slice(30).to_a.transpose.each { |a,l|
    p a.text
    p a.attr('href')
    p l.css('.score').first.text.chomp(' points').to_i
    p l.css('.hnuser').first.text
    p l.css('.age').first.attr('title')
}

Now that we can successfully extract all the data we need, let's save it in a structure. We will simply map the existing array of element pairs to a table-like array.

require 'net/http'
require 'nokogiri'
pp Nokogiri(Net::HTTP.get(URI('https://news.ycombinator.com'))).css('.titleline > a', '.subline').each_slice(30).to_a.transpose.map { |a,l|
    [a.text, a.attr('href'), l.css('.score').first.text.chomp(' points').to_i, l.css('.hnuser').first.text, l.css('.age').first.attr('title')]
}

I use pp here to pretty-print the output.[1] The output is a table of posts, perfect. If we remove the newlines and pretend that the require statements don't exist, we end up with this beautiful one-liner:

pp Nokogiri(Net::HTTP.get(URI('https://news.ycombinator.com'))).css('.titleline > a','.subline').each_slice(30).to_a.transpose.map{|a,l|[a.text,a.attr('href'),l.css('.score').first.text.chomp(' points').to_i,l.css('.hnuser').first.text,l.css('.age').first.attr('title')]}

That's not a one-liner

Does it count as a one-liner if you have to remove two require statements?[2] Yes, it does, because we can omit them and require the libraries right from the command line using Ruby's -r option.

We just need to run our program like ruby -rnet/http -rnokogiri hn.rb.

That's right, fuck you R.

Output to a CSV file

R can display its tables like actual tables, so we have to do something similar. Let's write our output to a CSV file.

Ruby's standard library includes a csv module, which can convert arrays to CSV rows using to_csv.
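
to_csv takes care of the quoting and the trailing newline for us:

require 'csv'
p ['Example title, with a comma', 'https://example.com', 256].to_csv
# => "\"Example title, with a comma\",https://example.com,256\n"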

So in our map block we can convert each row to a CSV string using to_csv, and then join the mapped array into one long string that we can write to a file using File.write.

require 'net/http'
require 'nokogiri'
require 'csv'
File.write 'hn.csv', Nokogiri(Net::HTTP.get(URI('https://news.ycombinator.com'))).css('.titleline > a', '.subline').each_slice(30).to_a.transpose.map { |a,l|
    [a.text, a.attr('href'), l.css('.score').first.text.chomp(' points').to_i, l.css('.hnuser').first.text, l.css('.age').first.attr('title')].to_csv
}.join

Perfect, now get rid of everything unnecessary:

File.write 'hn.csv', Nokogiri(Net::HTTP.get(URI('https://news.ycombinator.com'))).css('.titleline > a', '.subline').each_slice(30).to_a.transpose.map {|a,l|[a.text, a.attr('href'), l.css('.score').first.text.chomp(' points').to_i, l.css('.hnuser').first.text, l.css('.age').first.attr('title')].to_csv}.join

And run the file with ruby -rnet/http -rnokogiri -rcsv hn.rb.

Conclusion

So we've built a Ruby program so simple you don't have to leave the command line to run it:

ruby -rnet/http -rnokogiri -e "pp Nokogiri(Net::HTTP.get(URI('https://news.ycombinator.com/newest'))).css('.titleline > a','.subline').each_slice(30).to_a.transpose.map{|a,l|[a.text,a.attr('href'),l.css('.score').first.text.chomp(' points').to_i,l.css('.hnuser').first.text, l.css('.age').first.attr('title')]}"

The original article that made me write this actually took a very different approach; you can see the code here. They selected all the needed elements one by one and then constructed the table, which is definitely a better solution but also definitely less cool. It also doesn't generalize as well to scraping other things.


  1. Ruby comes with four printing methods: print, puts, p and pp. puts is the normal way of printing and converts objects to readable strings using their to_s method. p is mainly for debugging and prints an object in a way that makes its type obvious, using the inspect method; it also returns the object, so you can use it inside an expression without affecting it. pp uses the pretty_inspect method, which generally adds more newlines for readability. print is like puts but doesn't add a trailing newline. (A quick demonstration follows these footnotes.)

  2. The original article actually used 8 lines of R code, not 7, because they didn't count an import statement. You can view the code here and count it yourself.
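
The demonstration promised in the first footnote:

s = "a\nb"
print s # writes the string as-is; no trailing newline is added
puts s  # writes the string and ensures it ends with a newline
p s     # writes "a\nb" (the inspect form) and returns the string
pp s    # like p, but wraps large nested structures across multiple lines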

me@levitati.ng

Created: 2023-07-17

Updated: 2023-07-17
