Show HN: A thin Python library to access HN data using Algolia's API

15 points by santiagobasulto 6 years ago

Hello community. Some time ago I was trying to create a project for my students using Hacker News data. As you might know, HN offers an official API [0], but it's based on Firebase and I felt its main use is building clients rather than querying data.

I found out that Algolia also provides an official REST API [1]. It's exactly what I needed: the ability to "search" HN by keywords, type of story (Show HN, Ask HN, etc.) and/or date.
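For anyone curious what a raw query looks like without a wrapper, here's a minimal sketch that only builds the request URL (no network call is made). The `search_by_date` endpoint and the `query`, `tags`, `numericFilters` and `hitsPerPage` parameters come from Algolia's docs [1]; the `build_search_url` helper and its argument names are made up for illustration.

```python
from urllib.parse import urlencode

BASE_URL = "https://hn.algolia.com/api/v1/search_by_date"


def build_search_url(query="", tags=(), created_after=None, hits_per_page=20):
    """Build a search_by_date URL (hypothetical helper, not part of any library)."""
    params = {"query": query, "hitsPerPage": hits_per_page}
    if tags:
        # Comma-separated tags are ANDed together, e.g. "show_hn,author_pg".
        params["tags"] = ",".join(tags)
    if created_after is not None:
        # created_at_i is the item's creation time as a Unix timestamp.
        params["numericFilters"] = f"created_at_i>{created_after}"
    return BASE_URL + "?" + urlencode(params)


print(build_search_url("thin python library", tags=["show_hn"]))
```

Fetching that URL should return JSON with the matches under a `hits` key, but handling paging, filters and dates by hand gets tedious quickly.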

So I created a thin Python wrapper on top of Algolia's Search API: https://github.com/santiagobasulto/python-hacker-news

The library is at an early stage, but already usable. A few examples:

How to search posts from one user:

    results = search_by_date(
        author='pg',
        hits_per_page=1000)

How to search posts by type (this would find this very post):

    results = search_by_date(
        'thin python library',
        show_hn=True,
        hits_per_page=1000)

I'm working on implementing the other methods. If you have suggestions, please bring them up!

[0] https://github.com/HackerNews/API

[1] https://hn.algolia.com/api

minxomat 6 years ago

Awesome. I imagine this being useful for things that quickly check if something exists on HN or watch for new items etc.

Though I think this needs clarification:

> but it's based on Firebase

The entire HN dataset is also available as a public BigQuery dataset, which enables much more intricate queries. For example, the following query means "Get all Show HNs with 5 or more points and 5 or more comments, along with the decoded submission title and all decoded top-level comments which are neither dead nor deleted" (with paging via LIMIT/OFFSET):

    CREATE TEMPORARY FUNCTION
      HTML_DECODE(enc STRING)
      RETURNS STRING
      LANGUAGE js AS """
    var decodeHtmlEntity = function(str) {
      return str.replace(/&#(\\d+);/g, function(match, dec) {
        return String.fromCharCode(dec);
      }).replace(/&#x([a-fA-F0-9]+);/g, function(match, hex) {
        return String.fromCharCode(parseInt(hex, 16));
      });
    };
    try {
      return decodeHtmlEntity(enc);
    } catch (e) { return null }
    """;
    WITH
      top_shows AS (
      SELECT id, HTML_DECODE(title) AS dtitle
      FROM `bigquery-public-data.hacker_news.stories`
      WHERE descendants >= 5 AND score >= 5 AND title LIKE "Show HN:%"),
      first_comments AS (
      SELECT
        parent,
        HTML_DECODE(REGEXP_REPLACE(text, r"(</?[a-z]+>)|(<a href=\")|(\" rel=\"nofollow\">.+?</a>)", " ")) AS dcomment
      FROM `bigquery-public-data.hacker_news.comments`
      WHERE dead IS NULL AND deleted IS NULL )
    SELECT
      top_shows.id, top_shows.dtitle, first_comments.dcomment
    FROM top_shows JOIN first_comments
    ON first_comments.parent = top_shows.id
    LIMIT 1000 OFFSET 0

So if you need to answer specific questions like this, which could return 500k+ rows, it's better to use BigQuery than to stress the API.
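If you want to sanity-check the decoding step outside BigQuery, here is a rough Python port of the HTML_DECODE function from the query above. It's a sketch, not a drop-in replacement: the `html_decode` name is mine, and like the UDF it only handles numeric entities (decimal and hex), not named ones such as `&amp;`.

```python
import re


def html_decode(enc):
    """Decode numeric HTML entities; returns None on bad input, like the UDF."""
    if enc is None:
        return None
    try:
        # Decimal entities, e.g. &#39;
        step = re.sub(r"&#(\d+);", lambda m: chr(int(m.group(1))), enc)
        # Hexadecimal entities, e.g. &#x27;
        return re.sub(
            r"&#x([a-fA-F0-9]+);", lambda m: chr(int(m.group(1), 16)), step
        )
    except (ValueError, OverflowError):
        # chr() rejects code points outside the Unicode range.
        return None


print(html_decode("Show HN: Bob&#x27;s &#34;thin&#34; library"))
```

For real use in Python, the standard library's `html.unescape` covers named entities as well.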