Creating an Instapaper clone in Quiver

Binary Adventures

If you’re not familiar with Quiver (a macOS-only app), here’s the long and the short of it:

Quiver is a notebook built for programmers. It lets you easily mix text, code, Markdown and LaTeX within one note, edit code with an awesome code editor, live preview Markdown and LaTeX, and find any note instantly via the full-text search.

I write all of my professional and development-related information in Quiver. Even my blog posts are created there and afterwards exported to Markdown for further processing (more about that in a future post).

While I’ve switched to DEVONthink to capture and store web articles, I wanted to see if I could write an Instapaper clone that saves a simplified version of any URL directly to Quiver. In the unlikely event you’ve never heard of Instapaper: it’s a read-later service that lets you clip web content, save it and read it on any device.

To capture the content of a web page, we have to overcome two challenges:

  1. Web pages usually consist of a lot more than just the article: advertising, comments, sidebars with additional content, and so on. We want to make sure we capture only the main content, not the cruft around it.
  2. The web page content is formatted in HTML. As this format is not supported by Quiver, we’ll need to convert it to Markdown.

Approach

One of the first things we need to familiarize ourselves with is the Quiver Data Format. Each note consists of several JSON files and, optionally, additional resources (binary files).
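
In practice, a single clipped note will end up on disk looking roughly like the sketch below. The file and folder names follow the Quiver Data Format; the descriptions are placeholders for what the script further down writes into content.json and meta.json:

{notebook-uuid}.qvnotebook/
    {note-uuid}.qvnote/
        content.json    (the note title plus a single Markdown cell)
        meta.json       (title, uuid, created_at and updated_at timestamps)
        resources/      (the downloaded images referenced from the Markdown)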

We can tackle the first hurdle by using the URL in the Instapaper bookmarklet (http://www.instapaper.com/text?u={url}) to turn any website into a page with a reduced layout, considerably simplifying the HTML.
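
For example, clipping http://tomaugspurger.github.io/modern-5-tidy.html (the article used as an example later on) means we actually fetch http://www.instapaper.com/text?u=http://tomaugspurger.github.io/modern-5-tidy.html and work with the stripped-down HTML Instapaper returns.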

Next, we’ll use the html2text Python library to turn the content into Markdown-compliant syntax that Quiver understands and can render.
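
As a quick, stand-alone illustration of what html2text does (body_width = 0 is the same option the full script below uses to avoid hard wrapping):

import html2text

parser = html2text.HTML2Text()
parser.body_width = 0   # don't hard-wrap the generated Markdown

print(parser.handle('<h1>Hello</h1><p>A <a href="https://example.com">link</a>.</p>'))
# Prints roughly:
# # Hello
#
# A [link](https://example.com).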

Other third-party libraries we’ll use are requests, BeautifulSoup and the lxml parser.
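
Together with html2text, all of these are available from PyPI, so a single pip install should do (using the package names as published on PyPI):

pip install requests beautifulsoup4 lxml html2text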

Code

The QVR_NOTEBOOK variable contains the path to the Quiver notebook where you’d like to save the clipped web pages (make sure you read the Quiver Data Format).

You can clip a web page by calling clip_url, e.g. clip_url('http://tomaugspurger.github.io/modern-5-tidy.html').

import json
import uuid

from pathlib import Path
from urllib.parse import urlparse
from datetime import datetime

import html2text
import requests
from bs4 import BeautifulSoup


IP_URL = 'http://www.instapaper.com/text?u={url}'
QVR_NOTEBOOK = './Quiver/Quiver.qvlibrary/F54CCC03-A5EC-48E7-8DCD-A264ABCC4277.qvnotebook'


# Download the images and generate UUIDs
def _localize_images(resource_path, img_tags):

    for img_tag in img_tags:
        url = img_tag['src']
        r = requests.get(url)

        # Define the extension and the new filename
        img_ext = Path(urlparse(url).path).suffix
        img_name = '{}{}'.format(uuid.uuid4().hex.upper(),
                                 img_ext)
        img_filename = Path(resource_path, img_name)

        with open(str(img_filename), 'wb') as f:
            f.write(r.content)

        # Convert the original URL to a Quiver URL
        img_tag['src'] = 'resources/{}'.format(img_name)


# Write content.json
def _write_content(note_path, note_title, note_text):
    qvr_content = {}
    qvr_content['title'] = note_title
    qvr_content['cells'] = []
    cell = {'type': 'markdown', 
            'data': note_text}
    qvr_content['cells'].append(cell)

    with open(str(Path(note_path, 'content.json')), 'w') as f:
        f.write(json.dumps(qvr_content))


# Write meta.json
def _write_meta(note_path, note_title, note_uuid):
    timestamp = int(datetime.timestamp(datetime.now()))
    qvr_meta = {}
    qvr_meta['title'] = note_title
    qvr_meta['uuid'] = note_uuid
    qvr_meta['created_at'] = timestamp
    qvr_meta['updated_at'] = timestamp

    with open(str(Path(note_path, 'meta.json')), 'w') as f:
        f.write(json.dumps(qvr_meta))


def clip_url(source_url):
    # Download the Instapaper text version of the URL
    r = requests.get(IP_URL.format(url=source_url))
    r.raise_for_status()
    bs = BeautifulSoup(r.content, 'lxml')

    qvr_note_uuid = str(uuid.uuid4()).upper()

    # Create the folders
    paths = {}
    paths['notebook'] = QVR_NOTEBOOK
    paths['note'] = Path(paths['notebook'], '{}.qvnote'.format(qvr_note_uuid))
    paths['resources'] = Path(paths['note'], 'resources')
    paths['resources'].mkdir(parents=True, exist_ok=True)

    # Replace the original image URLs with links to the local Quiver resources
    _localize_images(paths['resources'], bs.find_all('img'))

    # Remove the Instapaper title bar at the top of the page
    _ = bs.select('body main > div.titlebar')[0].extract()

    # Convert to Markdown
    parser = html2text.HTML2Text()
    parser.protect_links = True
    parser.wrap_links = False
    parser.body_width = 0
    note_text = parser.handle(str(bs.find('main')))

    _write_content(paths['note'],
                   bs.head.title.string,
                   note_text)

    _write_meta(paths['note'],
                bs.head.title.string,
                qvr_note_uuid)

Alternative

Markdown, Please! is an alternative to Instapaper’s bookmarklet URL. The URL to use is http://markdownplease.com/?url={url}.

There are a few differences, though (see the sketch after this list):

  • The webpage is immediately converted to Markdown (contained in the body > pre tag).
  • The text is hard-wrapped at 80 characters, which causes long link URLs to be split across lines.
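
For completeness, here’s a minimal sketch of what clipping via Markdown, Please! could look like. The MDP_URL constant and the function name are my own, and the body > pre selector is based on the description above; the image localization and the unwrapping of the hard-wrapped text are left out:

import requests
from bs4 import BeautifulSoup

MDP_URL = 'http://markdownplease.com/?url={url}'


def fetch_markdown_mdp(source_url):
    # Download the Markdown, Please! version of the URL
    r = requests.get(MDP_URL.format(url=source_url))
    r.raise_for_status()

    # The page already contains Markdown, inside the body > pre tag
    bs = BeautifulSoup(r.content, 'lxml')
    return bs.select_one('body > pre').get_text()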