Evernote Extraction

I take notes all the time. I love having access to my notes wherever I go, and Evernote does that. However, I’ve become increasingly dissatisfied with the complexity of their client software. Also, they recently stopped supporting Geeknote, a CLI client. [1] Geeknote has its own problems, so maybe it’s time to make a change.

After evaluating a number of solutions, I settled on vimwiki. [2] Vimwiki will let me manage my information in plaintext and I can even publish an HTML version of it. My entire collection of notes should be small enough that I can pull everything down to my phone. Now I just have to extract my data from Evernote. Easy, right?

Evernote doesn’t make a desktop client for Linux, and I need the desktop client to export my data, so I fired up my Mac Mini. I exported each of my notebooks into a separate enex file (Evernote’s XML format). Looking at the result, I wonder if it’s even valid XML. How am I going to get my data out of here?
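
Heavily trimmed and pretty-printed (the real thing mashes everything onto enormous lines), each note in the file looks something like this; I’m paraphrasing the structure from my own export, so don’t treat it as a spec:

<?xml version="1.0" encoding="UTF-8"?>
<en-export>
  <note>
    <title>Example note</title>
    <content><![CDATA[ ...the note body, in Evernote's ENML markup... ]]></content>
    <created>20110725T143000Z</created>
    <resource>
      <data encoding="base64"> ...a base64 blob... </data>
      <mime>image/jpeg</mime>
      <resource-attributes>
        <file-name>photo.jpg</file-name>
      </resource-attributes>
    </resource>
  </note>
</en-export>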

My first move is to install html-xml-utils. After experimenting with hxpipe and hxextract, it seems like the html-xml-utils are more about manipulating HTML/XML while retaining the format, not separating the data from the format.

I had a quick chat with tomasino [3] and he referred me to ever2simple [4], a tool that aims to help people migrate from Evernote to Simplenote. After some trial and error, I was able to install ever2simple, but I first had to install python-pip, python-libxml2, python-lxml, and python-libxslt1.
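
If you’re playing along at home, the whole dance amounted to something like this (package names are what I remember from my Debian-based box, so adjust to taste):

sudo apt-get install python-pip python-libxml2 python-lxml python-libxslt1
pip install ever2simple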

I’m starting with one of my smallest notebooks, a journal, just so I can prove the concept. I want to migrate these journal entries to my journal.txt file that I maintain with jrnl. [5] I tried the -f option first, hoping this would just give me a folder full of text files. That’s exactly what it does, but there’s no metadata. I need the timestamps. Using ever2simple with the -f json option gives me my metadata, but now everything is in a huge JSON stream. After some experimentation with sed, I conclude that sed is not the right tool for this job.
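
For the record, that export step looks like this (assuming the notebook came out of Evernote as journal.enex):

ever2simple -f json -o journal.json journal.enex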

I remember hearing about something called jq that should let me work with JSON. The apt package description for jq starts with “jq is like sed for JSON…”. Well, I’m sold. Also, no dependencies! What a bonus. The man page is full of explanations and examples, but I’m going to need to experiment with the filters. After some fiddling, I land on

jq '.[] | .createdate,.content' journal.json

This cycles through each top-level element and extracts the createdate and content values. Now I wonder how I can add a separator so that I can dissect the data into discrete files with awk or something. I should be able to add a literal to the list of filters.

jq '.[] | .createdate,.content,"%%"' journal.json

Well, the %% lines include the quotes, since jq prints its output as JSON by default, so a bare string literal comes out quoted. That’s not the end of the world. I wonder what date format I need for jrnl. Each jrnl entry starts with

YYYY-MM-DD HH:MM Title

Evernote gives me dates that look like

Jul 25 2011 HH:MM:SS

date --help to the rescue!
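
With GNU date, a one-off conversion is something like this (the time here is made up):

date -d 'Jul 25 2011 14:30:00' '+%Y-%m-%d %H:%M'

But shelling out to date once per entry from inside a jq pipeline would be clunky.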

Looking at date handling in jq, I should be able to convert the dates from the format used by Evernote to the format used by jrnl with the filter

strptime("%b %d %Y %H:%M:%S")|strftime("%Y-%m-%d %H:%M")

All together, then.

jq '.[] | (.createdate|strptime("%b %d %Y %H:%M:%S")|strftime("%Y-%m-%d %H:%M")),.content,"%%"' journal.json

I still have some garbage in there, but I’m getting close to being able to just prepend this to my journal.txt file. OK, I’m close enough with this:

jq '.[] | (.createdate|strptime("%b %d %Y %H:%M:%S")|strftime("%Y-%m-%d %H:%M")),.content,"%%"' journal.json | sed -e 's/^"//;s/"$//;s/\\n/\n/g' | sed -e '/^ *$/d' >journal.part
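
Once journal.part looks right, prepending it really is the easy part:

cat journal.part journal.txt > journal.new && mv journal.new journal.txt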

Okay, let’s try the recipes notebook. My recipes notebook should be a little more challenging than my journal entries, but it’s not as massive as my main notebook.

ever2simple -f json -o recipes.json recipes.enex

My journal JSON file was 5k; this one is 105k. Running the same command as before gives me pretty legible output. I know some of these notes had attachments, but I don’t see them in the JSON. I wonder if they are mime-encoded in the XML file.

Looking back at my recipes.enex file, attachments do appear to be base64 encoded in the XML, but ever2simple doesn’t copy this data into the JSON file it creates. This makes sense since its target is Simplenote. Maybe html-xml-utils can help me get these files out.

hxextract 'resource' recipes.enex

It looks like the files are encapsulated within resource elements. The resource element contains metadata about the attachment and the base64-encoded data itself is inside a data element. I can isolate the data using hxselect.

hxselect -c -s '\n\n' data < recipes.enex > recipes.dat

This gives me all the mime attachments in a single file. Each base64-encoded file is separated by two newlines. This doesn’t preserve my metadata, but I’m anxious to get the data out and see what’s in there. Let’s see if I can pipe the first one into base64 -d to decode it. An awk one-liner should let me terminate output at the first blank line.

awk '/^$/ {exit}{print $0}' recipes.dat | base64 -d > testfile

Now I can use file to find out what kind of file it is.

file testfile

This tells me that it’s an image. A JPEG, to be specific: 300 dpi and 147x127 pixels. That seems small. I wonder if Evernote encoded all of the images that were in the HTML pages I saved. Opening the file in an image viewer, I can see that that’s exactly what it is. How many attachments are in there? Could I…

sed -e '/^./d' recipes.dat | wc

Damn, that’s slick. The sed deletes every non-blank line, so wc is really counting the blank separator lines: there are 74 files in there. I’ll bet only a handful of them have any value to me. I think the easiest way forward is to copy each base64 attachment into its own file. Looking at split(1), it splits on line count, not on a delimiter. What if I do something like…

#!/usr/bin/awk -f
# Split recipes.dat into one file per attachment: every blank
# separator line bumps the counter that names the output file.
BEGIN {fcount=1}
/^$/ {fcount++}
{ print $0 >> "dump/" fcount ".base64" }
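
One catch: awk won’t create the dump directory for me, so assuming I save this as split.awk, running it looks like:

mkdir -p dump
awk -f split.awk recipes.dat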

This goes through my recipes.dat file and puts each base64-encoded attachment into its own file. Now I need to decode them and give them an appropriate suffix.

#!/bin/bash
# Decode every dump/*.base64 file, then rename the result with
# an extension taken from the file type reported by file(1).
for f in dump/*
do
  outfile="${f%.*}.out"
  base64 -d "${f}" > "${outfile}"
  # file(1) prints "dump/1.out: JPEG image data, ..."; strip
  # the leading filename, then keep the first word of the type.
  type=$(file "${outfile}")
  type="${type#* }"
  type="${type%% *}"
  newout="${outfile%.out}.${type}"
  mv "$outfile" "$newout"
done

Phew! Now I have 74 files to look through. Most of these are garbage from web pages I saved. There are really only five that I want to keep. There are a few problems with this approach:

  • I lose the original file name (more on that below).
  • I use the file utility to reconstruct the filename extension.
  • I lose the association between the file and the note I saved it in.
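
For the file names, at least, the enex format appears to keep them in a file-name element inside each resource (that’s what I see in my export, anyway). The same hxselect trick from before might pull them out in order, though resources without an original file name, like clipped web images, just won’t show up:

hxselect -c -s '\n' file-name < recipes.enex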

Looking at my main notebook, I may revisit ever2simple’s -f option. I could even look at the source and see if there’s a way to tack on metadata.

I assume there are better ways to go about this, but I love challenges like this because they’re an excuse to learn new tools and get better at using the tools I’m already familiar with. Next time, I’ll show you how I migrate this information to vimwiki.

References

  1. http://www.geeknote.me/
  2. https://vimwiki.github.io/
  3. gopher://gopher.black
  4. https://github.com/claytron/ever2simple
  5. http://jrnl.sh