24 August 2009

Nerd Factor X

twitter archive backup


Archiving Tweets

If you've run Damon Cortesi's handy curl command to download all (or the last 3200) tweets from your Twitter account, you'll have a directory full of files with names like user_timeline.xml?count=100&page=1. Not only that, but every tweet in them carries a large amount of redundant profile stuff in its <user> element. And not only that, but Twitter sometimes returns a "Twitter is over capacity" page instead of your tweets.

What we want to do is a) detect any files which don't contain tweets, b) remove the redundant user profile, and c) combine the results into a single file.

Well, friends, here is a shell script to do exactly that. You'll need zsh and xsltproc, both of which are standard on Mac OS X and most sane Linuxen.

zsh is needed to sort the input files in numeric, as opposed to lexicographic, order. If you know of a way to do this in bash, let me know...
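One possible answer for bash, offered as a sketch rather than a tested replacement for the zsh version: hand the names to sort(1) and sort numerically on the page number. The function name sort_by_page is made up for illustration, and the approach assumes the filenames follow the user_timeline.xml?count=100&page=N pattern, so that the page number is the third "="-separated field, and that no filename contains a newline.

```shell
#!/usr/bin/env bash
# Print the given filenames one per line, ordered by page number.
# Assumption: names look like user_timeline.xml?count=100&page=12, so the
# page number is the third field when splitting on '='.
sort_by_page() {
    # With no arguments, print nothing rather than a stray empty line.
    [ "$#" -gt 0 ] || return 0
    printf '%s\n' "$@" | sort -t= -k3,3n
}

# Example call (quote or escape ? and & so the shell leaves them alone):
#   sort_by_page user_timeline.xml\?count=100\&page=*
```

A script could read that output back with a while read loop instead of relying on the zsh sorting flags.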

Output is on stdout, so just redirect to your filename of choice:

$ tweetcombine user_timeline.xml\?count=100\&page=* \
    > tweet_archive.xml

Here's the script:


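#!/usr/bin/env zsh
# Combine all of the twitter user_timeline.xml files specified on the command line into a single output
# Written by Alastair Rankine,
# Licensed as Creative Commons BY-SA

# Wrap each filename in an <input> element, in numeric filename order.
# Assumption: the loop body and the closing tags below were missing from the
# published listing and are reconstructed from the description in the post.
inputs=""
for f in ${(on)*}; do
    [[ -f $f ]] || { print -u2 "Not a file: $f"; exit 1; }
    inputs+="<input>${f//&/&amp;}</input>"   # & must be escaped inside XML
done

xsltproc - <<EOF
<?xml version="1.0"?>
<!DOCTYPE inputs [
  <!ATTLIST xsl:stylesheet id ID #REQUIRED>
]>
<?xml-stylesheet type="text/xml" href="#style1"?>
<inputs>
  $inputs
  <xsl:stylesheet id="style1" version="1.0"
                  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <xsl:output method="xml" indent="yes"/>

    <!-- Default: copy every element with its attributes and children -->
    <xsl:template match="*">
      <xsl:copy>
        <xsl:copy-of select="@*"/>
        <xsl:apply-templates/>
      </xsl:copy>
    </xsl:template>

    <!-- Unwrap each file's root so only its children are emitted -->
    <xsl:template match="statuses">
      <xsl:apply-templates/>
    </xsl:template>

    <!-- Drop the redundant user profile -->
    <xsl:template match="user"/>

    <!-- Keep the embedded stylesheet out of the output -->
    <xsl:template match="xsl:stylesheet"/>

    <!-- Load each input file, aborting if it has no statuses root -->
    <xsl:template match="input">
      <xsl:choose>
        <xsl:when test="document(.)/statuses">
          <xsl:apply-templates select="document(.)"/>
        </xsl:when>
        <xsl:otherwise>
          <xsl:message terminate="yes"><xsl:value-of select="."/> does not contain statuses element</xsl:message>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:template>

    <xsl:template match="inputs">
      <statuses type="array">
        <xsl:apply-templates/>
      </statuses>
    </xsl:template>
  </xsl:stylesheet>
</inputs>
EOF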

I think this method of sticking filename arguments into an XSL document with an embedded stylesheet is quite a powerful way of processing XML documents with shell scripts. (Probably should put the <input> tags into a separate namespace though...)
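The first half of that trick, turning filename arguments into a small XML document, can be sketched on its own in plain shell. The function names xml_escape and make_inputs are hypothetical, and the escaping step is an assumption on my part: filenames like user_timeline.xml?count=100&page=1 contain an ampersand, which is not allowed bare inside XML text.

```shell
#!/usr/bin/env bash
# Escape the characters that are special in XML text content.
# The ampersand rule runs first so the other entities are not re-escaped.
xml_escape() {
    printf '%s' "$1" | sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
}

# Print an <inputs> document with one <input> element per argument.
# (Hypothetical helper; a stylesheet would be embedded alongside the inputs.)
make_inputs() {
    local f
    printf '<inputs>\n'
    for f in "$@"; do
        printf '  <input>%s</input>\n' "$(xml_escape "$f")"
    done
    printf '</inputs>\n'
}
```

Keeping the escaping in one helper means the same wrapper can be reused with a different embedded stylesheet, whatever the filenames look like.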


Posted by
Aristotle Pagaltzis
2009-08-24 03:40:35 -0500

A weird sort of power.

  1. You can use [01-32] instead of [1-32] in the curl URL to get zero-padded filenames that sort correctly, and with the -o switch you can clean up the filenames further.
  2. You can download JSON rather than XML.

Bottom line:

curl -k -u user:pass -o tweets-#1.json '[01-32]'
perl -MJSON::XS -E'@s=map{local@ARGV=$_;@{decode_json<>}}@ARGV;delete@{$_}{qw(user source)}for@s;say encode_json\@s' -- tweets-* | json_xs

Tad easier…

Posted by
2009-08-24 10:31:48 -0500

A weird sort of easier:

% curl -k -u randomphrase:shh -o tweets-#1.json ...
zsh: no matches found: tweets-#1.json
% curl -k -u randomphrase:shh -o tweets-\#1.json ...
% perl -MJSON::XS -E'...' -- tweets-* | json&#95;xs
zsh: command not found: json&#95;xs
Can't locate JSON/ in @INC (@INC contains: ... )
BEGIN failed--compilation aborted.

Despite the snarky comment above, yes it does work nicely after installing libjson-xs-perl.

Another nice to have would be to resolve shortened URLs - this is probably a lot easier to do in Perl than XSLT...

BTW: Your comment was marked as "Very Spammy" by Defensio, and had to be manually rescued. This makes me sad.

Posted by
Aristotle Pagaltzis
2009-08-24 16:14:21 -0500

I concede that JSON::XS is less likely to be installed than libxslt.

I started out with your stylesheet actually, but eventually I got too annoyed at all the effort that XSLT takes for very simple cases like this one.

The deciding factor was JSON + dynamic language, so Ruby would work as well as Perl here; I guess it would look cleaner at the expense of a longer command. (Python’s not much for one-liners, however.) Of course you’d ultimately put this in a script, so that’s neither here nor there.

As for the spamminess, that was probably because somehow all the underscores in my code block got turned into &#95; character references, and ASCII characters spelled as NCRs are a popular filter-blinding technique. (On both sides of the war, actually – we use it against spammers too, cf. mailto: hiding.)