From Del.icio.us to Pinboard.in with Python 20 Comments

The sad news that Yahoo plans to shut down del.icio.us reached me this week (although there's still hope). I use del.icio.us pretty much every day and was a little traumatized upon hearing this. Once I had finished wailing and gnashing my teeth, I set out looking for somewhere to go.

There are many bookmarking sites/services out there, but I fear change, and pinboard.in seemed like the closest thing to a plain replacement. It even supports the same API as del.icio.us. There's a small charge for signing up, but no recurring fee, so I broke out the credit card and joined up.

The next step was to figure out how to migrate my bookmarks. del.icio.us provides an export-to-HTML feature in its settings area, but a quick look at the export revealed that some data was missing (mostly extended descriptions). Rabid googling revealed a lesser-known XML export mechanism. To use it, visit https://api.del.icio.us/v1/posts/all, enter your username and password, and save the resulting XML file.
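Before importing anything, it is worth checking that the XML export really does carry the extended descriptions. Here is a minimal Python 3 sketch. It assumes the export is a root element holding one post element per bookmark, with the attributes the migration script reads (href, description, extended, tag, time, shared). The sample document and the summarize_export name are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the shape the migration script expects:
# one <post> element per bookmark, with all data held in attributes.
SAMPLE_EXPORT = """<posts user="example">
  <post href="http://example.com/" description="Example"
        extended="A longer note" tag="demo test"
        time="2010-12-17T10:00:00Z" shared="no"/>
  <post href="http://example.org/" description="Another"
        extended="" tag="demo" time="2010-12-18T09:30:00Z"/>
</posts>"""


def summarize_export(xml_text):
    """Return (total posts, posts that have an extended description)."""
    root = ET.fromstring(xml_text)
    posts = list(root.iter("post"))
    with_extended = [p for p in posts if p.get("extended", "").strip()]
    return len(posts), len(with_extended)


total, extended = summarize_export(SAMPLE_EXPORT)
print("%d bookmarks, %d with extended descriptions" % (total, extended))
```

For a real export, read the saved file's contents and pass them to summarize_export instead of the sample.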

Now to get my bookmarks into pinboard.in. I broke out my trusty text editor and battered together the script below, which works just fine. A few hours later all my bookmarks are in pinboard.in, their bookmarklets are installed in my browser, and I'm loving their read later features. Sean is a happy geek again.

You can download my migration script. To use it:

python delmigrate.py backup.xml username password

Here's the source for the curious.

from xml.dom import minidom
import sys

import urllib
import urllib2
import time

user = sys.argv[2]

password = sys.argv[3]

endpoint = "https://api.pinboard.in"

url = "/v1/posts/add?"

#open the xml file to import from and parse it
f = open(sys.argv[1], "r")

doc = minidom.parse(f).documentElement

#keep count of how many urls have been imported
urlcount = 0

count = 0
ellength = len(doc.childNodes)

failcount = 0
while count < ellength:
    e = doc.childNodes[count]

    if e.nodeType == e.ELEMENT_NODE:
        print "import url %s" % urlcount

        #get the attributes from the xml
        href = e.getAttribute("href")
        description = e.getAttribute("description")
        extended = e.getAttribute("extended")
        tags = e.getAttribute("tag")

        dt = e.getAttribute("time")
        rargs = dict(url=href, description=description, extended=extended,
                        tags=tags, dt=dt)
        shared = e.getAttribute("shared")

        if shared.strip() == 'no':
            rargs['shared'] = 'no'

        #encode the values as UTF-8 byte strings so urlencode can handle them
        rargs = dict([k, v.encode('utf-8')] for k, v in rargs.items())

        print rargs
        #build the request to send
        #set up http auth for pinboard.in
        #doing this for every request may seem wasteful, but urllib2
        #seems to forget the auth details after a half dozen requests
        # if you dont
        password_manager = urllib2.HTTPPasswordMgrWithDefaultRealm()
        password_manager.add_password(None, endpoint, user, password)

        auth_handler = urllib2.HTTPBasicAuthHandler(password_manager)
        opener = urllib2.build_opener(auth_handler)

        urllib2.install_opener(opener)


        request = urllib2.Request(endpoint + url + urllib.urlencode(rargs))

        #set the user agent
        request.add_header('User-Agent','SeansDeliciousMigrater')
        try:
            #send the request and read the response
            r = opener.open(request)
            response = minidom.parse(r).documentElement.getAttribute("code")

        except Exception, err:
            #use a separate name so the current element e is not overwritten
            response = str(err)

        #if we get an invalid response, wait and retry; after 5 failures
        #in a row abort, we are probably being throttled
        if response !="done":
            failcount += 1

            print "Failure: Invalid response: %s" % response
            if failcount > 4:

                print "Aborting: Invalid response %s" % response
                break
            else:
                print "waiting for 30 seconds and retrying"

                time.sleep(30)
        else:
            failcount = 0

            count += 1
            #put in a delay between requests to reduce odds of throttling
            time.sleep(1)

            urlcount += 1
    else:
        count += 1

print "%s urls imported" % urlcount
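The retry logic in the script (wait 30 seconds after a bad response, give up after five failures in a row) can be pulled out into a small helper. This is a Python 3 sketch rather than part of the original script. The send_with_retries name and its parameters are hypothetical, and the sleep function is passed in so the logic can be tried without real delays.

```python
import time


def send_with_retries(send, max_failures=5, wait_seconds=30, sleep=time.sleep):
    """Call send() until it returns "done" or fails max_failures times in a row.

    send is any zero-argument callable that returns the response code string.
    Returns True on success, False if the caller should abort.
    """
    failures = 0
    while True:
        try:
            response = send()
        except Exception as err:  # treat any error like an invalid response
            response = str(err)
        if response == "done":
            return True
        failures += 1
        if failures >= max_failures:
            return False
        sleep(wait_seconds)


# One failure followed by a success, recording the waits instead of sleeping.
responses = iter(["something went wrong", "done"])
waits = []
ok = send_with_retries(lambda: next(responses), sleep=waits.append)
print(ok, waits)
```

In the real script, send would wrap the opener.open call and the parsing of the response code.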
The Future comes to pass 0 Comments

All the shares are owned by those companies in equal measure, and I can tell you that their regulations are written in Python.

Charles Stross - Accelerando 2005

We are proposing to require that most ABS issuers file a computer program that gives effect to the flow of funds, or “waterfall,” provisions of the transaction. We are proposing that the computer program be filed on EDGAR in the form of downloadable source code in Python. …

SECURITIES AND EXCHANGE COMMISSION - 17 CFR Parts 200, 229, 230, 232, 239, 240, 243 and 249 Release Nos. 33-9117; 34-61858; File No. S7-08-10 RIN 3235-AK37 ASSET-BACKED SECURITIES - 2010

via Sean McGrath

Streaming uploads to S3 with Python and Poster 4 Comments

Every Amazon S3 library I can lay my hands on (for Python at least) seems to read the entire file to be uploaded into memory before sending it. This might be OK when uploading lots of small files, but I have needed to upload a lot of very large files, and my poor old server would creak under the weight of that kind of memory usage.

I managed to bolt a solution together using urllib2 and poster that has been working reliably for me for the past few months. I'm going to show you:

A little about how S3 works

S3 is essentially a big Python dictionary in the cloud: you give it a key and a value (a file) to store, and later on you can read it back out again. S3 has a nice HTTP API, so you can read and write to the store using standard HTTP libraries.

The area you put your files into is called a bucket. Bucket names (which have restrictions) are globally unique. That is, if you make a bucket called holiday_photos, then no one else using S3 can have a bucket called holiday_photos. That might sound weird, but it has its advantages: you can now access your files from http://holiday_photos.s3.amazonaws.com/. If you set the permissions up so anyone can read the contents of the bucket, the whole world can see your files via http://holiday_photos.s3.amazonaws.com/.

The flip side of this is that you can upload your files, let's say "meonthebeach.jpg", by using HTTP PUT, in this case a PUT to http://holiday_photos.s3.amazonaws.com/meonthebeach.jpg.

When uploading to S3, we need to provide a few HTTP headers along with our file data when we PUT.

Date - The current date and time in a specific format, e.g. Wed, 01 Mar 2006 12:00:00 GMT. I generate it with time.strftime("%a, %d %b %Y %X GMT", time.gmtime())

Content-Type - The MIME type of the file being uploaded, e.g. text/html. Python's mimetypes module does a good job of guessing this for any given file based on its extension: mimetypes.guess_type(filename)[0]

Content-Length - The length of the data to be uploaded, as defined in RFC 2616. If you are uploading the file from disk you can get this with the os module's stat function: os.stat(filename).st_size

x-amz-acl - Optional. This tells S3 which default access control policy to use. By default the file will be available only to the logged-in owner of the bucket; to make it publicly readable, set it to public-read.

Authorization - This is the tricky one. S3 requires that your PUT request be accompanied by an authorization string in the following format: AWS AWS_ACCESS_KEY_ID:SIGNATURE. The AWS_ACCESS_KEY_ID is the one provided to you when you signed up to S3.
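The first four headers need nothing beyond the standard library. Here is a minimal Python 3 sketch that gathers them into a dictionary. The build_upload_headers name is hypothetical, and the fallback to application/octet-stream for unrecognised extensions is my own assumption rather than something S3 requires.

```python
import mimetypes
import os
import tempfile
import time


def build_upload_headers(filename, acl="public-read"):
    """Build the Date, Content-Type, Content-Length and x-amz-acl headers."""
    # guess_type returns None for unknown extensions, so fall back to a
    # generic type (an assumption made for this sketch).
    content_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    return {
        # %H:%M:%S is spelled out here in place of %X, which follows the locale.
        "Date": time.strftime("%a, %d %b %Y %H:%M:%S GMT", time.gmtime()),
        "Content-Type": content_type,
        "Content-Length": str(os.stat(filename).st_size),
        "x-amz-acl": acl,
    }


# Try it on a small throwaway file.
with tempfile.NamedTemporaryFile(suffix=".html", delete=False) as tmp:
    tmp.write(b"<p>hello</p>")
headers = build_upload_headers(tmp.name)
os.remove(tmp.name)
print(headers)
```

The Authorization header is left out because it depends on the signature, which is built from these values.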

The signature is built from a string consisting of several of the headers you are sending, along with the resource you are putting, concatenated together and signed with your AWS secret access key. Constructing the signature is quite complicated in the general case, so I am going to show a method of generating it for the specific type of upload request we will be making. If you need to send headers that we are not using here, see Amazon's documentation for how to create the authentication header.

The signature string consists of

PUT\n\n<content-type>\n<date>\nx-amz-acl:public-read\n<resource>

A code example of creating this:

sig_data = "PUT\n\n%s\n%s\nx-amz-acl:public-read\n%s" % (
    content_type, date, resource)

We then take this string, compute an HMAC-SHA1 of it keyed with your secret access key, and base64 encode the result.

signature = base64.encodestring(
    hmac.new(settings.AWS_SECRET_ACCESS_KEY, sig_data, sha1).digest()
).strip()

And that's your signature.
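Put together as a self-contained Python 3 sketch, the two steps look something like this. The sign_put name is hypothetical and the key and key ID are placeholders, not real credentials. In Python 3, hmac wants bytes, so the strings are encoded first (assuming UTF-8).

```python
import base64
import hmac
from hashlib import sha1


def sign_put(secret_key, content_type, date, resource):
    """Sign a public-read PUT request using the string layout shown above."""
    string_to_sign = "PUT\n\n%s\n%s\nx-amz-acl:public-read\n%s" % (
        content_type, date, resource)
    digest = hmac.new(secret_key.encode("utf-8"),
                      string_to_sign.encode("utf-8"), sha1).digest()
    return base64.b64encode(digest).decode("ascii")


signature = sign_put("not-a-real-secret", "image/jpeg",
                     "Wed, 01 Mar 2006 12:00:00 GMT",
                     "/holiday_photos/meonthebeach.jpg")
print("AWS %s:%s" % ("NOT-A-REAL-KEY-ID", signature))
```

The date passed in here must be exactly the same string that is sent in the Date header, otherwise the signature will not match.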

How to use Poster

Poster is a small library that works with urllib2 to allow streaming uploads. All you need to do is import it and call a single function, which registers Poster's custom URL openers with urllib2, and you are good to go.

import urllib2

from poster.streaminghttp import register_openers
register_openers()

Secondly, we need to tell urllib2 to use HTTP PUT rather than POST. We do this by creating a request object and overriding its get_method method.

request = urllib2.Request(url, data=data)

request.get_method = lambda: 'PUT'

And then we can make our request and read the response

response = urllib2.urlopen(request).read()
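As an aside for readers on Python 3, where urllib2 has become urllib.request: the Request class there accepts a method argument, so the get_method override is not needed. This small sketch only builds the request and does not send anything. The URL reuses the example bucket from earlier and the data is a placeholder.

```python
import urllib.request

# Build (but do not send) a PUT request for the example bucket.
request = urllib.request.Request(
    "http://holiday_photos.s3.amazonaws.com/meonthebeach.jpg",
    data=b"file contents go here",
    method="PUT",
)
print(request.get_method())  # PUT
```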

The last step for use with Poster is that, rather than data containing the file object to be uploaded, it should be an iterator that provides the file data chunk by chunk. For example:

def read_data(file_object):

    while True:
        r = file_object.read(64 * 1024)

        if not r:
            break
        yield r

f = open("text.txt","rb")

data = read_data(f)

data is now a generator that will return our file one 64 KB chunk at a time.
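To see what the generator actually yields, here is read_data run against an in-memory stand-in for a file. This is a Python 3 sketch. The chunk_size parameter and the 150 KB fake file are additions for illustration.

```python
import io


def read_data(file_object, chunk_size=64 * 1024):
    """Yield the file's contents in chunks of at most chunk_size bytes."""
    while True:
        chunk = file_object.read(chunk_size)
        if not chunk:
            break
        yield chunk


fake_file = io.BytesIO(b"x" * (150 * 1024))  # stand-in for a 150 KB file
sizes = [len(chunk) for chunk in read_data(fake_file)]
print(sizes)  # [65536, 65536, 22528]
```

Only one chunk is held in memory at a time, which is the whole point when the files are large.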

A simple script to stream uploads to S3

Below is the source for a simple command line tool that will take a filename, a bucket name, and Amazon credentials, and upload the file to the bucket, making it publicly readable.

import os

import sys
import time
import base64
import hmac
import mimetypes

import urllib2

from hashlib import sha1

from poster.streaminghttp import register_openers

def read_data(file_object):
    while True:
        r = file_object.read(64 * 1024)

        if not r:
            break
        yield r

def upload_file(filename, bucket, AWS_ACCESS_KEY_ID, 
              AWS_SECRET_ACCESS_KEY):
    length = os.stat(filename).st_size
    #fall back to a generic type if the extension is not recognised
    content_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    resource = "/%s/%s" % (bucket, filename)

    url = "http://%s.s3.amazonaws.com/%s" % (bucket, filename)

    date = time.strftime("%a, %d %b %Y %X GMT", time.gmtime())

    sig_data = "PUT\n\n%s\n%s\nx-amz-acl:public-read\n%s" % (
                                            content_type, date, resource)
    signature = base64.encodestring(
                hmac.new(
                    AWS_SECRET_ACCESS_KEY, sig_data, sha1).digest()).strip()

    auth_string = "AWS %s:%s" % (AWS_ACCESS_KEY_ID, signature)

    register_openers()
    #open in binary mode so the bytes sent match the Content-Length
    input_file = open(filename, 'rb')

    data = read_data(input_file)
    request = urllib2.Request(url, data=data)

    request.add_header('Date', date)
    request.add_header('Content-Type', content_type)

    request.add_header('Content-Length', str(length))
    request.add_header('Authorization', auth_string)

    request.add_header('x-amz-acl', 'public-read')
    request.get_method = lambda: 'PUT'

    urllib2.urlopen(request).read()

if __name__ == "__main__":

    filename = sys.argv[1]
    bucket = sys.argv[2]

    AWS_ACCESS_KEY_ID = sys.argv[3]
    AWS_SECRET_ACCESS_KEY = sys.argv[4]

    upload_file(filename, bucket, AWS_ACCESS_KEY_ID, 
             AWS_SECRET_ACCESS_KEY)
Send me your OPML 1 Comments

I used to work with a guy (Hi Daniel) who got everyone he knew to send him OPML files from their RSS readers so he could find new gems to subscribe to. I'm feeling kind of bored at the moment, so I am going to repeat his experiment. Anyone who reads this, or sees the related tweet, please send me your OPML file. If your RSS reader makes it difficult to export a list of links, then by all means send them in whatever format you like.

In a week's time, I'll take the results, crunch 'em a little, and put them up for all to see, so you can get the benefit too. My email address can be grabbed from the contact link to the left. Come on, send me your links!

For the curious, here is my current list of feeds.

Readability 0 Comments

Readability is a bookmarklet that removes clutter from webpages to make them more readable. I read from computer screens a lot, but when it comes to longer text I actually prefer to read from the tiny screen on my mobile phone than from a laptop monitor.

I recently began a little reading on typography, and learned of the concept of the comfortable measure. Essentially, approximately 66 characters per line is regarded as the ideal width for readable text.

While Readability does not hit that mark exactly, it's a lot closer than the average overly wide web layout. Give it a try; it can return a lot of the pleasure of reading to computers.

