I’ve been determined to find a reason to use Node.js in a project since, as the adage goes, you have to start somewhere. What I came up with is something I think I’ll actually use in future projects.

I’ve often wished there were an easy way to give clients access to rake tasks, but in the browser. Rake tasks are great because you can provide real-time feedback in the terminal to show the user what’s going on as it’s happening. Node.js along with Socket.io makes creating this experience in the browser really easy.

This project is a Rails 3.1 engine, so it’s super easy to hook up in your Rails application. Aside from using Rails 3.1+, you need to have Node.js and Socket.io installed. So to get this running:

In your Gemfile:

gem 'rake_ui', :git => "git://github.com/rbrant/rake_ui.git"

In your routes.rb file:

Rails.application.routes.draw do  
  mount RakeUi::Engine => "/rake_ui"
end

Then you need to start the Node server. From your app’s root:

rake app:start_node_server

Once your Rails app is started, visit /rake_ui to see your rake tasks. All stdout generated in your task via something like:

$stdout.puts "hey now"

will be displayed in the black window to the right of the Rake task command listing.

The way it works is that when the app is initialized, all the available rake tasks are stored in memory along with an ID key. The index page displays all the rake tasks and identifies them by their ID. When a task is selected, it gets looked up by its ID and then called via:

Kernel.system("#{@rake_task.command} #{@rake_task.arguments} &> #{Rails.root}/log/rake.log")

As you can see, the task's output is sent to a log file. This log file gets tailed by Node and hooked up to the browser via Socket.io. Very cool, eh? I need to find a real Node project.
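The engine does its tailing in Node, but the idea is easy to show in Ruby too. This is only a minimal sketch of "tailing": remember how far into the file you have read, and on each poll return just the lines appended since then. LogTail and read_new_lines are made-up names for illustration, not part of rake_ui.

```ruby
# Hypothetical helper, not part of rake_ui: returns the lines appended to a
# log file since the previous call, which is the essence of "tailing".
class LogTail
  def initialize(path)
    @path   = path
    @offset = 0 # byte position we have already read up to
  end

  def read_new_lines
    return [] unless File.exist?(@path)

    File.open(@path, 'r') do |file|
      # if the file shrank (a new run truncated the log), start over
      @offset = 0 if file.size < @offset
      file.seek(@offset)
      data = file.read
      @offset = file.pos
      data.lines.map(&:chomp)
    end
  end
end
```

Calling read_new_lines in a loop with a short sleep, and pushing each returned line to the browser, gives roughly the same effect as the Node side of the engine.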

To make this engine truly useful, I’ll probably add a way to blacklist certain tasks so they aren’t available in the UI. Probably not a good idea to allow your clients to drop their database. I also need to scrub the optional arguments.
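Here is a rough sketch of what the blacklist and the argument scrubbing could look like. None of this exists in rake_ui; BLACKLISTED_TASKS, allowed_task? and safe_rake_command are hypothetical names, and the list of patterns is only an example. The one real library call is Shellwords.escape from Ruby's standard library.

```ruby
require 'shellwords'

# Hypothetical names throughout; rake_ui does not ship any of this.
BLACKLISTED_TASKS = [/\Adb:drop/, /\Adb:reset/, /\Adb:schema:load/].freeze

# true when the task name matches none of the blacklist patterns
def allowed_task?(task_name)
  BLACKLISTED_TASKS.none? { |pattern| task_name =~ pattern }
end

# Builds the shell command, escaping every part so that input such as
# "foo; rm -rf /" reaches rake as a single literal argument.
def safe_rake_command(task_name, arguments = [])
  raise ArgumentError, "task not allowed: #{task_name}" unless allowed_task?(task_name)

  (['rake', task_name] + arguments).map { |part| Shellwords.escape(part) }.join(' ')
end
```

The string returned by safe_rake_command could then stand in for the interpolated command and arguments passed to Kernel.system.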

The project is available on Github:

https://github.com/rbrant/rake_ui


The problem:
You have a large number of records that need to be modified. You don’t have the processing resources to accomplish this as quickly as you’d like.

The solution:
Break the data into chunked CSV files, each containing a fixed number of records. Spin up a bunch of EC2 instances, each of which processes one file on startup. This lets you run a large number of processes concurrently and get the job done quickly. 100 ‘micro’ instances all running at the same time will cost $10 per hour ($.10/hour each). That’s not bad!
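To put numbers on that, here is a small back-of-the-envelope sketch. The record count is made up for illustration, the hourly rate is the one quoted above, and billing by the whole instance-hour is an assumption; chunk_count and run_cost are hypothetical helper names.

```ruby
# All inputs below are illustrative assumptions, not measurements.
RECORDS_PER_FILE = 2500

# number of chunk files (and therefore worker instances) needed
def chunk_count(total_records, per_file = RECORDS_PER_FILE)
  (total_records.to_f / per_file).ceil
end

# total cost in dollars, assuming billing by the whole instance-hour,
# so partial hours are rounded up
def run_cost(instances, hourly_rate, hours)
  (instances * hourly_rate * hours.ceil).round(2)
end
```

With 250,000 records at 2,500 per file, chunk_count gives 100 files, and run_cost(100, 0.10, 1) gives the $10 mentioned above.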

Creating the chunked files

file_num = 1

Thing.all.in_groups_of(2500, false) do |thing_group|
  csv_data = FasterCSV.generate do |csv|
    thing_group.each do |thing|
      csv << [thing.attr1, thing.attr2]
    end
  end

  # write each chunk inside the loop, while csv_data is still in scope
  File.open("chunked_things/file#{file_num}.csv", 'w') { |f| f.write(csv_data) }
  file_num += 1
end

Configure an EC2 instance to serve as the source instance
Configure an EC2 instance from which you will create an AMI to be used to spawn the other worker instances. The important part is to make sure this instance has a copy of all the chunked files created above. I'll explain why later. I used an AMI from bitnami, this one to be specific (listed toward the bottom of the page). It's an EBS backed AMI which makes creating AMIs from running instances easier than from S3 backed ones. This may have changed, not sure.

Use cron to run the script at start up
Create a cron task set to run when the instance is booted. You can use the @reboot shortcut to accomplish this:

@reboot /path/to/ruby /path/to/your/script

The goal here is to have your script run when the server boots. This would be pretty straightforward; however, we need to be sure the EC2 instances don't all process the same file at once, or process the same file twice. Enter SQS.

SQS
SQS basically allows you to create a simple queue that holds messages. For us, the messages are the names of the files we want to process. When the server boots and the script runs, it hits the SQS service and asks for the name of a file. SQS responds with a message off the queue (which is the name of a file). That message is locked so no other request will receive it. Once the message is retrieved, you can pop it off the queue. Each spawned instance is guaranteed to get its own filename.

right_aws is an awesome wrapper for AWS. The code to create the queue looks like this:

sqs   = RightAws::SqsGen2.new(aws_key, aws_secret)
queue = RightAws::SqsGen2::Queue.create(sqs, 'chunk_queue')

Populate the queue with your filenames

queue.push 'your_filename'
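With a lot of files you will want to push them in a loop. A minimal sketch, assuming the chunk files live in the chunked_things directory used earlier; enqueue_chunk_files is a made-up helper name, and the only thing it asks of the queue is a push method, so it can be tried with a plain Array before pointing it at SQS.

```ruby
# Hypothetical helper: pushes the basename of every chunk file in dir onto
# the queue and returns the names that were pushed. `queue` only needs to
# respond to #push, so a right_aws queue or a plain Array both work.
def enqueue_chunk_files(queue, dir = 'chunked_things')
  Dir.glob(File.join(dir, '*.csv')).sort.map do |path|
    filename = File.basename(path)
    queue.push(filename)
    filename
  end
end
```

Only the basename is pushed because each worker already has its own copy of the files and just needs to know which one to open.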

Processing the file
To get a filename from the queue:

chunked_filename = queue.pop.to_s

Now we have the file with CSV data to process. Loop through it, hitting whatever service you need to.
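One thing worth guarding against: calling to_s on nil gives an empty string, so if pop comes back empty-handed the script would end up with a blank filename. The sketch below assumes pop returns nil when the queue has nothing left (check the right_aws docs for your version); next_chunk_filename is a hypothetical helper name.

```ruby
# Hypothetical helper: returns the next filename from the queue, or nil
# when nothing is left.
# Assumption: queue.pop returns nil on an empty queue (an Array does).
def next_chunk_filename(queue)
  message = queue.pop
  return nil if message.nil?

  filename = message.to_s.strip
  filename.empty? ? nil : filename
end
```

A worker can then exit quietly when it gets nil back instead of trying to open a file that does not exist.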

Push your modified data back to S3
When you are finished processing your file, you'll want to put the results somewhere. S3 is an obvious choice. Again with right_aws, it's pretty easy. This will basically write your data back to a file on S3:

s3     = RightAws::S3.new(aws_key, aws_secret)
bucket = s3.bucket(s3_bucket_name)
key    = RightAws::S3::Key.create(bucket, "#{s3_bucket_key}#{chunked_filename}")
key.put(your_data)

Altogether now, this is some pseudocode (not tested) that addresses the core parts of the process. This is the script that would be executed at startup:

require 'rubygems'
require 'right_aws'
require 'fastercsv'

# aws
aws_key    = 'your key'
aws_secret = 'your secret'

# s3 bucket
s3_bucket_name = 'your_bucket'
s3_bucket_key  = 'name_of_key'

sqs   = RightAws::SqsGen2.new(aws_key, aws_secret)
queue = RightAws::SqsGen2::Queue.create(sqs, 'chunk_queue')

# pop a file name off the queue
chunked_filename = queue.pop.to_s

modified_data = []

# process the filename SQS gave us (a copy of all files is on each instance)
file_to_process = "#{File.expand_path(File.dirname(__FILE__))}/#{chunked_filename}"

FasterCSV.foreach(file_to_process) do |row|
  data_for_api = "#{row[0]}, #{row[1]}, #{row[2]} #{row[3]}"
  # call_web_service is a placeholder: hit your web service or do what you need to here
  results = call_web_service(data_for_api)

  # use api results
  modified_data << [results.value1, results.value2]
end

# generating csv data
csv_data = FasterCSV.generate{ |csv| modified_data.each{ |modified| csv << modified } }

# put the file on S3
s3       = RightAws::S3.new(aws_key, aws_secret)

# grab the bucket
bucket   = s3.bucket(s3_bucket_name)

# create the S3 key where the csv data will be stored
key      = RightAws::S3::Key.create(bucket, "#{s3_bucket_key}#{chunked_filename}")

# write the data to S3
key.put(csv_data)


Crash course in Firefox extension development – adding the new ‘micro’ instance type to the elasticfox extension

September 14, 2010

Click here to download: ec2u-rbrant.xpi (419 KB) So I’m all set to take advantage of Amazon Web Service’s new ‘micro’ instance type on EC2 (yup, another client moving to EC2). I need to manage it via the elasticfox Firefox extension, but the micro instance type isn’t an option yet in the new instance dialog. Since [...]

Read the full article →

ActsAsCachola – simple caching for AR models (everyone could use a little cachola)

July 16, 2010

And you thought all the clever caching names were taken. What is it ActsAsCachola is a plugin that lets you cache any class method by simply prepending ‘cachola_’ to the method name when calling it. Here’s how it works: Given the following model: class Internet < ActiveRecord::Base acts_as_cachola def self.get_a_million_numbers 1.upto(1_000_000).inject([]){ |numbers, x| numbers

Read the full article →

I want to show you some SAX.

March 12, 2010

I had to process some pretty big xml docs recently from the USPTO. Each doc is about 60mb and (oddly enough) contains several thousand individual documents all concatenated. So the document isn’t valid xml..but that’s a different story. The reason for writing this was to show a quick demo of how to use SAX to [...]

Read the full article →

Good stuff from Bob Martin on software. Like the stuff at 11:20 inre tech/biz disconnect. Not sure about the shirt, however.

February 17, 2010

Posted via email from Rich’s posterous

Read the full article →

Seth Godin’s ebook, ‘What matters now’ Free download. Tons of great ideas/things to think about.

December 15, 2009

Download now or preview on posterous what-matters-now-2.pdf (3073 KB) Posted via email from Rich’s posterous

Read the full article →

Project review: ifwinsight.com

December 6, 2009

It’s easy to forget what you’ve learned and what tools you used from project to project. I thought it might be worthwhile to sort of sum up these things either on a weekly basis or project basis. I had a lot of fun on a recent project and thought it would be a good place [...]

Read the full article →

upgrading to snow leopard – issues with mysql and gems

August 30, 2009

Was definitely worth upgrading to snow leopard from leopard. My machine is noticeably quicker and I picked up quite a bit of new storage space, but it was not without its issues on the dev side of things. The most affected area so far has been mysql. You need to get the 64 bit version: [...]

Read the full article →

Awesome talk at biz of software conf: Seth Godin on why marketing is too important to be left to the marketing department

July 31, 2009
Read the full article →

