Fourtytwo is a simple, robust, yet powerful job engine built on top of Windows Azure and SQL Azure.
What does it do?
It was designed to process embarrassingly parallel problems. For example, in Bioinformatics, aligning millions of short sequences back to the human reference genome is one of those problems, because each short sequence can be aligned on its own, without knowing how the others are doing. Thus you can align 1000 sequences on one machine, or get 1000 machines each aligning 1 sequence.
How does it do it?
By transforming files. You should have your data in a bunch of files, each file of an 'ideal' size; you should also have a bunch of command-line programs, or scripts, that can be run on your data files; each program can output yet another data file that becomes the input for the next program. That's how you glue together your 'workflow' (in this sense Fourtytwo is also a workflow tool).
This is the simplest diagram showing how Fourtytwo works. Program 1 is a 'one to one' transformation; it takes one input file and generates one output file. Program 2 is a 'many to one' transformation; it takes two input files (a pair) and generates one output file. (Note: we can also deal with programs that generate multiple output files; it does not affect the principles in any way.) In Fourtytwo we define two lines of text to describe P1 and P2 as transformations, and we call a set of those a 'recipe'. You can cook your own 'recipe' from scratch or just pick someone's ready-made one.
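The two transformation shapes above can be sketched in plain Python. This is only an illustration of the 'one to one' and 'many to one' patterns, not Fourtytwo's actual recipe syntax (which, as described, is short lines of text); the function bodies are stand-ins for your real command-line programs.

```python
def p1(input_path):
    """'One to one': one input file in, one output file out."""
    output_path = input_path + ".p1"
    with open(input_path) as src, open(output_path, "w") as dst:
        dst.write(src.read().upper())  # stand-in for the real program
    return output_path


def p2(input_paths):
    """'Many to one': a pair of input files in, one output file out."""
    output_path = input_paths[0] + ".p2"
    with open(output_path, "w") as dst:
        for path in input_paths:  # concatenation stands in for the real program
            with open(path) as src:
                dst.write(src.read())
    return output_path
```

A pipeline is then just composition over files: `p2([p1(a), p1(b)])` mirrors the diagram, and each output file can feed the next program in the chain.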
What about Windows Azure?
Windows Azure is Microsoft's cloud platform. By using it you could have tens to thousands of instances all running - in our case - Fourtytwo. Back to the diagram: if we spin up two instances, each will grab one file to run P1; then one of them will grab the newly generated pair and run P2, while the other sits there idle (yes, counting sheep). Imagine you have thousands of files that need to go through many programs/scripts (we call this a pipeline); that is a very happy job for Fourtytwo, with 500 instances or 1000, depending on how fast you want it to finish.
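The "each instance grabs the next available file" behaviour described above is a classic shared work queue. Here is a minimal sketch of that model, with an in-process `queue.Queue` and threads standing in for what, on Windows Azure, would be a shared queue and separate instances; the names and the string results are purely illustrative, not Fourtytwo's code.

```python
import queue
import threading

def worker(work_queue, results):
    """One 'instance': keep grabbing work items until told to stop."""
    while True:
        item = work_queue.get()
        if item is None:  # sentinel: no more work for this instance
            work_queue.task_done()
            break
        program, input_file = item
        # Stand-in for launching the real command-line program on the file.
        results.append(f"{program}({input_file})")
        work_queue.task_done()

work_queue = queue.Queue()
results = []
for f in ["reads1.fq", "reads2.fq"]:  # hypothetical input files
    work_queue.put(("P1", f))

# Two 'instances', like the two-node example above.
threads = [threading.Thread(target=worker, args=(work_queue, results))
           for _ in range(2)]
for t in threads:
    t.start()
for _ in threads:
    work_queue.put(None)
for t in threads:
    t.join()
```

Because every instance pulls from the same queue, adding more instances (500 or 1000) speeds up the pipeline without changing the recipe; an instance with nothing to grab simply waits.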
What's the point?
Surely I can do this on my local Sun Grid Engine (SGE) or Beowulf cluster or campus grid or national grid, ..., you say. Yes, you are correct. But then:
- How big is your cluster? 500 cores, 1000 cores? One of the TOP 500?
- How many users are sharing it? Are you the only one running jobs? Are you the Iron Man researcher that everyone else plays nice with? Or are you the peacemaker who never minds when someone else's Python crashes a job of yours that has been running for 21 days? Do you never fear the IT manager knocking on your door, telling you that because your job is using 100% CPU and overheating the whole system, he has no choice but to kill it? The questions can go on and on...
- What about your backend storage infrastructure? Having a SAN does not mean it can cope with your I/O demand. Is it up to the job of managing every researcher's 100 TB of data, or only a bunch of lucky ones'?
- So you have an IT office? Their salaries, the server room, the electricity bill, the cooling, the routine management, replacing failed drives in your SAN, etc. All of these constitute the FULL COST, or TCO. And what about the downtime? How long is the procurement process? How many years do you need to keep the hardware before you can upgrade? What if it's underpowered? Depending on how many people are sharing this TCO, you might pay a little (and thus have a little quota) or pay a lot.
- What if you only need to run one very big job every 3 years? Will you still spend half a million on a cluster and use it as a heater most of the time?
We say, if you run Fourtytwo on Windows Azure:
- In theory you can have as many cores as a Microsoft data centre has. The 3-month free trial account gives you 40 cores; if you email the support team, you can have 100-400 cores even as a non-paying customer; bankers in London normally spin up 4000-5000 at their 'peak' time and 0 at 'off-peak'. Yes, you've got it: this is the so-called 'utility model', the way you pay for water or electricity.
- You are the only user on your little cloud, full stop.
- Windows Azure Storage is very nice, full stop.
- Just budget some money for your data job, and spend 5 minutes learning how to use Fourtytwo to cook your 'recipe'.
- As above, put a number in your grant application, run the job when you are ready, done.
What about Windows Azure HPC Scheduler SDK?
Great if you know this thing. Windows HPC is Microsoft's answer to SGE/Beowulf. Windows Azure HPC Scheduler is the extension of your local HPC cluster into the cloud. The SDK is developer-facing code for you to create solutions utilising the traditional cluster model: as in Windows HPC, you submit jobs that can be MPI-based tasks, Parametric Sweep tasks, or Service (SOA) tasks.
Fourtytwo does things similar to the Parametric Sweep or Service (SOA) tasks, but it was designed purely for the Windows Azure Cloud Service and there is no inherited legacy whatsoever. We believe Fourtytwo is much easier to use, and no programming is necessary. More on this to follow.
What about Hadoop?
Hadoop, and the Map-Reduce / Big Data stuff in general, is a computation model. It expects very large text files, splits them into chunks (64 MB by default), processes them on many nodes, and finally needs to converge, hence the 'Reduce'. It is complex by design, with lots of Java code involved. People went crazy about it, so they throw every problem at it. Fourtytwo wants lots of not-that-big files instead, and is much more efficient for that kind of workload.
Thanks to the whole Windows Azure eco-system, the end point of your 'recipe' does not need to be files only. At any point you can divert your data into SQL Azure, and you'll have a federated RDBMS for further computing, querying, and reporting. You can also connect other Azure services like Mobile Push or Active Directory for unimaginably powerful solutions that a traditional cluster could never dream of. Start your 5 minutes now and learn about the 'recipe'.