There was quite an interesting event in the world of cloud computing recently: this month Amazon announced a new service called Elastic MapReduce.
Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).
Using Amazon Elastic MapReduce, you can instantly provision as much or as little capacity as you like to perform data-intensive tasks for applications such as web indexing, data mining, log file analysis, machine learning, financial analysis, scientific simulation, and bioinformatics research. Amazon Elastic MapReduce lets you focus on crunching or analyzing your data without having to worry about time-consuming set-up, management or tuning of Hadoop clusters or the compute capacity upon which they sit.
For customers this means:
Do you have a CPU-intensive task that you want to run in the cloud? Perhaps stock market analysis, sales optimization, or video rendering? You can run it on Amazon's machines, paying only for the CPU hours actually consumed.
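To make the pay-per-hour point concrete, here is a back-of-envelope sketch; the hourly rate, cluster size and duration below are made-up illustration values, not Amazon's actual pricing:

```python
# Back-of-envelope bill for renting a temporary cluster by the hour.
# All numbers are hypothetical illustration values, not real Amazon rates.
def cluster_cost(rate_per_hour, instances, hours):
    """Total cost of running `instances` machines for `hours` hours."""
    return rate_per_hour * instances * hours

# e.g. 20 machines at $0.10/hour for a 6-hour job:
print(f"${cluster_cost(0.10, 20, 6):.2f}")  # $12.00
```

The point is that the bill scales with usage: once the job finishes and the machines are released, the meter stops.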
This recent Elastic MapReduce offering simplifies the job significantly by providing an integrated environment that takes away the burden of configuring, scheduling and managing virtual clusters (slashing the cost of delivering such a project). Developers can focus on the actual problems instead.
This seems like quite a good deal for start-up companies that have some computations to run but can’t (yet) afford the luxury of owning a dedicated data center.
Another nice thing about Elastic MapReduce is that it didn't require a huge investment from Amazon: they simply adopted a widely used open source project called Apache Hadoop.
What’s the situation with cloud computing in the .NET world?
Unfortunately (as happens with a lot of IT innovations) it will be some time before the .NET world can benefit from such a technology. Amazon Elastic MapReduce currently supports only “Java, Ruby, Perl, Python, PHP, R, or C++” (although they might add .NET support by plugging Mono into their Linux machines with Hadoop streaming).
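Hadoop streaming is worth a quick illustration, since it is the very mechanism that could make Mono-based .NET jobs possible: a streaming job is just a pair of programs that read lines from stdin and write tab-separated key/value pairs to stdout, with Hadoop sorting the mapper output by key in between the two stages. A minimal word-count sketch (the function and stage names are my own, not part of any Amazon or Hadoop API):

```python
import sys
from itertools import groupby

def mapper(lines):
    """Emit one tab-separated (word, 1) pair per word -- the format
    Hadoop streaming expects a mapper to write to stdout."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    """Hadoop delivers mapper output sorted by key, so counting a word
    reduces to a running sum over each group of identical keys."""
    pairs = (line.split("\t") for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

if __name__ == "__main__" and len(sys.argv) > 1:
    # Run as `python wordcount.py map` or `python wordcount.py reduce`
    # inside a streaming job; Hadoop performs the sort between the two.
    stage = mapper if sys.argv[1] == "map" else reducer
    for out in stage(sys.stdin):
        print(out)
```

Since the contract is nothing more than text over standard I/O, any runtime that can read stdin and write stdout, including Mono, could in principle participate.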
The primary commercial .NET cloud computing provider is Windows Azure, which is planned to hit production this fall.
At the moment Azure does not even have any capabilities to control virtual clusters programmatically (unless you consider the Web Dashboard to be some form of API), but this could be fixed easily. When that happens, Windows Azure will have a feature set comparable to Amazon EC2 and S3, but without Elastic MapReduce.
The last critical piece missing from the .NET world is a Hadoop equivalent. It is essentially the gateway into efficient cloud computing for small companies (we assume that such companies can’t afford a dedicated team of professional developers to re-implement a cluster management and scheduling framework for Azure).
I’m not aware of any open source .NET equivalent of Hadoop out there (although there are a few abandoned attempts on Google Code).
The closest candidates come from Microsoft Research: Dryad and DryadLINQ. Dryad deals with executing an arbitrary computational task (mostly distributed data processing) over a dedicated cluster in a controlled and reliable fashion, while the DryadLINQ provider makes this complex process simple enough that even interns could develop, debug and execute efficient distributed computations over the cluster.
Unfortunately, these two projects are not open source. They also target a cluster environment, which differs from the cloud computing environment (in being more predictable and manageable). This makes it harder to bring Dryad into the Azure world.
Ceteris paribus, it looks like some time shall pass before either the .NET community comes up with an NHadoop implementation or Microsoft brings its Dryad project out of the forest and closer to the sky.
In either case such a project would involve dealing with a few quite interesting challenges:
- resource allocation and de-allocation within the cloud (to minimize the monthly bill);
- creating an efficient execution plan for the cloud and actually executing it;
- data distribution and management;
- handling various failures in the communications, computing or storage areas;
- adapting the execution plan to ever-changing (weather) conditions in the cloud (one could call it “on-the-fly DAG evolution”).
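For a feel of the last two bullets, here is a toy control-flow skeleton: it runs a DAG of tasks in dependency order and retries each one a few times, which is the bare minimum a cloud scheduler has to do when nodes and storage fail. Everything here (names, structure) is illustrative only, not any real cloud API:

```python
# Illustrative sketch of the failure-handling bullet above: execute a DAG
# of tasks in dependency order, retrying each task before giving up.
# This only shows the control-flow skeleton; it is not a real framework.
def run_dag(tasks, deps, max_retries=3):
    """tasks: name -> callable; deps: name -> list of prerequisite names."""
    done = set()
    while len(done) < len(tasks):
        ready = [t for t in tasks if t not in done
                 and all(d in done for d in deps.get(t, []))]
        if not ready:
            raise RuntimeError("cycle or unsatisfiable dependency in the DAG")
        for name in ready:
            for attempt in range(max_retries):
                try:
                    tasks[name]()          # in the cloud: dispatch to a node
                    done.add(name)
                    break
                except Exception:          # transient node/storage failure
                    continue
            else:
                raise RuntimeError(f"task {name!r} failed {max_retries} times")
    return done
```

A real implementation would also re-plan on the fly (the “on-the-fly DAG evolution” above), re-allocating nodes as they come and go instead of assuming a fixed pool.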
Only time will tell whether such functionality can be delivered to Windows Azure, and when this would happen. The first player to do that could score some nice victory points.
So what do you think about this picture of cloud computing in the .NET world? What important factors or pieces did I miss? Do you need such functionality to be available for your projects?
PS: The discussion continues in Why is Cloud Computing important for us?