Cloud Computing: could Windows Azure catch up with Amazon?
There was quite an interesting event in the world of cloud computing recently - this month Amazon announced a new service called Elastic MapReduce.
Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).
Using Amazon Elastic MapReduce, you can instantly provision as much or as little capacity as you like to perform data-intensive tasks for applications such as web indexing, data mining, log file analysis, machine learning, financial analysis, scientific simulation, and bioinformatics research. Amazon Elastic MapReduce lets you focus on crunching or analyzing your data without having to worry about time-consuming set-up, management or tuning of Hadoop clusters or the compute capacity upon which they sit.
For customers this means:
Do you have some CPU-intensive task that you want to run in the cloud? Perhaps stock market analysis, sales optimization or video rendering? You can run this task on Amazon machines, while paying only for the CPU hours consumed.
And the recent offering of Elastic MapReduce significantly simplifies the job by providing an integrated environment that takes away the burden of configuring, scheduling and managing virtual clusters (cutting the cost of delivering such a project). Developers can focus on the actual problems instead.
This seems like quite a good deal for start-up companies that have some computations to run but can't (yet) afford the luxury of owning a dedicated data center.
Another nice thing about Elastic MapReduce is that it didn't require a huge investment from Amazon. They simply adopted a widely used open source project called Apache Hadoop.
What's the situation with cloud computing in the .NET world?
Unfortunately (as happens with a lot of IT innovations) it will be some time before the .NET world can benefit from such a technology. Amazon Elastic MapReduce currently supports only "Java, Ruby, Perl, Python, PHP, R, or C++" (although they might add .NET support by plugging Mono into their Linux machines with Hadoop streaming).
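To make the Hadoop streaming remark a bit more concrete: streaming lets any executable act as a mapper or reducer, as long as it reads lines from standard input and writes tab-separated key/value records to standard output. Here is a minimal Python sketch of that contract, simulated inside one process. The word-count task and the names `mapper`, `reducer` and `run_local` are my own illustrative assumptions, not anything from Amazon's service.

```python
from itertools import groupby
from typing import Iterable, Iterator, List


def record_key(record: str) -> str:
    """Return the key part of a 'key<TAB>value' record."""
    return record.split("\t", 1)[0]


def mapper(lines: Iterable[str]) -> Iterator[str]:
    """Emit one 'word<TAB>1' record per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_records: Iterable[str]) -> Iterator[str]:
    """Sum the counts per key. Input must already be sorted by key."""
    for key, group in groupby(sorted_records, key=record_key):
        total = sum(int(record.split("\t", 1)[1]) for record in group)
        yield f"{key}\t{total}"


def run_local(lines: Iterable[str]) -> List[str]:
    """Simulate map -> sort by key -> reduce in a single process (no cluster)."""
    return list(reducer(sorted(mapper(lines), key=record_key)))


if __name__ == "__main__":
    for record in run_local(["the cloud is big", "The sky is bigger"]):
        print(record)
```

On a real cluster the sort step between the two phases is done by the framework, and the mapper and reducer would be separate programs. Anything that can read stdin and write stdout fits that shape, which is why a Mono-hosted .NET executable looks plausible in principle.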
The primary commercial .NET cloud computing provider is Windows Azure, which is planned to hit production this fall.
At the moment Azure does not even have any capabilities to control virtual clusters programmatically (unless you consider the Web Dashboard to be some form of API), but this could be fixed easily. When this happens, Windows Azure will have a feature set comparable to Amazon EC2 and S3, but without Elastic MapReduce.

The last critical piece missing from the .NET world is a Hadoop equivalent. Basically, it is the gateway into efficient cloud computing for small companies (we assume that such companies can't afford a dedicated team of professional developers to re-implement a cluster management and scheduling framework for Azure).
I'm not aware of any open source .NET equivalent of Hadoop out there (although there are a few dead attempts on Google Code).
At the same time, Microsoft has quite an interesting research project, Dryad, that comes with a LINQ provider called (as you would guess) DryadLINQ.
This project deals with executing an arbitrary computational task (mostly distributed data processing) over a dedicated cluster in a controlled and reliable fashion. The DryadLINQ provider is responsible for making this complex process simple enough that even interns could develop, debug and execute an efficient distributed computation over the cluster.
Unfortunately, these two projects are not open source. They also target a cluster environment, which differs from a cloud computing environment (in being more predictable and manageable). This makes it more difficult to bring Dryad into the Azure world.
Ceteris paribus, it looks like some time will pass before either the .NET community comes up with an NHadoop implementation or Microsoft brings its Dryad project out of the forest and closer to the sky.
In either case such a project would involve dealing with a few quite interesting challenges:
- resource allocation and de-allocation within the cloud (to minimize the monthly bill);
- creating an efficient execution plan for the cloud and actually executing it;
- data distribution and management;
- handling various failures in the communications, computing or storage areas;
- adapting the execution plan to ever-changing (weather) conditions in the cloud (one could call it "on-the-fly DAG evolution").
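To give a feel for just two of these items (executing a plan and surviving failures), here is a deliberately tiny Python sketch. It runs tasks in dependency order and retries a task a few times before giving up. Everything in it is a hypothetical illustration: `run_dag`, the task names and the retry limit are my assumptions, it runs in one process, and it says nothing about how Hadoop or Dryad actually do this. It needs Python 3.9+ for `graphlib`.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set


def run_dag(
    tasks: Dict[str, Callable[[], None]],
    deps: Dict[str, Set[str]],
    max_attempts: int = 3,
) -> List[str]:
    """Run every task after its dependencies, retrying failed tasks.

    tasks maps a task name to a callable; deps maps a task name to the
    names it depends on. Returns the names in the order they completed.
    """
    # Include tasks without dependencies so they are scheduled too.
    graph = {name: deps.get(name, set()) for name in tasks}
    completed: List[str] = []
    for name in TopologicalSorter(graph).static_order():
        for attempt in range(1, max_attempts + 1):
            try:
                tasks[name]()
                completed.append(name)
                break
            except Exception:
                # Give up (and stop the whole run) after the last attempt.
                if attempt == max_attempts:
                    raise
    return completed


if __name__ == "__main__":
    failures_left = {"map": 1}  # simulate one transient failure

    def flaky_map() -> None:
        if failures_left["map"] > 0:
            failures_left["map"] -= 1
            raise RuntimeError("worker lost")

    plan = {"load": lambda: None, "map": flaky_map, "reduce": lambda: None}
    print(run_dag(plan, {"map": {"load"}, "reduce": {"map"}}))
```

A real framework would also have to place tasks on machines, move data between them and change the plan while it runs, which is exactly where the hard part lies.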
Only time will tell whether such functionality can be delivered to Windows Azure, and when. The first player to do that could get some nice victory points.
So what do you think about this picture of cloud computing in the .NET world? What important factors or pieces did I miss? Do you need such functionality to be available for your projects?
PS: Discussion goes on in Why is Cloud Computing important for us?
Sunday, April 12, 2009 at 1:27
Reader Comments (4)
Good post.
I think MS knows it has to compete in cloud computing, so if Elastic MapReduce is successful for Amazon, I would not be surprised if MS releases DryadLINQ.
I haven't experimented with cloud computing yet, so I'm not sure what's missing.
Have you found many limitations with Azure over a local SQL Server/IIS deployment for a website?
Vijay,
I haven't worked with Azure Web Roles, just with the Worker ones. Based on that experience - all logic for persistence access (binary and tables), along with messaging, has to be based on principles that are quite different from local IIS/SQL. That's part of the cost of deploying into the cloud.
Rinat, another very interesting post.
I have a split focus - in my day job I write software for a large organisation, in my spare time I write for much smaller private clients. Budgets and resources are at two sides of the scale.
From both sides of this view I do not think that availability of serious distributed computing resources is going to be a determining factor to success. How many small businesses need a Cray and how many larger businesses that need one and can't afford one are there?!?!
Generally, businesses need data storage and software engines to run day-to-day tasks (query stock, take order, generate invoice...). Even large scale business intelligence engines would not enter the area where they need distributed computing - it's too specialised an area.
Both Amazon and Azure have very usable offerings - service bus/message/data storage - which I can easily see myself using in my smaller scale work. In the larger scale they need a strong business case. Sure, we'll look at swapping BizTalk infrastructure for Dublin (or similar) where we can. But swapping hosting from internal to external will need to pass the cost-to-QoS test. If it isn't much cheaper/better then it will stay in house as it's known and trusted...it works!
In my smaller scale work, Amazon is too expensive for me to have a Windows server. If Azure (or anyone else) can offer me some sensible developer services - build server or (dare I wish for) hosted TFS - then I and many others would jump at it. A reasonable monthly fee along with a CPU time fee would work for me! Along with this would come the developers' live systems...giving developers the space to do their work in the cloud will naturally take their clients' systems along with them. Maybe that will be Cloud 2 ;)
Replied with the post))