Friday, July 26, 2013

Avoiding the Risks of Cloud


It's no secret that cloud computing has transformed the way enterprises do business. It has changed the way developers write software and users interact with applications. By now, almost every business organization has a strategy for adopting the cloud; those that don't will soon be extinct. The influence of the cloud has been so phenomenal that, over the last few years, it has truly turned into a "take it or die" kind of deal.
It is also no secret that the cloud movement today is steered by a handful of giants in the IT industry. Companies like Amazon, Google, Microsoft and Salesforce are clearly among this elite group. These companies, their products and their vision have been instrumental in the introduction, evolution and popularization of cloud technology.
With that being the case, we must think about the implications of cloud computing on the current IT landscape of the world. Are all small and medium-sized organizations around the world going to get rid of their server racks and transfer their IT infrastructure to Amazon EC2? Are all web applications and mobile applications going to be based on Google App Engine APIs? Is all enterprise data going to end up in Amazon S3 and Google Megastore? What sort of defenses are in place to prevent a few IT giants from monopolizing the entire IT infrastructure and services market? How easy would it be for us to migrate from one cloud vendor to another? These are all very real and very important questions that every organization should take into careful consideration.
Fortunately, there are several practical solutions to the above issues. One is openness and standardization. Cloud platforms that are based on open standards and protocols should be preferred over those that use proprietary ones. Open standards and protocols are likely to be supported by more than one cloud vendor, thus enabling users to migrate between vendors easily. In many cases open standards also make it easier to port existing standalone applications to the cloud. Take a Java web application as an example. Most Java web applications are based on the J2EE suite of standards (JSP, Servlets, JDBC etc.). If the target cloud platform supports these open standards, the user can migrate a J2EE app to the cloud without having to make too many changes, and can just as easily move the app from one cloud platform to another as long as both platforms support the same J2EE standards.
Speaking of openness, cloud platforms that are open source and distributed under liberal licenses should get extra credit over closed source ones. An open source cloud platform allows users to modify and shape the platform according to their own requirements, rather than forcing them to change their apps to track changes made by the cloud platform vendor. Also, with an open source cloud framework, users are in a position to maintain and support the platform on their own should the original vendor decide to discontinue support for it.
Another possible solution is to use a hybrid cloud approach instead of relying solely on a remote public cloud maintained by a third-party vendor. A hybrid cloud approach typically involves a private cloud maintained by the user, which selectively bursts into the public cloud to handle high-availability and high-scalability scenarios. This method does involve some additional expense and legwork on the user's part, but the user ultimately remains in control of his data and applications, and no third-party vendor can take that away. Also, as far as most small and medium-sized organizations are concerned, what they expect from the cloud are features like multi-tenancy, self-provisioning, optimal resource utilization and auto-scaling. Spending a few bucks on running a server rack or two to make that happen is usually not a big deal; most companies do that today anyway. However, from a technical standpoint, we need easy-to-deploy, easy-to-maintain and reliable private cloud frameworks that are compatible with popular public cloud platforms to really take advantage of this hybrid cloud model. Fortunately, thanks to some excellent work by a few start-ups like Eucalyptus and AppScale, this is no longer an issue. These vendors provide highly competitive private and hybrid cloud solutions that are fully compatible with widely used public cloud platforms such as AWS and Google App Engine. If the user is capable of procuring the necessary hardware resources and manpower, these cloud platforms can even be used to set up fully-fledged private clouds that have all the bells and whistles of popular public clouds. That's a great way to bask in the glory of the cloud while maintaining full ownership and control over your enterprise IT assets.
Software frameworks like Apache JClouds provide another approach for dealing with the potential risks of the cloud. These frameworks allow user code to interact with multiple heterogeneous cloud platforms by abstracting away the differences between the various clouds. JClouds, as of now, supports close to 30 different cloud platforms including AWS, OpenStack and Rackspace. This means that any application written against JClouds can be executed on around 30 different cloud platforms without any code changes. As the influence of the cloud continues to grow, developers should seriously consider writing their code against high-level APIs like JClouds instead of getting tied to a single cloud platform.
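For a rough idea of what this looks like in practice, here is a minimal sketch using the JClouds BlobStore abstraction (assuming a JClouds 1.x-style API, with placeholder credentials and container names); switching providers is mostly a matter of changing the identifier passed to ContextBuilder.

import org.jclouds.ContextBuilder;
import org.jclouds.blobstore.BlobStore;
import org.jclouds.blobstore.BlobStoreContext;
import org.jclouds.blobstore.domain.Blob;

public class BlobStoreDemo {
    public static void main(String[] args) {
        // "aws-s3" could be swapped for another supported provider identifier
        // without touching the rest of the code.
        BlobStoreContext context = ContextBuilder.newBuilder("aws-s3")
                .credentials("my-identity", "my-credential")   // placeholder credentials
                .buildView(BlobStoreContext.class);
        try {
            BlobStore blobStore = context.getBlobStore();
            blobStore.createContainerInLocation(null, "my-container");
            Blob blob = blobStore.blobBuilder("greeting.txt")
                    .payload("Hello from JClouds")
                    .build();
            blobStore.putBlob("my-container", blob);
        } finally {
            context.close();
        }
    }
}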
Cloud has certainly changed the way we all think about IT and computing. While its benefits are quite attractive, it also comes with a few potential risks. Users and developers should think carefully, plan ahead and take preventive action soon to avoid these pitfalls.

Friday, June 21, 2013

White House API Standards, DX and UX

The White House recently published some standards for developing web APIs. While going through the documentation, I came across a new term - DX. DX stands for developer experience. As anybody would understand, providing a good developer experience is key to the success of a web API. Developers love to program with clean, intuitive APIs. On the other hand, clunky, non-intuitive APIs are difficult to program with and are usually full of nasty surprises that make the developer's life hard. Therefore DX is perhaps the single most important factor when it comes to differentiating a good API from a not-so-good API.
The term DX reminds me of another similar term - UX. As you would guess, UX stands for user experience. A few years ago UX was one of the most exciting topics in the IT industry. For a moment there, everybody was talking and writing about UX and how websites and applications should be developed with UX best practices in mind. It seems that with the rise of web APIs, cloud and mobile apps, DX is starting to generate a similar buzz. In fact, I think that for a wide range of application development, PaaS, web and middleware products, DX will be far more important than UX. Stephen O'Grady was so right: developers are the new kingmakers.

Wednesday, June 19, 2013

Is Subversion Going to Make a Come Back?

The Apache Software Foundation (ASF) announced the release of Subversion 1.8 yesterday. As I read the release notes, I started wondering how Subversion is still alive. The ASF uses Subversion heavily for pretty much everything; in fact, the source code of Subversion itself is managed in a Subversion repository. But outside the ASF I've seen a strong push towards switching from Subversion to Git. Most startups and research groups that I know of have been using Git from day one. WSO2, the company I used to work for, is in the process of moving its code to Git. Being an Apache committer, I obviously have to use Subversion regularly. But about a year ago I started using Git (GitHub to be exact) for my other development activities, and I absolutely adore it. It scales well for large code bases and large development teams, and it makes common tasks such as merging, reverting, reviewing other people's work and branching so much easier and more intuitive.
But as it turns out, Subversion is still the world's most widely used source version control system. As declared in the official blog post rolled out by the ASF yesterday, a number of tech giants, including WordPress, use Subversion heavily. According to Ohloh, around 53% of open source projects use Subversion, compared to 29% for Git. It looks like Subversion has captured quite a share of the market, making it a very hard-to-kill technology. It will be interesting to see how the competition between Subversion and Git unfolds. It seems the new release comes with a bunch of new features, which indicates that the project is very much alive and kicking and that the Subversion community is not even close to giving up on it.

Friday, June 14, 2013

More Reasons to Love Python - A Lesson on KISS

Recently I've been doing some work in the area of programming language design. At one point I wanted to define a Python subset which allows only the simplest Python statements without loops, conditionals, functions, classes and a bunch of other high-level constructs. So I looked into the grammar specification of the Python language and I was astonished by its simplicity and succinctness. Click here to take a look for yourself. It's no longer than 125 lines of text, and the whole thing can be printed on one side of an A4 sheet. This is definitely one of those instances where the best design is also the simplest design. No wonder everybody loves Python.
However, that's not the whole story. Having selected a suitable Python subset, I looked into ways of implementing a simple parser for those grammar rules. I've done some work with JavaCC in the past, so I jumped straight into implementing a Java-based parser for the selected Python subset using JavaCC. After a few hours of coding I managed to get it working too. The next step of my project required me to do some analysis on the abstract syntax tree (AST) produced by the parser. While looking around for existing work that fit my requirements, I came across Python's native ast module. I immediately realized that all those hours I had spent on the JavaCC-based parser were a complete waste. The ast module provides excellent support for parsing Python code and constructing ASTs. This is all you have to do to parse some Python code using the ast module and obtain an AST representation of it:
import ast

# The variable 'source' contains the Python statement to be parsed
source = 'x = y + z'
tree = ast.parse(source)
The ast module supports several modes. The default mode is exec, which parses a sequence of Python statements. The module also supports a special eval mode, which parses a single Python expression. It turned out that the eval mode supports more or less the exact Python subset I wanted to use. So I threw away my JavaCC-based parser and wrote the following snippet of Python code to get the job done.
import ast

# The variable 'source' contains the Python expression to be parsed
source = 'y + z'
tree = ast.parse(source, mode='eval')
Now, when it came to analyzing the AST produced by the parser, the ast module again turned out to be useful. The module provides two helper classes, namely NodeVisitor and NodeTransformer, which can be used to traverse or transform a given Python AST. To use these helper classes, we just need to extend them and implement the appropriate visit methods. There's a top-level visit method and one visit_ method per AST node type (e.g. visit_Str, visit_Num, visit_BoolOp etc.). Here's an example NodeVisitor implementation that flattens a given Python AST into a list.
class NodeEnumerator(ast.NodeVisitor):
  def get_node_list(self, tree):
    self.nodes = []
    self.visit(tree)
    return self.nodes

  def visit(self, node):
    self.generic_visit(node)
    self.nodes.append(node)
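# For illustration: applying the enumerator to the AST produced earlier by
# ast.parse (the 'tree' variable from the previous snippets) yields every node
# in the tree, with child nodes listed before their parents.
enumerator = NodeEnumerator()
for node in enumerator.get_node_list(tree):
  print(type(node).__name__)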
These helper classes can be used to do virtually anything with a given AST. If you want, you can even implement a Python interpreter in Python using this approach. In my case I'm running some search and isomorphism detection algorithms on the Python ASTs.
So once again I've been pleasantly surprised and deeply impressed by the simplicity and richness of Python. It looks like the designers of Python have thought of everything. Kudos to Python aside, this whole experience taught me to always look for existing, simple solutions before doing things in my own complicated way. It reminds me of the good old KISS principle - "Keep It Simple, Stupid".

Friday, April 5, 2013

MDCC - Strong Consistency with Performance

A few weeks back, a couple of my colleagues and I finished developing a complete implementation of the MDCC (Multi-Data Center Consistency) protocol. MDCC is a fast commit protocol proposed by UC Berkeley for large-scale geo-replicated databases. The main advantage of MDCC is that it supports strong consistency for data while providing transaction performance similar to eventually consistent systems.
With traditional distributed commit protocols, supporting strong consistency usually requires executing complex distributed consensus algorithms (e.g. Paxos). Such algorithms generally require multiple rounds of communication. Therefore, when deployed in a multi-data center setting where the inter-data center latency is close to 100ms, the performance of the transactions being executed degrades to almost unacceptable levels. For this reason most replicated database systems and cloud data stores have opted to support a weaker notion of consistency. This greatly speeds up transactions, but you always run the risk of data becoming inconsistent or even lost.
MDCC employs a special variant of Paxos called Fast Paxos. Fast Paxos takes a rather optimistic approach by which it is able to commit most transactions within a single network round trip. This way a data object update can be replicated to any number of data centers within a single request-response window. The protocol is also effectively masterless, which means that if the application is executing in a data center in Europe, it does not have to contact a special master server that could potentially reside in a data center in the US. The only time this protocol doesn't finish within a single request-response window is when two or more transactions attempt to update the same data object (a transaction conflict). In that case a per-object master is elected and the Classic Paxos protocol is invoked to resolve the conflict. If the possibility of a conflict is small, MDCC will commit most transactions within a single network round trip, thus greatly improving transaction throughput and latency.
Unlike most replicated database systems, MDCC doesn't require explicit sharding of data into multiple segments, although sharding can be supported if needed. Also, unlike most cloud data stores, MDCC has excellent support for atomic multi-row (multi-object) transactions. That is, multiple data objects can be updated atomically within a single read-write transaction. All these interesting properties make MDCC an excellent choice for implementing powerful database engines for modern distributed and cloud computing environments.
Our implementation of MDCC is based on Java. We use Apache Thrift as the communication framework between the different components. ZooKeeper is used for leader election (we need to elect a per-object leader whenever there is a conflict). HBase is used as the storage engine; all the application data and metadata are stored in HBase. In order to reduce the number of storage accesses we also have a layer of in-memory caching. All critical information and updates are written through to the underlying HBase server to maintain strong consistency, but the cache still helps us avoid a large fraction of storage references. Our experiments show that most read operations are able to complete without ever going to the HBase layer.
We provide a simple and intuitive API in our MDCC implementation so that users can write their own applications using our MDCC engine. A simple transaction implemented using this API would look like this:
        TransactionFactory factory = new TransactionFactory();
        Transaction txn = factory.create();
        try {
            txn.begin();
            byte[] foo = txn.read("foo");
            txn.write("bar", "bar".getBytes());
            txn.commit();
        } catch (TransactionException e){
            reportError(e);
        } finally {
            factory.close();
        }
We also did some basic performance tests on our MDCC implementation using the YCSB benchmark. We used 5 EC2 micro instances distributed across 3 data centers (regions) and deployed a simple 2-shard MDCC cluster. Each shard consisted of 5 MDCC storage nodes (amounting to a total of 10 MDCC storage nodes). We ran several different types of workloads on this cluster and in general succeeded in achieving < 1ms latency for read operations and < 100ms latency for write operations. Our implementation performs best with mostly-read workloads, but even with a fairly large number of conflicts, the system delivers reasonable performance. 
Our system ensures correct and consistent transaction semantics. We have excellent support for atomic multi-row transactions, concurrent transactions and even some rudimentary support for crash recovery. If you are interested in giving this implementation a try, grab the source code from https://github.com/hiranya911/mdcc. Use Maven 3 to build the distribution, then extract and run it.

Monday, March 11, 2013

Starting HBase Server Programmatically

I'm implementing a database application these days, and for that I wanted to programmatically start and stop a standalone HBase server. More specifically, I wanted to make the HBase server a part of my application so that whenever my application starts, the HBase server starts up as well. This turned out to be more difficult than I thought it would be. To start an HBase server you actually need to start three things:
1. HBase master server
2. HBase region server
3. ZooKeeper
The default startup script shipped with the HBase binary distribution does all this for you. But I wanted a more tightly integrated, fully programmatic solution. Unfortunately the public HBase API doesn't seem to expose the functionality required for programmatically starting and stopping these components (at least not in a straightforward manner). So after going through the HBase source and trying out various things, I managed to come up with some code that does exactly what I want. At a high level, this is what my code does:
1. Create an instance of HQuorumPeer and execute it on a separate thread.
2. Create and initialize an HBaseConfiguration instance.
3. Create an instance of HMaster and execute it on a separate thread.
4. Create an instance of HRegionServer and execute it on a separate thread.
Both HMaster and HRegionServer implement the Runnable interface, so it's easy to run them on separate threads. I created a simple Java executor and scheduled HMaster and HRegionServer for execution on it. But HQuorumPeer was a bit tricky. This class only contains a main method and exposes nothing resembling a public API. So one solution is to create your own thread class which simply invokes the above-mentioned main method. The other option is to write your own HQuorumPeer class implementing the Runnable interface. The original HQuorumPeer class from the HBase project is fairly small and contains only a small amount of code, so I took the second approach: I simply copied the code from the original HQuorumPeer and created my own HQuorumPeer implementing the Runnable interface. Overall, this is what my finalized code looks like:
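        // Note: 'exec' is an ExecutorService created earlier by the application, and
        // 'properties' holds the ZooKeeper/HBase settings it loaded; 'master',
        // 'regionServer' and the futures are fields of the wrapping class (none of
        // which are shown in this snippet).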
        
        exec.submit(new HQuorumPeer(properties));
        log.info("HBase ZooKeeper server started");
        
        Configuration config = HBaseConfiguration.create();
        File hbaseDir = new File(hbasePath, "data");
        config.set(HConstants.HBASE_DIR, hbaseDir.getAbsolutePath());
        for (String key : properties.stringPropertyNames()) {
            if (key.startsWith("hbase.")) {
                config.set(key, properties.getProperty(key));
            } else {
                String name = HConstants.ZK_CFG_PROPERTY_PREFIX + key;
                config.set(name, properties.getProperty(key));
            }
        }

        try {
            master = new HMaster(config);
            regionServer = new HRegionServer(config);
            masterFuture = exec.submit(master);
            regionServerFuture = exec.submit(regionServer);
            log.info("HBase server is up and running...");
        } catch (Exception e) {
            handleException("Error while initializing HBase server", e);
        }
Then I nicely wrapped up all this logic into a single reusable util class called HBaseServer. So whenever I want to start/stop HBase in my application, this is all I have to do.
HBaseServer hbaseServer = new HBaseServer();
hbaseServer.start();
Hope somebody finds this useful :)

Tuesday, February 5, 2013

How the World's Fastest ESB was Made

A couple of years ago, at WSO2 we implemented a new HTTP transport for WSO2 ESB. Requirements for this new transport can be summarized as follows:
  1. Ultra-fast, low latency mediation of HTTP requests.
  2. Supporting a very large number of inbound (client-ESB) and outbound (ESB-server) connections concurrently (we were looking at several thousand concurrent connections).
  3. Automatic throttling and graceful performance degradation in the presence of slow or faulty clients and servers.
The default non-blocking HTTP (NHTTP) transport from Apache Synapse, which we were also using in WSO2 ESB, supported the above requirements to a certain extent, but we wanted to do better. The default transport was very generic, designed to offer reasonable performance in all the integration scenarios the ESB could potentially participate in. However, HTTP load balancing, HTTP URL routing (URL rewriting) and HTTP header-based routing are some of the most widely used integration patterns in the industry, and to support these use cases well we needed a specialized transport.
The old NHTTP transport was based on a dual-buffer model. Incoming message content was placed in a SharedInputBuffer and outgoing message content was placed in a SharedOutputBuffer. Apache Axiom, Apache Axis2 and the Synapse mediation engine sat between the two buffers, reading from the input buffer and writing to the output buffer. This architecture is illustrated in the following diagram.
The key advantage of this architecture is that it enables the ESB (mediators) to intercept all the messages and manipulate them in any way necessary. The main downside is that every message goes through the Axiom layer, which is not really necessary in cases like HTTP load balancing and HTTP header-based routing. Also, the overhead of moving data from one buffer to another was not always justifiable in this model. So when we started working on the new HTTP transport we wanted to get rid of these limitations. We knew that this might result in a not-so-generic HTTP transport, but we were willing to pay that price at the time.
So after some very interesting brainstorming sessions, an exciting week-long hackathon, and several months of testing, bug-fixing and refactoring, we came up with what's known today as the HTTP pass-through transport. This transport was based on a single-buffer model and completely bypassed the Axiom layer. The resulting architecture is illustrated below.
The HTTP pass-through transport was first released in June 2011 along with WSO2 ESB 4.0. Back then it was disabled by default, and the user had to enable it by uncommenting a few entries in the axis2.xml file. The performance numbers we were seeing with the new transport were simply remarkable. WSO2 also published some of these benchmarking results in a March 2012 article. However, at this point the two main limitations of the new transport were starting to give us headaches:
  1. Configuration overhead (users had to explicitly enable the transport depending on their target use cases)
  2. No support for integration scenarios that require HTTP content manipulation (because Axiom was bypassed, any mediator attempting to access the message payload would not get anything useful to work with)
In addition to these technical issues there were other process-related issues that we had to deal with. For instance, maintaining two separate HTTP transports was twice as much work for the developers and testers. We found that because the pass-through transport was not the default, it often lagged behind the default NHTTP transport in terms of features and stability. So after a few brainstorming sessions we decided to try to make the pass-through transport the default HTTP transport in Apache Synapse/WSO2 ESB. But this required making the content manipulation use cases (content-aware use cases) work with the new transport. This implied bringing Axiom back into the picture, the very thing we wanted to avoid in our initial implementation. So in order to balance our performance and heterogeneous integration requirements, we came up with the idea of "on-demand message parsing in the mediation engine".
In this new model, each mediator instance belongs to one of two classes.
  1. Content-unaware mediators – Mediators that never access the message content in any way (eg: drop mediator)
  2. Content-aware mediators – Mediators that always access the message content (eg: xslt mediator)
We also identified a third class known as conditionally content-aware mediators. These mediators could be either content-aware or content-unaware depending on their exact instance configuration. For example, a simple log mediator instance configured as <log/> is content-unaware, whereas a log mediator configured as <log level="full"/> is content-aware, since it's expected to log the message payload. Similarly, a simple property mediator instance such as <property name="foo" value="bar"/> is content-unaware, but <property name="foo" expression="/some/xpath"/> could be content-aware depending on what the XPath expression does. In order to capture this content-awareness characteristic of mediator instances at runtime, we introduced a new method (isContentAware) to the top-level Mediator interface of Synapse. The default implementation in the AbstractMediator class returns true so as to maintain backward compatibility.
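To make this concrete, here is a minimal sketch of a custom content-unaware mediator. The Mediator/AbstractMediator names and the isContentAware method are those described above; the mediator class itself and its logic are hypothetical.

import org.apache.synapse.MessageContext;
import org.apache.synapse.mediators.AbstractMediator;

// A hypothetical mediator that only touches message-context metadata, so it can
// declare itself content-unaware and let the pass-through transport skip
// building (parsing) the message body entirely.
public class MessageIdAuditMediator extends AbstractMediator {

    public boolean mediate(MessageContext synCtx) {
        // Only message-context properties are accessed; the payload is never
        // read, so no Axiom parsing is triggered.
        synCtx.setProperty("audit.messageId", synCtx.getMessageID());
        return true; // continue the mediation flow
    }

    @Override
    public boolean isContentAware() {
        return false; // overrides AbstractMediator's default of true
    }
}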
With this change in place we modified the mediation engine to check the content-awareness property of each mediator at runtime before submitting a message to it. List mediators such as the SequenceMediator run the check recursively on their child mediators to obtain the final value. Assuming that messages are always received through the pass-through HTTP transport, the mediation engine invokes a special message parsing routine whenever a mediator is detected to be content-aware. It is in this special routine that we bring Axiom into the picture. Therefore, if none of the mediators in a given flow or service is content-aware, the pass-through transport works as it usually does without ever engaging Axiom. But whenever a content-aware mediator is involved, we bring Axiom in. This way we can reap the performance benefits of the pass-through transport while supporting all integration scenarios of the ESB. Since we engage Axiom on demand, we get the best possible outcome for all scenarios. For instance, a simple pass-through proxy always works without any Axiom interactions. An XSLT proxy that transforms requests engages Axiom only in the request flow; the response flow operates without parsing the messages.
Another tricky problem we encountered was dealing with message parsing itself. For instance how do we parse a message and then send it out when there is only one buffer provided by the underlying pass-through transport? Ideally we need two buffers to read the incoming message from and write the outgoing message to. Also the fact that the Axis2 message builder framework can only handle streams posed a few problems. The buffer we maintained in the pass-through transport was a Java NIO ByteBuffer instance. So we needed to adapt the buffer into a stream implementation whenever the mediation engine engages Axiom. We solved the first problem by implementing our message builder routine to create a second output buffer whenever Axiom is dragged into the picture. The outgoing messages are serialized into this second buffer and the pass-through transport was modified to pick the outgoing content from the second buffer when it’s available. Writing an InputStream implementation that can wrap a ByteBuffer instance solved the second problem.
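The stream adapter mentioned for the second problem is mostly plain JDK code; a minimal sketch (not the actual Synapse class) could look like this:

import java.io.InputStream;
import java.nio.ByteBuffer;

// Adapts an NIO ByteBuffer to the InputStream API expected by the Axis2
// message builders.
public class ByteBufferInputStream extends InputStream {

    private final ByteBuffer buffer;

    public ByteBufferInputStream(ByteBuffer buffer) {
        this.buffer = buffer;
    }

    @Override
    public int read() {
        return buffer.hasRemaining() ? (buffer.get() & 0xFF) : -1;
    }

    @Override
    public int read(byte[] dest, int offset, int length) {
        if (length == 0) {
            return 0;
        }
        if (!buffer.hasRemaining()) {
            return -1; // end of the buffered message content
        }
        int count = Math.min(length, buffer.remaining());
        buffer.get(dest, offset, count);
        return count;
    }
}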
One last problem that needed to be solved was handling security. In Synapse/WSO2 ESB, security is handled by Apache Rampart, which runs as an Axis2 module that intercepts the messages before they hit the mediation engine. So on-demand parsing at the mediation engine doesn’t work in this scenario. We need to parse the messages before Rampart intercepts them. We solved this issue by introducing a new smart handler to the Axis2 handler chain, which intercepts every message and performs an early parse if security is engaged on the flow. The same solution can be extended to support other modules that require parsing message payload in the Axis2 handler chain.
The reason I decided to write this post is that the WSO2 folks just released WSO2 ESB 4.6, and this release is based on the new model I've described here. The pass-through transport is what users now get by default. The WSO2 team has also published some performance figures that clearly indicate what the new design is capable of. It turns out the latest release of WSO2 ESB outperforms all the major open source ESB vendors by a significant margin. This release also comes with a new XSLT mediator (Fast XSLT) that operates on top of the pass-through model of the underlying transport, and a new streaming XPath implementation based on Antlr.
The next step of this effort would be to get these improvements integrated into the Apache Synapse code base. This work is already underway and you can monitor its progress through SYNAPSE-913 and SYNAPSE-920.

Monday, January 28, 2013

Introducing AppsCake: Makes Deploying AppScale a Piece of Cake

One of my very first contributions to AppScale was a component named AppsCake. AppsCake is a dynamic web component which provides a web frontend for the command-line AppScale Tools. It enables users to deploy and start AppScale over several different types of infrastructure. This greatly reduces the overhead of starting and managing a PaaS, as most of the heavy-lifting operations can be performed with the click of a button. Users do not need to learn the AppScale Tools commands, nor do they have to be familiar with any command-line interface. With AppsCake, a regular web browser is all you need to initialize AppScale and start deploying applications in the cloud.
As of now AppsCake supports deploying AppScale over virtualized clusters (eg: Xen), Amazon EC2 and Eucalyptus. Users can select the environment in which AppScale should be deployed and provide the required credentials and other metadata for the target environment through the web interface. AppsCake takes care of invoking the proper command sequences with the appropriate arguments to initialize AppScale. The web frontend also allows the users to view deployment logs and monitor the deployment progress in near real-time. 
This component can be further extended and offered as a service of its own if needed. That way, users can access AppsCake through a well-known URL and set up an AppScale deployment remotely for the purpose of executing a specific task or application. As an example, consider a group of scientists who want to run various scientific computations in the cloud (say as MPI or MapReduce jobs). The group can use a private Eucalyptus cluster or a shared EC2 account as their computing infrastructure, and can be provided with a single well-known AppsCake instance as the entry point for AppScale. Then, whenever a member of the team wants to run a computation on the target shared environment, he or she can use the AppsCake service to initiate his or her own AppScale instance and run the required computation in the cloud. This scheme maximizes resource sharing while providing sufficient isolation between applications/jobs initiated by individual users.
AppsCake is implemented using Ruby and Sinatra. To try it out, simply check out the source from GitHub and execute the bin/debian_setup.sh script (the build script only supports Debian/Ubuntu environments as of now). Then execute bin/appscake to start the AppsCake web service. Now you can point your browser to https://localhost:28443 and start interacting with the service.
Chris has posted a neat little screencast that explains how to use AppsCake to deploy AppScale on Virtual Box. Don’t forget to check that out too.

Sunday, January 20, 2013

On Premise API Management for Services in the Cloud

In some of my recent posts I explained how to install and start AppScale. I showed how to use AppScale command-line tools to manage an AppScale PaaS on virtualized environments such as Xen and IaaS environments such as EC2 and Eucalyptus. Then we also looked at how to deploy Google App Engine (GAE) apps over AppScale. In this post we are going to try something different.
Here I'm going to describe a possible hybrid architecture for deploying RESTful services in the cloud and exposing those services through an on-premise API management platform. This type of architecture is most suitable for B2B integration scenarios where one organization provides a range of services and several other organizations consume them with their own custom use cases and SLAs. Both service providers and service consumers can greatly benefit from the proposed hybrid architecture. It enables API providers to reap the benefits of the cloud with reduced deployment cost, reduced long-term maintenance overhead and reduced time-to-market. API consumers can use their own on-premise API management platform as a local proxy, which provides powerful access control, rate control, analytics and community features on top of the services already deployed in the cloud.
To try this out, first spin up an AppScale PaaS in a desired cloud environment. You can refer to my previous posts or go through the AppScale wiki to learn how to do this. Then we can deploy a simple RESTful web service in our AppScale cloud. Here I'm posting the source code for a simple web service called "starbucks" written in Python using the GAE APIs. The "starbucks" service can be used to submit and manage simple drink orders. It uses the GAE datastore API to store all the application data and exposes all the fundamental CRUD operations as REST calls (Create = POST, Read = GET, Update = PUT, Delete = DELETE).
try:
  import json
except ImportError:
  import simplejson as json

import random
import uuid
from google.appengine.ext import db, webapp
from google.appengine.ext.webapp.util import run_wsgi_app

PRICE_CHART = {}

class Order(db.Model):
  order_id = db.StringProperty(required=True)
  drink = db.StringProperty(required=True)
  additions = db.StringListProperty()
  cost = db.FloatProperty()

def get_price(order):
  if PRICE_CHART.has_key(order.drink):
    price = PRICE_CHART[order.drink]
  else:
    price = random.randint(2, 6) - 0.01
    PRICE_CHART[order.drink] = price
  if order.additions is not None:
    price += 0.50 * len(order.additions)
  return price

def send_json_response(response, payload, status=200):
  response.headers['Content-Type'] = 'application/json'
  response.set_status(status)
  if isinstance(payload, Order):
    payload = {
      'id' : payload.order_id,
      'drink' : payload.drink,
      'cost' : payload.cost,
      'additions' : payload.additions
    }
  response.out.write(json.dumps(payload))

class OrderSubmissionHandler(webapp.RequestHandler):
  def post(self):
    order_info = json.loads(self.request.body)
    order_id = str(uuid.uuid1())
    drink = order_info['drink']
    order = Order(order_id=order_id, drink=drink, key_name=order_id)
    if order_info.has_key('additions'):
      additions = order_info['additions']
      if isinstance(additions, list):
        order.additions = additions
      else:
        order.additions = [ additions ]
    else:
      order.additions = []
    order.cost = get_price(order)
    order.put()
    self.response.headers['Location'] = self.request.url + '/' + order_id
    send_json_response(self.response, order, 201)

class OrderManagementHandler(webapp.RequestHandler):
    def get(self, order_id):
      order = Order.get_by_key_name(order_id)
      if order is not None:
        send_json_response(self.response, order)
      else:
        self.send_order_not_found(order_id)

    def put(self, order_id):
      order = Order.get_by_key_name(order_id)
      if order is not None:
        order_info = json.loads(self.request.body)
        drink = order_info['drink']
        order.drink = drink
        if order_info.has_key('additions'):
          additions = order_info['additions']
          if isinstance(additions, list):
            order.additions = additions
          else:
            order.additions = [ additions ]
        else:
          order.additions = []
        order.cost = get_price(order)
        order.put()
        send_json_response(self.response, order)
      else:
        self.send_order_not_found(order_id)

    def delete(self, order_id):
      order = Order.get_by_key_name(order_id)
      if order is not None:
        order.delete()
        send_json_response(self.response, order)
      else:
        self.send_order_not_found(order_id)

    def send_order_not_found(self, order_id):
      info = {
        'error' : 'Not Found',
        'message' : 'No order exists by the ID: %s' % order_id,
      }
      send_json_response(self.response, info, 404)

app = webapp.WSGIApplication([
    ('/order', OrderSubmissionHandler),
    ('/order/(.*)', OrderManagementHandler)
], debug=True)

if __name__ == '__main__':
  run_wsgi_app(app)
Before we go any further, let's take a few seconds to appreciate how simple and concise this piece of code is. With just about 100 lines of Python code we have developed a comprehensive webapp which uses JSON as the data exchange format, does database access and provides decent error handling. Imagine doing the same thing in a language like Java in a traditional servlet container environment. We would have to write a lot more code and also bundle a ridiculous number of additional dependencies to parse and construct JSON and perform database queries. But as seen here, the GAE APIs make it absolutely trivial to develop powerful web APIs for the cloud with a minimal amount of code.
You can download the complete "starbucks" application from here. Simply extract the downloaded tarball and you're good to go. The webapp consists of just two files: main.py contains all the source code of the app, and app.yaml is the GAE webapp descriptor. No additional libraries or files are needed to make this work. Use AppScale-Tools to deploy the app in your AppScale cloud:
appscale-upload-app --file /path/to/starbucks --keyname my_key_name
To try out the app, put the following JSON string into a file named order.json:
{
  "drink" : "Caramel Frapaccino",
  "additions" : [ "Whip Cream" ]
}
Now execute the following curl request against your app:
curl -v -d @order.json -H "Content-type: application/json" http://host:port/order
Replace 'host' and 'port' with the appropriate values for your AppScale PaaS. This request should return an HTTP 201 Created response with a Location header.
And now for the API management part. For this I’m going to use the open source API management solution from WSO2, a project that I was a part of a while ago. Download the latest WSO2 API Manager and install it on your local computer by extracting the zip archive. Go into the bin directory and execute wso2server.sh (or wso2server.bat for Windows) to start the API Manager. You need to have JDK 1.6 or higher installed to be able to do this.
Once the server is up and running, navigate to http://localhost:9763/publisher and sign in to the console using "admin" as both the username and the password. Go ahead and create an API for our "starbucks" service in the cloud. You can use http://host:port as the service URL, where 'host' and 'port' should point to the AppScale PaaS. The API creation process should be pretty straightforward. If you need any help, you can refer to my past blog posts on the WSO2 API Manager or go through the WSO2 documentation. Once the API is created and published, head over to the API Store at http://localhost:9763/store.
Now you can sign up at the API Store as an API consumer, generate an API key for the Starbucks API and start using it.
Submit Order:
curl -v -d @order.json -H "Content-type: application/json" -H "Authorization: Bearer api_key" http://localhost:8280/starbucks/1.0.0/order
Review Order:
curl -v -H "Authorization: Bearer api_key" http://localhost:8280/starbucks/1.0.0/order/order_id
Delete Order:
curl -v -X DELETE -H "Authorization: Bearer api_key" http://localhost:8280/starbucks/1.0.0/order/order_id
Replace 'api_key' with the API key generated by the API Store. Replace the 'order_id' with the unique identifier sent in the response for the submit order request.
There you have it: on-premise API management for services in the cloud. This looks pretty simple at first glance, but it is actually quite a powerful architecture. Note that all the critical components (service runtime, registry and consumer) are very well separated from each other, which allows maximum flexibility. The portions in the cloud can benefit from cloud-specific features such as autoscaling to deliver maximum throughput with optimal resource utilization. Since the API management platform is controlled by the individual consumer organizations, they can easily enforce their own custom policies and SLAs, and optimize for their common access patterns.

Wednesday, January 9, 2013

How to Get Your Third Party APIs to Shut Up?

When programming with third-party libraries, we sometimes need to suppress or redirect the standard output they generate. A very common scenario is a third-party library that produces very verbose output, cluttering up the output of our own program. With most programming languages we can write a simple suppress/redirect procedure to fix this problem. Such functions are sometimes colloquially known as STFU functions. Here I'm describing a couple of STFU functions I implemented in some of my recent work.

1. AppsCake (Web interface for AppScale-Tools)
This is a Ruby-based dynamic web component which uses some of the core AppScale-Tools libraries. For this project I wanted to capture the standard output of the AppScale-Tools libraries and display it on a web page. As the first step, I wanted to redirect the standard output of AppScale-Tools to a separate text file. Here's what I did.
def redirect_standard_io(timestamp)
  begin
    orig_stderr = $stderr.clone
    orig_stdout = $stdout.clone
    log_path = File.join(File.expand_path(File.dirname(__FILE__)), "..", "logs")
    $stderr.reopen File.new(File.join(log_path, "deploy-#{timestamp}.log"), "w")
    $stderr.sync = true
    $stdout.reopen File.new(File.join(log_path, "deploy-#{timestamp}.log"), "w")
    $stdout.sync = true
    retval = yield
  rescue Exception => e
    puts "[__ERROR__] Runtime error in deployment process: #{e.message}"
    $stdout.reopen orig_stdout
    $stderr.reopen orig_stderr
    raise e
  ensure
    $stdout.reopen orig_stdout
    $stderr.reopen orig_stderr
  end
  retval
end
Now whenever I want to redirect the standard output and invoke the AppScale-Tools API I can do this.
redirect_standard_io(timestamp) do
   # Call AppScale-Tools API
end
2. Hawkeye (API fidelity test suite for AppScale)
This is a Python-based framework which makes a lot of RESTful invocations using the standard Python httplib API. I wanted to trace the HTTP requests and responses exchanged during the execution of the framework and log them to a separate log file. Python httplib has a verbose mode, which can be enabled by setting a special flag on the HTTPConnection class, and it turns out this mode logs almost all the information I need. Unfortunately, it writes all this information to the standard output of the program, thus messing up the output I wanted to present to users. Therefore I needed a way to redirect the standard output for all httplib API calls. Here's how that problem was solved.
http_log = open('logs/http.log', 'a')
original = sys.stdout
sys.stdout = http_log
try:
  # Invoke httplib
finally:
  sys.stdout = original
  http_log.close()
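# Illustrative only: one way to switch httplib into verbose mode is per
# connection, e.g.
#
#   conn = httplib.HTTPConnection('example.com')   # hypothetical host
#   conn.set_debuglevel(1)
#   conn.request('GET', '/')
#   response = conn.getresponse()
#
# set_debuglevel() prints its trace to stdout, so the redirection above
# captures it in logs/http.log.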

Tuesday, January 8, 2013

Evolution of Networked Computing

  • 1969: As of now, computer networks are still in their infancy, but as they grow up and become sophisticated, we will probably see the spread of 'computer utilities', which like present electric and telephone utilities, will service individual homes and offices across the country. - Leonard Kleinrock, UCLA
  • 1984: The network is the computer. - John Gage, Sun Microsystems
  • 2008: The data center is the computer. - David Patterson, UC Berkeley
  • 2008: Cloud is the computer. - Rajkumar Buyya, Melbourne University
From the book "Distributed and Cloud Computing" by Kai Hwang et al.

Thursday, January 3, 2013

The Era of Webapps is Over - Say Hello to Web APIs


I remember a time when developing a web application (webapp) was considered one of the coolest and most exciting feats a software developer could perform. It was not so long ago that a software product with no web application component was considered uncool, unpresentable and unmarketable. A wide range of dynamic web programming technologies and standards were born in this era, and they continued to thrive thanks to the unprecedented growth and evolution of the Internet. However, if we take a closer look at how the Internet is used today in our day-to-day lives and in the business world, one thing becomes hauntingly obvious: the era of webapps has passed. Now is the era of web APIs.
First of all, let's take a closer look at webapps. Webapps are designed to be consumed directly by human users. The user knows the URL through which the application can be accessed and enters it into a web browser. The web browser then interacts with the remote URL by making HTTP GET requests to pull content and HTTP POST requests to submit content. HTML is used as the primary data exchange format, and how the content appears on the user's web browser (style and formatting) is actually a huge deal. All the important information that should be communicated to the user must be embedded in the HTML payload, as that's the only thing that gets rendered on the screen by the web browser. Things like HTTP status codes and headers do not play a big role in webapps, except perhaps when reporting an error (404 Not Found, 500 Internal Server Error etc.) or performing a redirect (302 Found + Location header).
But webapps are becoming less and less interesting to both developers and users. Today it's all about web APIs. Interestingly, most of the technologies used to develop webapps can also be used to create and expose web APIs. In fact, it's not wrong to say that web APIs are the next generation of webapps, and that webapp development frameworks mutated into web API development frameworks as a consequence of the natural evolution of the web.
So how do web APIs differ from webapps? And what makes them cooler? Unlike webapps, web APIs are not designed to be consumed directly by human users. They are APIs, meaning they are designed to be consumed by other applications. Developers can use web APIs to construct other high-level APIs and end-user applications. The end user may or may not know the exact location or URL with which the application is interacting. Web APIs may use any content exchange format, but JSON and XML are the most popular choices. A properly designed web API would use most of the available HTTP methods (at the very least GET, POST, PUT, DELETE and OPTIONS) combined with the proper use of HTTP status codes and headers to pass critical control information. Things like layout and formatting don't mean anything in the web API world, but effective use of URL patterns, intuitiveness and simplicity of the APIs mean everything.
The end-user applications that are developed using web APIs can include desktop applications and, more interestingly, mobile applications. In my opinion this is the last nail in the coffin of the webapp. Simply put, people are no longer browsing the web; rather, they use mobile apps. Recent reports and Internet usage surveys show that mobile Internet traffic is starting to surpass desktop Internet traffic (see the references section). This means webapps are increasingly becoming obsolete. To point out a real-world example, let's take GitHub. GitHub is a pretty cool webapp, one of the best in my personal opinion. We can consume this app in a desktop environment using a traditional web browser such as Firefox. We can do the same with a smartphone, using the mini web browser the device is equipped with (eg: Safari for iPhones). But that's not good enough for us. We need a native GitHub app for our smartphones. And as a result we now have the GitHub app for the iPhone and Android platforms. Soon these mobile apps will become dominant traffic sources for GitHub. This has already happened to many of the popular social networking service providers such as Twitter and Facebook. The web API is a more critical component in these systems than their webapp counterparts.
This transition from webapps to web APIs is a crucial one for technology-driven companies. They have to carefully assess this trend and adjust their game plans accordingly. Many organizations have already realized the shift towards web APIs and started to expose their business functionality as web APIs rather than as webapps. Web APIs open businesses up to a larger and more diverse clientele, with ample future-proofing and more room for change. This massive push towards APIs has also given birth to the field of API management, which has become a business of its own and is starting to produce many lucrative business opportunities around the world. The growing number of on-premise and cloud API management solutions available (Layer7, Apigee, Mashery etc.) is a testament to this fact.
Perhaps the most impacted by this change are the developers. They need to take a whole different stance in the way they think about software solutions that run over the web. They should learn to implement systems for other developers rather than for end users, and to optimize systems for machine-to-machine interaction rather than for human-computer interaction. They need to pay a lot of attention to little things like using proper HTTP status codes, using proper HTTP headers, and adhering to open standards. Problems they used to wrestle with, such as improving the browser compatibility of webapps and session management, are becoming less important by the day. They will have to learn to pay less attention to things like form-based authentication and adopt more HTTP-friendly security mechanisms such as BasicAuth and OAuth. API management is going to become a mainstream technology, and developers will have to start treating it as a primary development tool, just like version control and issue tracking. Technology platform providers will also have to take this shift seriously. Developers no longer need webapp development frameworks; they need web API development frameworks. This is why platforms like Google App Engine have become so successful: their staying power resides in their ability to facilitate the development of powerful web APIs with a very small amount of code.
So does all this mean traditional webapps are dead (as in dead for good)? Well, not really. The need for webapps will always be there, at least for the foreseeable future. But we will see much more dominant deployment and adoption of web APIs compared to webapps. Webapps will become the COBOL of the 21st century: thousands of lines of code are written every year, but nobody gives a damn. Newly implemented webapps will sit on a layer of web APIs, making the webapp just one of many frontends of a larger system.
All in all, if you’re a developer, some very exciting times are ahead of you. So buckle up!

References