Thursday, January 2, 2014

Calling WSO2 Admin Services in Python

I’m using some WSO2 middleware for my ongoing research, and recently I had the requirement of calling some admin services from Python 2.7. All WSO2 products expose a number of special administrative web services (admin services), using which the WSO2 server instances can be controlled, configured and monitored. In fact, all the web-based UI components that ship with WSO2 middleware make use of these admin services under the hood to manage the server runtime.
WSO2 admin services are SOAP services (based on Apache Axis2), and are secured using HTTP basic authentication. All admin services expose a WSDL document using which client applications can be written or generated to consume the admin services. In this post I’m going to summarize how to implement a simple Python client to consume the WSO2 admin services.
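For reference, HTTP basic authentication boils down to a single request header. The following is a minimal sketch of how that header value is put together, assuming the default admin/admin credentials used later in this post. The helper name basic_auth_header is made up for illustration and is not part of any WSO2 or Suds API.

```python
import base64


def basic_auth_header(username, password):
    """Build the value of an HTTP Basic 'Authorization' header."""
    # Credentials are joined with a colon and then base64-encoded
    credentials = '{0}:{1}'.format(username, password).encode('utf-8')
    return 'Basic ' + base64.b64encode(credentials).decode('ascii')


if __name__ == '__main__':
    # Hypothetical usage with the default credentials
    print(basic_auth_header('admin', 'admin'))
```

Keep in mind that base64 is an encoding, not encryption, which is one reason the admin services are served over HTTPS.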
We will be writing our Python client using the Suds SOAP library for Python. Suds is simple, lightweight and extremely easy to use. As the first step, we should install Suds. Depending on the Python package manager you wish to use, one of the following commands should do the trick (tested on OS X and Ubuntu):
sudo easy_install suds
sudo pip install suds
Next we need to instruct the target WSO2 server product to expose the admin service WSDLs. By default these WSDLs are hidden. To unhide them, open up the repository/conf/carbon.xml file of the WSO2 product, and set the value of the HideAdminServiceWSDLs parameter to false:
<HideAdminServiceWSDLs>false</HideAdminServiceWSDLs>
Now restart the WSO2 server, and you should be able to access the admin service WSDLs using a web browser. For example, to access the WSDL of the UserAdmin service, point your browser to https://localhost:9443/services/UserAdmin?wsdl
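Once a WSDL is visible, it can be handy to see which operations it declares before writing any client code. Here is a rough sketch that lists operation names using only the standard library. The SAMPLE_WSDL string is a heavily trimmed, hypothetical fragment written for this example (a real admin service WSDL is much larger), and list_operations is an illustrative helper name.

```python
import xml.etree.ElementTree as ET

# Namespace used by WSDL 1.1 documents
WSDL_NS = '{http://schemas.xmlsoap.org/wsdl/}'

# Hypothetical, heavily trimmed WSDL used only for illustration
SAMPLE_WSDL = """<wsdl:definitions xmlns:wsdl="http://schemas.xmlsoap.org/wsdl/">
  <wsdl:portType name="UserAdminPortType">
    <wsdl:operation name="getAllRolesNames"/>
    <wsdl:operation name="listUsers"/>
  </wsdl:portType>
</wsdl:definitions>"""


def list_operations(wsdl_text):
    """Return the operation names declared in the WSDL port types."""
    root = ET.fromstring(wsdl_text)
    names = []
    for port_type in root.iter(WSDL_NS + 'portType'):
        for operation in port_type.findall(WSDL_NS + 'operation'):
            names.append(operation.get('name'))
    return names


if __name__ == '__main__':
    print(list_operations(SAMPLE_WSDL))
```

In practice the WSDL text would come from the server rather than from a string constant.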
Now we can go ahead and write the Python code to consume any of the available admin services. Here’s a working sample that consumes the UserAdmin service. This simply prints out a list of roles defined in the WSO2 User Management component:
from suds.client import Client
from suds.transport.http import HttpAuthenticated
import logging

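if __name__ == '__main__':
    #logging.basicConfig(level=logging.INFO)
    #logging.getLogger('suds.client').setLevel(logging.DEBUG)

    # Admin services are secured with HTTP basic authentication
    t = HttpAuthenticated(username='admin', password='admin')
    client = Client('https://localhost:9443/services/UserAdmin?wsdl', location='https://localhost:9443/services/UserAdmin', transport=t)
    print client.service.getAllRolesNames()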
That’s pretty much it. I have tested this approach with several WSO2 admin services, and they all seem to work without any issues. If you need to debug something, uncomment the two commented out lines in the above example. That will print all the SOAP messages and the HTTP headers that are being exchanged.
I also tried to write a client using the popular SOAPy library, but unfortunately couldn’t get it to work due to several bugs in SOAPy. SOAPy was incapable of retrieving the admin service WSDLs over HTTPS. This can be worked around by using the HTTP URL for the WSDL, but in that case SOAPy failed to generate the correct request messages to call the admin services. Basically, the namespaces of the generated SOAP messages were messed up. But with Suds I didn’t run into any issues.

Friday, July 26, 2013

Avoiding the Risks of Cloud

It's no secret that cloud computing has transformed the way enterprises do business. It has changed the way developers write software and users interact with applications. By now, almost every business organization has a strategy on how to adopt the cloud. Those who don’t will soon be extinct. The influence of the cloud has been so phenomenal, that it truly has turned into a "take it or die" kind of a deal over the last few years.
It is also no secret that today the cloud movement is steered by a handful of giants in the IT industry. Companies like Amazon, Google, Microsoft and Salesforce are clearly among this elite group. These companies, their products and vision have been instrumental in the introduction, evolution and the popularization of the cloud technology. 
With that being the case, we must think about the implications of cloud computing on the current IT landscape of the world. Are all small and medium-sized organizations around the world going to get rid of their server racks and transfer their IT infrastructure to Amazon EC2? Are all Web applications and mobile applications going to be based on Google App Engine APIs? Is all enterprise data going to end up in Amazon S3 and Google Megastore? What sort of defenses are in place to prevent a few IT giants from monopolizing the entire IT infrastructure and services market? How easy would it be for us to migrate from one cloud vendor to another? All these are indeed very real and very important problems that all organizations should take under careful consideration.
Fortunately there are several practical solutions to all the above issues. One is openness and standardization. Cloud platforms that are based on open standards and protocols should be preferred over those that use proprietary standards and protocols. Open standards and protocols are likely to be supported by more than just one cloud vendor, thus enabling the users to migrate between different vendors easily. Also, in many cases open standards make it easier to port existing standalone applications to the cloud. Take a Java web application as an example. Most Java web applications are based on the J2EE suite of standards (JSP, Servlets, JDBC etc.). If the target cloud platform also supports these open standards, the user can easily migrate his J2EE app to the cloud without having to make too many changes. Similarly he can easily migrate the app from one cloud platform to another as long as both platforms support the same J2EE standards.
Speaking of openness, cloud platforms that are open source and distributed under liberal licenses should get extra credit over closed source ones. Open source cloud platforms allow the user to modify and shape the platform according to the user requirements, rather than forcing the user to change their apps according to the changes made by the cloud platform vendor. Also, with an open source cloud framework, users will be in a position to maintain and support the platform on their own, in a situation where the original vendor decides to discontinue support for the platform.
Another possible solution is to use a hybrid cloud approach instead of solely relying on a remote public cloud maintained by a third party vendor. A hybrid cloud approach typically involves a private cloud maintained by the user, and then selectively bursting into the public cloud to handle high availability and high scalability scenarios. This method does involve some additional expenses and legwork on the user's part but the user ultimately remains in control of his data and applications, and no third party vendor can take that away from the user. Also as far as most small and medium-sized organizations are concerned, what they expect from the cloud are features like multi-tenancy, self-provisioning, optimal resource utilization and auto-scaling. Spending a few bucks on running a server rack or two to make that happen is usually not a big deal. Most companies do that today anyway. However, from a technical standpoint, we need easy-to-deploy, easy-to-maintain and reliable private cloud frameworks, which are compatible with popular public cloud platforms to really take advantage of this hybrid cloud model. Fortunately, thanks to some excellent work by a few start-ups like Eucalyptus and AppScale, this is no longer an issue. These vendors provide highly competitive private cloud and hybrid cloud solutions that are fully compatible with widely used public cloud platforms such as AWS and Google App Engine. If the user is capable of procuring the necessary hardware resources and manpower, these cloud platforms can even be used to set up fully-fledged private clouds that have all the bells and whistles of popular public clouds. That’s a great way to bask in the glory of the cloud, while maintaining full ownership and control over your enterprise IT assets.
Software frameworks like Apache JClouds provide another approach for dealing with potential risks of the cloud. These software frameworks allow users' code to interact with multiple heterogeneous cloud platforms by abstracting out the differences between various clouds. If we consider JClouds, as of now it supports close to 30 different cloud platforms including AWS, OpenStack and Rackspace. This implies that any application written using JClouds can be executed on around 30 different cloud platforms without having to make any code changes. As the influence of the cloud continues to grow, developers should seriously consider writing their code using high-level APIs like JClouds, without getting tied into a single specific cloud platform.
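To make the idea of abstracting out provider differences a bit more concrete, here is a small sketch in Python. It is not the JClouds API (JClouds is a Java library). Every class and function name below is invented for illustration, and the two "providers" are just in-memory stand-ins with different internal layouts.

```python
class BlobStore(object):
    """Provider-neutral interface that application code programs against."""

    def put(self, key, data):
        raise NotImplementedError

    def get(self, key):
        raise NotImplementedError


class DictBackedStore(BlobStore):
    """Hypothetical stand-in for one cloud provider."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data):
        self._objects[key] = data

    def get(self, key):
        return self._objects[key]


class BucketedStore(BlobStore):
    """Hypothetical stand-in for a second provider with a different layout."""

    def __init__(self, bucket):
        self._bucket = bucket
        self._rows = []

    def put(self, key, data):
        # Drop any older version of the object, then append the new one
        row_key = (self._bucket, key)
        self._rows = [row for row in self._rows if row[0] != row_key]
        self._rows.append((row_key, data))

    def get(self, key):
        for row_key, data in self._rows:
            if row_key == (self._bucket, key):
                return data
        raise KeyError(key)


def archive_report(store, name, text):
    """Application code: depends only on the BlobStore interface."""
    store.put('reports/' + name, text)
    return store.get('reports/' + name)


if __name__ == '__main__':
    for store in (DictBackedStore(), BucketedStore('my-bucket')):
        print(archive_report(store, 'q1', 'all good'))
```

The application function never mentions a concrete provider, so switching clouds means swapping the object passed in rather than rewriting the application.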
Cloud has certainly changed the way we all think about IT and computing. While its benefits are quite attractive, it also comes with a few potential risks. Users and developers should think carefully, plan ahead and take preventive action soon to avoid these pitfalls.

Friday, June 21, 2013

White House API Standards, DX and UX

The White House recently published some standards for developing web APIs. While going through the documentation, I came across a new term - DX. DX stands for developer experience. As anybody would understand, providing a good developer experience is the key to the success of a web API. Developers love to program with clean, intuitive APIs. On the other hand clunky, non-intuitive APIs are difficult to program with and usually are full of nasty surprises that make the developer's life hard. Therefore DX is perhaps the single most important factor when it comes to differentiating a good API from a not-so-good API.
The term DX reminds me of another similar term - UX. As you would guess, UX stands for user experience. A few years ago UX was one of the most exciting topics in the IT industry. For a moment there everybody was talking and writing about UX and how websites and applications should be developed with UX best practices in mind. It seems with the rise of web APIs, cloud and mobile apps, DX is starting to generate a similar buzz. In fact I think for a wide range of application development, PaaS, web and middleware products DX would be way more important than UX. Stephen O'Grady was so right. Developers are the new kingmakers.

Wednesday, June 19, 2013

Is Subversion Going to Make a Come Back?

The Apache Software Foundation (ASF) announced the release of Subversion 1.8 yesterday. As I read the release notes, I started wondering how Subversion is still alive. The ASF uses Subversion heavily for pretty much everything. In fact the source code of Subversion is also managed using a Subversion repository. But outside the ASF I've seen a strong push towards switching from Subversion to Git. Most startups and research groups that I know of have been using Git from day one. WSO2, the company I used to work for, is in the process of moving their code to Git. Being an Apache committer I obviously have to use Subversion regularly. But about a year ago I started using Git (GitHub to be exact) for my other development activities, and I absolutely adore it. It scales well for large code bases and large development teams, and it makes common tasks such as merging, reverting, reviewing other people's work and branching so much easier and more intuitive.
But as it turns out Subversion is still the world's most widely used source version control system. As declared in the official blog post rolled out by the ASF yesterday, a number of tech giants including WordPress heavily use Subversion. According to Ohloh, the percentage of open source projects that use Subversion is around 53%, compared to the 29% that use Git. Looks like Subversion has managed to capture quite a share of the market, making it a very hard-to-kill technology. It will be interesting to see how the competition between Subversion and Git unfolds in the future. It seems the new release comes with a bunch of new features, which indicates that the project is very much alive and kicking and the Subversion community is not even close to giving up on the project.

Friday, June 14, 2013

More Reasons to Love Python - A Lesson on KISS

Recently I've been doing some work in the area of programming language design. At one point I wanted to define a Python subset which allows only the simplest Python statements without loops, conditionals, functions, classes and a bunch of other high-level constructs. So I looked into the grammar specification of the Python language and I was astonished by its simplicity and succinctness. Take a look at the Grammar file in the Python source distribution and see for yourself. It's no longer than 125 lines of text, and the whole thing can be printed on one side of an A4 sheet. This is definitely one of those instances where the best design is also the simplest design. No wonder everybody loves Python.
However that's not the whole point. Having selected a suitable Python subset, I was looking into ways of implementing a simple parser for those grammar rules. I've done some work with JavaCC in the past, so I straightaway jumped into implementing a Java-based parser for the selected Python subset using JavaCC. After a few hours of coding I managed to get it working too. The next step of my project required me to do some analysis on the abstract syntax tree (AST) produced by the parser. I was looking around for some existing work that fits my requirements, and I came across Python's native ast module. I immediately realized that all those hours I spent on implementing the JavaCC-based parser were a complete waste. The ast module provides excellent support for parsing Python code and constructing ASTs. This is all you have to do to parse some Python code using the ast module and obtain an AST representation of the code.
import ast

# The variable 'source' contains the Python statement to be parsed
source = 'x = y + z'
tree = ast.parse(source)
The ast module supports several modes. The default mode is exec, which supports parsing a sequence of Python statements. The module also supports a special eval mode, which parses a single Python expression. It turned out the eval mode supports more or less the same exact Python subset I wanted to use. So I threw away my JavaCC-based parser and wrote the following snippet of Python code to get my job done.
import ast

# The variable 'source' contains the Python expression to be parsed
# (eval mode rejects statements such as assignments)
source = 'y + z'
tree = ast.parse(source, mode='eval')
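One detail worth checking for yourself is what the eval mode accepts. The short sketch below shows that it parses a single expression into an Expression node and raises a SyntaxError for a statement such as an assignment. The helper name parses_in_eval_mode is made up for this example.

```python
import ast


def parses_in_eval_mode(source):
    """Return True if 'source' is accepted by ast.parse in eval mode."""
    try:
        ast.parse(source, mode='eval')
        return True
    except SyntaxError:
        return False


expression_tree = ast.parse('y + z', mode='eval')
print(type(expression_tree).__name__)    # Expression
print(parses_in_eval_mode('y + z'))      # True
print(parses_in_eval_mode('x = y + z'))  # False, assignment is a statement
```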
Now when it came to analyzing the AST produced by the parser, the ast module again turned out to be useful. The module provides two helper classes, namely NodeVisitor and NodeTransformer which can be used to either traverse or transform a given Python AST. To use these helper classes, we just need to extend them and implement the appropriate visit methods. There's a unique top level visit method and one visit_ method per AST node type (e.g. visit_Str, visit_Num, visit_BoolOp etc.). Here's an example NodeVisitor implementation, that flattens a given Python AST into a list.
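class NodeEnumerator(ast.NodeVisitor):
  def get_node_list(self, tree):
    self.nodes = []
    self.visit(tree)
    return self.nodes

  def visit(self, node):
    # Record this node, then recurse into its children
    self.nodes.append(node)
    self.generic_visit(node)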
These helper classes can be used to do virtually anything with a given AST. If you want you can even implement a Python interpreter in Python using this approach. In my case I'm running some search and isomorphism detection algorithms on the Python ASTs.
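NodeTransformer works along the same lines, except that each visit_ method returns the node that should replace the one it was given. As a small illustration (the NameRenamer class and the rename mapping are made up for this example), the following sketch renames variables in a parsed statement.

```python
import ast


class NameRenamer(ast.NodeTransformer):
    """Rename variables according to a {old_name: new_name} mapping."""

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):
        # Return a replacement Name node, keeping the original position info
        new_id = self.mapping.get(node.id, node.id)
        return ast.copy_location(ast.Name(id=new_id, ctx=node.ctx), node)


def variable_names(tree):
    """Collect the distinct variable names that appear in the tree."""
    return sorted(set(n.id for n in ast.walk(tree) if isinstance(n, ast.Name)))


tree = ast.parse('x = y + z')
tree = NameRenamer({'y': 'a'}).visit(tree)
print(variable_names(tree))  # ['a', 'x', 'z']
```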
So once again I've been pleasantly surprised and deeply impressed by the simplicity and richness of Python. It looks like the designers of Python have thought of everything. Kudos to Python aside, this whole experience taught me to always look for existing, simple solutions before doing it in my own complicated way. It actually reminds me of the good old KISS principle - "Keep It Simple, Stupid".

Friday, April 5, 2013

MDCC - Strong Consistency with Performance

A few weeks back a couple of my colleagues and I finished developing a complete implementation of the MDCC (Multi-Data Center Consistency) protocol. MDCC is a fast commit protocol proposed by UC Berkeley for large-scale geo-replicated databases. The main advantage of MDCC is that it supports strong consistency for data while providing transaction performance similar to eventually consistent systems.
With traditional distributed commit protocols, supporting strong consistency usually requires executing complex distributed consensus algorithms (e.g. Paxos). Such algorithms generally require multiple rounds of communication. Therefore when deployed in a multi-data center setting where the inter-data center latency is close to 100ms, the performance of the transactions being executed degrades to almost unacceptable levels. For this reason most replicated database systems and cloud data stores have opted to support a weaker notion of consistency. This greatly speeds up the transactions but you always run the risk of data becoming inconsistent or even lost.
MDCC employs a special variant of Paxos called Fast Paxos. Fast Paxos takes a rather optimistic approach by which it is able to commit most transactions within a single network roundtrip. This way a data object update can be replicated to any number of data centers within a single request-response window. The protocol is also effectively masterless which means if the application is executing in a data center in Europe, it does not have to contact a special master server which could potentially reside in a data center in USA. The only time this protocol doesn't finish within a single request-response window is when two or more transactions attempt to update the same data object (transaction conflict). In that case a per-object master is elected and the Classic Paxos protocol is invoked to resolve the conflict. If the possibility of a conflict is small, MDCC will commit most transactions within a single network roundtrip thus greatly improving the transaction throughput and latency. 
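The fast path versus fallback decision can be sketched in a few lines. The quorum sizes below are a commonly cited choice for Fast Paxos (a simple majority for classic rounds, and the ceiling of 3N/4 for fast rounds). They are an assumption on my part rather than something taken from our implementation, and the function names are made up for illustration.

```python
def classic_quorum(n):
    """Simple majority of n replicas."""
    return n // 2 + 1


def fast_quorum(n):
    """Ceiling of 3n/4, a commonly cited Fast Paxos fast quorum size."""
    return (3 * n + 3) // 4


def commit_path(acceptances, n):
    """Decide how an update proceeds given how many replicas accepted it.

    'fast' means the update is decided within a single round trip.
    'classic' means falling back to a classic round led by a master.
    """
    if acceptances >= fast_quorum(n):
        return 'fast'
    return 'classic'


if __name__ == '__main__':
    replicas = 5
    print(classic_quorum(replicas), fast_quorum(replicas))
    print(commit_path(4, replicas))
    print(commit_path(3, replicas))
```

With 5 replicas this gives a classic quorum of 3 and a fast quorum of 4, so a conflict that splits the replicas 3 to 2 cannot be settled on the fast path and has to go through the fallback.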
Unlike most replicated database systems, MDCC doesn't require explicit sharding of data into multiple segments. But it can be supported on MDCC if needed. Also unlike most cloud data stores, MDCC has excellent support for atomic multi-row (multi-object) transactions. That is multiple data objects can be updated atomically within a single read-write transaction. All these interesting properties make MDCC an excellent choice for implementing powerful database engines for modern day distributed and cloud computing environments.
Our implementation of MDCC is based on Java. We use Apache Thrift as the communication framework between different components. ZooKeeper is used for leader election purposes (we need to elect a per-object leader whenever there is a conflict). HBase server is used as the storage engine. All the application data and metadata are stored in HBase. In order to reduce the number of storage accesses we also have a layer of in-memory caching. All the critical information and updates are written through to the underlying HBase server to maintain strong consistency. The cache still helps to avoid a large fraction of storage references. Our experiments show that most read operations are able to complete without ever going to HBase layer. 
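The write-through idea is simple enough to sketch in a few lines. The snippet below is in Python rather than Java and is not taken from our code base. CountingStore is a hypothetical stand-in for the storage layer (it only counts accesses), and all the names are invented for illustration.

```python
class CountingStore(object):
    """Hypothetical stand-in for the storage engine; counts accesses."""

    def __init__(self):
        self.data = {}
        self.reads = 0
        self.writes = 0

    def read(self, key):
        self.reads += 1
        return self.data.get(key)

    def write(self, key, value):
        self.writes += 1
        self.data[key] = value


class WriteThroughCache(object):
    """Every write goes to the store first; reads are served from memory."""

    def __init__(self, store):
        self.store = store
        self.cache = {}

    def put(self, key, value):
        # Write through to the underlying store before updating the cache
        self.store.write(key, value)
        self.cache[key] = value

    def get(self, key):
        if key in self.cache:
            return self.cache[key]
        # Cache miss: fall back to the store and remember the result
        value = self.store.read(key)
        if value is not None:
            self.cache[key] = value
        return value


if __name__ == '__main__':
    store = CountingStore()
    cache = WriteThroughCache(store)
    cache.put('foo', b'bar')
    print(cache.get('foo'), store.reads, store.writes)
```

Because the store is updated before the cache, a crash never leaves the cache holding data that the store has not seen, while repeated reads of hot keys stay in memory.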
We provide a simple and intuitive API in our MDCC implementation so that users can write their own applications using our MDCC engine. A simple transaction implemented using this API would look like this.
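        TransactionFactory factory = new TransactionFactory();
        Transaction txn = factory.create();
        try {
            // Assumption: this line is garbled in the source; a read call mirroring write() is assumed
            byte[] foo = txn.read("foo");
            txn.write("bar", "bar".getBytes());
        } catch (TransactionException e) {
            // Handle or report the failed transaction here
        } finally {
            // Release any resources held by the transaction here
        }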
We also did some basic performance tests on our MDCC implementation using the YCSB benchmark. We used 5 EC2 micro instances distributed across 3 data centers (regions) and deployed a simple 2-shard MDCC cluster. Each shard consisted of 5 MDCC storage nodes (amounting to a total of 10 MDCC storage nodes). We ran several different types of workloads on this cluster and in general succeeded in achieving < 1ms latency for read operations and < 100ms latency for write operations. Our implementation performs best with mostly-read workloads, but even with a fairly large number of conflicts, the system delivers reasonable performance. 
Our system ensures correct and consistent transaction semantics. We have excellent support for atomic multi-row transactions, concurrent transactions and even some rudimentary support for crash recovery. If you are interested in giving this implementation a try, grab the source code, build the distribution with Maven 3, extract it and run.