Saturday, November 10, 2012

Flexible Cloud Computing with AppScale

I recently started contributing to AppScale, an open source project aimed at developing a scalable Platform-as-a-Service (PaaS) solution. The AppScale project was initiated at UC Santa Barbara with the intention of implementing an open cloud PaaS that would enable more research in the area of cloud computing. Over the years, however, AppScale has evolved rapidly and gathered a wide range of features, and many enterprise users now find it useful as a platform that facilitates private, public and hybrid cloud deployments.
One of the most attractive features of AppScale is the flexibility it provides to cloud administrators and cloud application developers. Cloud administrators can deploy AppScale on a variety of infrastructure setups. It can be deployed on virtualized computing clusters based on solutions such as Xen and KVM. AppScale also runs on Infrastructure-as-a-Service (IaaS) solutions such as Amazon EC2 and Eucalyptus (it is worth mentioning that Eucalyptus also started out as a research project at UC Santa Barbara). If needed, AppScale can also be deployed directly on physical hardware without the support of any virtualization service. AppScale comes with a Ruby API and a set of command line tools that can be used to deploy AppScale clouds on any of the above infrastructure setups with minimal human intervention. A single shell command is all it takes to deploy even a 100-node AppScale cloud. To make this process even easier, I recently implemented a new web UI component which allows users to deploy AppScale without worrying about the infrastructure complexities at all (more on this in a future blog post).
As a PaaS offering, AppScale exports a wide range of services for the cloud application developers to use in their applications:
  • Datastore - Persistent storage for application data. Generally operates as a replicated key-value store with support for range queries and transactions within entity groups.
  • Namespace - Facilitates segmenting data into multiple partitions. Can be used in scenarios where certain data items need to be separated from each other (e.g., production data vs. test data).
  • Memcache - Distributed cache. Useful in developing stateful applications and improving application performance.
  • Blobstore - Persistent storage for large data objects and files.
  • XMPP - Provides instant messaging capabilities to AppScale applications.
  • Channel - Allows pushing data into client's JavaScript code.
  • Users - User account creation and profile management.
  • Mail - Facilitates sending e-mails from applications.
  • Images - Supports programmatic manipulation of images.
  • URL Fetch - Facilitates consuming local and remote REST APIs.
  • Task Queue - Facilitates asynchronous execution of long running jobs.
All these fundamental services are fully API compatible with Google App Engine (GAE). Therefore any GAE application can be deployed on AppScale with zero modifications. This has two very interesting outcomes for users. First, it makes it simple for users to migrate from GAE to their own private or hybrid cloud offering based on AppScale. Second, it makes developing applications for AppScale quite straightforward, as all GAE APIs are very well documented and come with a powerful SDK. As a result, AppScale has quickly gathered a large number of sample applications and a very large community of developers actively writing apps for it. Just like in GAE, applications for AppScale can be developed in Java, Python or Go. Another interesting aspect is that AppScale allows a wide range of database systems to be used underneath its Datastore API. Currently supported database systems include Cassandra, HBase, Hypertable, MongoDB, MemcacheDB, MySQL Cluster, Voldemort and Redis. This is one area where the flexibility of the AppScale architecture can be observed clearly, as it enables cloud administrators to set up AppScale with any one of these database solutions depending on their application requirements and organizational standards.
One thing that should be stressed is that AppScale is not just about running GAE applications. It facilitates deploying a wide range of other applications in the cloud too. This is mainly enabled by Neptune, a software overlay that runs on top of AppScale. It consists of a domain specific language that allows developers to execute arbitrary programs in the AppScale cloud PaaS. These may include standalone programs written in any language, MapReduce jobs and high performance computing applications developed using technologies such as MPI, UPC, X10 and StochKit. The ability of AppScale and Neptune to execute high performance computing applications in the cloud has attracted a lot of attention from the scientific research community. It enables long running, resource intensive tasks to be executed in the cloud using as many nodes as required, which greatly reduces task completion time and eliminates the need to procure expensive server grade hardware.
On top of all this flexibility, AppScale also provides excellent fault-tolerance and autoscaling capabilities. All the critical services such as the database and the application server can be easily replicated for high availability. ZooKeeper is used to keep track of all the active services and nodes, and automatic failover is performed upon detecting failures. The AppScale autoscaler component keeps track of resource utilization and system performance metrics, and spins up new nodes dynamically as the demand changes over time. The autoscaler is another very flexible component in the AppScale architecture, in that it allows cloud administrators to engage custom autoscaling policies depending on their application performance and scalability requirements. Some of the built-in autoscaling policies include HA aware autoscaling, QoS aware autoscaling and cost aware autoscaling. If needed, more than one autoscaling policy can be engaged at once with an administrator defined priority arrangement.
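To make the idea of prioritized autoscaling policies a bit more concrete, here is a minimal Python sketch. It is not AppScale's actual autoscaler interface, which I have not shown here. The `Metrics` and `Policy` classes, the three policy functions, the thresholds, and the rule that the highest-priority policy with an opinion wins are all assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Metrics:
    """Hypothetical snapshot of the metrics an autoscaler might track."""
    healthy_appservers: int
    avg_response_ms: float
    hourly_cost: float


@dataclass
class Policy:
    """Hypothetical policy: a name, a priority, and a decision function."""
    name: str
    priority: int  # lower number = consulted first (assumption)
    # Returns a node delta (+1 add, -1 remove) or None for "no opinion".
    decide: Callable[[Metrics], Optional[int]]


def ha_policy(m: Metrics) -> Optional[int]:
    # HA aware: keep at least two healthy app servers (illustrative threshold).
    return 1 if m.healthy_appservers < 2 else None


def qos_policy(m: Metrics) -> Optional[int]:
    # QoS aware: add a node when responses get slow (illustrative threshold).
    return 1 if m.avg_response_ms > 500 else None


def cost_policy(m: Metrics) -> Optional[int]:
    # Cost aware: remove a node when spending is too high (illustrative threshold).
    return -1 if m.hourly_cost > 10.0 else None


def scaling_decision(policies: List[Policy], metrics: Metrics) -> Tuple[Optional[str], int]:
    """Consult policies in priority order; the first one with an opinion wins."""
    for policy in sorted(policies, key=lambda p: p.priority):
        delta = policy.decide(metrics)
        if delta is not None:
            return policy.name, delta
    return None, 0  # no policy asked for a change


if __name__ == "__main__":
    engaged = [
        Policy("cost", 3, cost_policy),
        Policy("ha", 1, ha_policy),
        Policy("qos", 2, qos_policy),
    ]
    print(scaling_decision(engaged, Metrics(3, 800.0, 20.0)))
```

In this sketch the administrator defined priority decides conflicts: with the sample metrics, both the QoS and the cost policies want a change, and the QoS policy is consulted first because it has the lower priority number.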
One of my personal favorite features of AppScale is its placement support. This is the ability of the PaaS to smartly place cloud services given a set of nodes. For example, if we start an AppScale instance using three nodes (that is, three physical or virtual machines), it will place an application server component and a database component on each of them. One of the database components will act as the master and the others will act as slaves. Automatic data replication will be enabled among all database components. One of the three nodes will be designated as the head node, and the load balancer and ZooKeeper will be deployed on that node. Note that AppScale attempts to use the nodes as effectively as possible by replicating all the critical services. The actual placement strategy, however, is also configurable in AppScale. If the administrator does not explicitly state a placement strategy, we can rely on AppScale to figure out a suitable one on its own.
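The three-node example above can be sketched in a few lines of Python. This is only an illustration of the default layout as I described it, not AppScale's real placement code. The role names, the `NodePlacement` class and the `default_placement` function are hypothetical, and I am assuming for simplicity that the first node is the head node and also hosts the database master.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class NodePlacement:
    """Roles assigned to a single node (hypothetical structure)."""
    host: str
    roles: List[str] = field(default_factory=list)


def default_placement(hosts: List[str]) -> List[NodePlacement]:
    """Assign roles to nodes following the three-node example in the text.

    Every node gets an application server and a database component.
    Assumption: the first node is the head node, holds the database
    master, and also runs the load balancer and ZooKeeper.
    """
    if not hosts:
        raise ValueError("at least one node is required")

    placements = []
    for index, host in enumerate(hosts):
        is_head = index == 0
        roles = ["appserver", "db_master" if is_head else "db_slave"]
        if is_head:
            roles += ["load_balancer", "zookeeper"]
        placements.append(NodePlacement(host, roles))
    return placements


if __name__ == "__main__":
    for node in default_placement(["node-1", "node-2", "node-3"]):
        print(node.host, node.roles)
```

A custom placement strategy would simply be a different function of the same shape, taking the list of nodes and returning the roles each one should run.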
If my introduction to AppScale has made you want to try it out, feel free to grab the latest stable source from our GitHub repo. Detailed instructions on building the source and creating your own AppScale machine images can be found in the following wiki pages:
If you need a more ready-to-roll distribution of AppScale to take a quick look, check out our public EC2 image ami-52912a3b, which is preloaded with AppScale 1.6.3. If you already have an EC2 account, you can simply set up the AppScale command line tools on your computer and start an AppScale instance in EC2 using the tools and the above AMI.
I will roll out a couple of detailed blog posts on setting up AppScale in the near future, so stay tuned.