Friday, April 5, 2013

MDCC - Strong Consistency with Performance

A few weeks back a couple of my colleagues and I finished developing a complete implementation of the MDCC (Multi-Data Center Consistency) protocol. MDCC is a fast commit protocol proposed by UC Berkeley for large-scale geo-replicated databases. The main advantage of MDCC is that it supports strong consistency for data while providing transaction performance similar to eventually consistent systems.
With traditional distributed commit protocols, supporting strong consistency usually requires executing complex distributed consensus algorithms (e.g. Paxos). Such algorithms generally require multiple rounds of communication. Therefore when deployed in a multi-data center setting where the inter-data center latency is close to 100ms, the performance of the transactions being executed degrades to almost unacceptable levels. For this reason most replicated database systems and cloud data stores have opted to support a weaker notion of consistency. This greatly speeds up the transactions but you always run the risk of data becoming inconsistent or even lost.
MDCC employs a special variant of Paxos called Fast Paxos. Fast Paxos takes a rather optimistic approach by which it is able to commit most transactions within a single network roundtrip. This way a data object update can be replicated to any number of data centers within a single request-response window. The protocol is also effectively masterless, which means that if the application is executing in a data center in Europe, it does not have to contact a special master server which could potentially reside in a data center in the USA. The only time this protocol doesn't finish within a single request-response window is when two or more transactions attempt to update the same data object (transaction conflict). In that case a per-object master is elected and the Classic Paxos protocol is invoked to resolve the conflict. If the possibility of a conflict is small, MDCC will commit most transactions within a single network roundtrip, thus greatly improving the transaction throughput and latency.
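To make the optimistic path a bit more concrete, here is a toy Python model of it. This is only a sketch and not our actual implementation: the Acceptor class, the version counters and the function names are all made up for illustration, and the fast quorum size of ceil(3n/4) is an assumption taken from the usual Fast Paxos setup.

```python
class Acceptor:
    """Toy storage node: one version counter per data object (hypothetical)."""

    def __init__(self):
        self.versions = {}

    def vote(self, key, read_version):
        # Vote yes only if the object is unchanged since the transaction read it.
        return self.versions.get(key, 0) == read_version

    def apply(self, key):
        # Record the accepted update by bumping the object's version.
        self.versions[key] = self.versions.get(key, 0) + 1


def fast_quorum_size(n):
    # Assumption: a fast quorum needs ceil(3n/4) of the n acceptors.
    return (3 * n + 3) // 4


def commit(acceptors, key, read_version):
    """Return 'fast' for a single-roundtrip commit, 'conflict' otherwise."""
    voters = [a for a in acceptors if a.vote(key, read_version)]
    if len(voters) >= fast_quorum_size(len(acceptors)):
        for a in voters:
            a.apply(key)
        return "fast"
    # In MDCC this is the point where a per-object master would be elected
    # and Classic Paxos would take over; the toy model just reports it.
    return "conflict"
```

With five acceptors, a transaction that read version 0 of an object commits on the fast path, while a second transaction that still holds the stale version 0 is reported as a conflict.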
Unlike most replicated database systems, MDCC doesn't require explicit sharding of data into multiple segments. But it can be supported on MDCC if needed. Also unlike most cloud data stores, MDCC has excellent support for atomic multi-row (multi-object) transactions. That is multiple data objects can be updated atomically within a single read-write transaction. All these interesting properties make MDCC an excellent choice for implementing powerful database engines for modern day distributed and cloud computing environments.
Our implementation of MDCC is based on Java. We use Apache Thrift as the communication framework between different components. ZooKeeper is used for leader election purposes (we need to elect a per-object leader whenever there is a conflict). HBase server is used as the storage engine. All the application data and metadata are stored in HBase. In order to reduce the number of storage accesses we also have a layer of in-memory caching. All the critical information and updates are written through to the underlying HBase server to maintain strong consistency. The cache still helps to avoid a large fraction of storage references. Our experiments show that most read operations are able to complete without ever going to the HBase layer.
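The write-through idea can be sketched in a few lines of Python. This is an illustration only: the WriteThroughCache class is hypothetical, and a plain dict stands in for HBase.

```python
class WriteThroughCache:
    """Minimal write-through cache in front of a dict-like store (hypothetical)."""

    def __init__(self, store):
        self.store = store      # stand-in for the HBase layer
        self.cache = {}
        self.store_reads = 0    # counts how often a read falls through to the store

    def read(self, key):
        if key in self.cache:
            return self.cache[key]
        self.store_reads += 1
        value = self.store.get(key)
        if value is not None:
            self.cache[key] = value
        return value

    def write(self, key, value):
        # Write to the store first, then update the cache, so the cache
        # never holds an update that the store has not seen.
        self.store[key] = value
        self.cache[key] = value
```

Every write reaches the backing store, but repeated reads of the same key are served from memory, which is why most reads never need to touch the storage layer.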
We provide a simple and intuitive API in our MDCC implementation so that users can write their own applications using our MDCC engine. A simple transaction implemented using this API would look like this.
        TransactionFactory factory = new TransactionFactory();
        Transaction txn = factory.create();
        try {
            txn.begin();
            byte[] foo = txn.read("foo");
            txn.write("bar", "bar".getBytes());
            txn.commit();
        } catch (TransactionException e){
            reportError(e);
        } finally {
            factory.close();
        }
We also did some basic performance tests on our MDCC implementation using the YCSB benchmark. We used 5 EC2 micro instances distributed across 3 data centers (regions) and deployed a simple 2-shard MDCC cluster. Each shard consisted of 5 MDCC storage nodes (amounting to a total of 10 MDCC storage nodes). We ran several different types of workloads on this cluster and in general succeeded in achieving < 1ms latency for read operations and < 100ms latency for write operations. Our implementation performs best with mostly-read workloads, but even with a fairly large number of conflicts, the system delivers reasonable performance. 
Our system ensures correct and consistent transaction semantics. We have excellent support for atomic multi-row transactions, concurrent transactions and even some rudimentary support for crash recovery. If you are interested in giving this implementation a try, grab the source code from https://github.com/hiranya911/mdcc. Use Maven3 to build the distribution, extract and run.

Monday, March 11, 2013

Starting HBase Server Programmatically

I'm implementing a database application these days and for that I wanted to programmatically start and stop a standalone HBase server. More specifically I wanted to make HBase server a part of my application so that whenever my application starts, HBase server also starts up. This turned out to be more difficult than I thought it would be. To start an HBase server you actually need to start three things:
1. HBase master server
2. HBase region server
3. ZooKeeper
The default startup script shipped with the HBase binary distribution does all this for you. But I wanted a more tightly integrated and a fully programmatic solution. Unfortunately the HBase public API doesn't seem to expose the functionality required for programmatically starting and stopping the above components (at least not in a straightforward manner). So after going through the HBase source and trying out various things, I managed to come up with some code that does exactly what I want. At a high level, this is what my code does:
1. Create an instance of HQuorumPeer and execute it on a separate thread.
2. Create and initialize a HBaseConfiguration instance.
3. Create an instance of HMaster and execute it on a separate thread.
4. Create an instance of HRegionServer and execute it on a separate thread.
Both HMaster and HRegionServer implement the Runnable interface. Therefore it's easy to run them on separate threads. I created a simple Java Executor instance and scheduled HMaster and HRegionServer for execution on it. But HQuorumPeer was a bit tricky. This class only contains a main method and exposes no public API. So one solution is to create your own thread class, which simply invokes the above mentioned main method. The other option is to write your own HQuorumPeer class implementing the Runnable interface. The original HQuorumPeer class from the HBase project is fairly small, so I took the second approach. I simply copied the code from the original HQuorumPeer and created my own HQuorumPeer implementing the Runnable interface. Overall this is what my finalized code looks like:
        
        exec.submit(new HQuorumPeer(properties));
        log.info("HBase ZooKeeper server started");
        
        Configuration config = HBaseConfiguration.create();
        File hbaseDir = new File(hbasePath, "data");
        config.set(HConstants.HBASE_DIR, hbaseDir.getAbsolutePath());
        for (String key : properties.stringPropertyNames()) {
            if (key.startsWith("hbase.")) {
                config.set(key, properties.getProperty(key));
            } else {
                String name = HConstants.ZK_CFG_PROPERTY_PREFIX + key;
                config.set(name, properties.getProperty(key));
            }
        }

        try {
            master = new HMaster(config);
            regionServer = new HRegionServer(config);
            masterFuture = exec.submit(master);
            regionServerFuture = exec.submit(regionServer);
            log.info("HBase server is up and running...");
        } catch (Exception e) {
            handleException("Error while initializing HBase server", e);
        }
Then I nicely wrapped up all this logic into a single reusable util class called HBaseServer. So whenever I want to start/stop HBase in my application, this is all I have to do.
HBaseServer hbaseServer = new HBaseServer();
hbaseServer.start();
Hope somebody finds this useful :)

Tuesday, February 5, 2013

How the World's Fastest ESB was Made

A couple of years ago, at WSO2 we implemented a new HTTP transport for WSO2 ESB. Requirements for this new transport can be summarized as follows:
  1. Ultra-fast, low latency mediation of HTTP requests.
  2. Supporting a very large number of inbound (client-ESB) and outbound (ESB-server) connections concurrently (we were looking at several thousand concurrent connections).
  3. Automatic throttling and graceful performance degradation in the presence of slow or faulty clients and servers.
The default non-blocking HTTP (NHTTP) transport from Apache Synapse, which we were also using in WSO2 ESB, supported the above requirements up to a certain extent but we wanted to do better. The default transport was very generic and it was designed to offer reasonable performance in all the integration scenarios the ESB could potentially participate in. However HTTP load balancing, HTTP URL routing (URL rewriting) and HTTP header-based routing are some of the most widely used integration patterns in the industry and to support these use cases well, we needed a specialized transport. 
The old NHTTP transport was based on a dual buffer model. Incoming message content was placed in a SharedInputBuffer and the outgoing message content was placed in a SharedOutputBuffer. Apache Axiom, Apache Axis2 and the Synapse mediation engine sit between the two buffers, reading from the input buffer and writing to the output buffer. This architecture is illustrated in the following diagram.
The key advantage of this architecture is that it enables the ESB (mediators) to intercept all the messages and manipulate them in any way necessary. The main downside is every message happens to go through the Axiom layer, which is not really necessary in cases like HTTP load balancing and HTTP header-based routing. Also the overhead of moving data from one buffer to another was not always justifiable in this model. So when we started working on the new HTTP transport we wanted to get rid of these limitations. We knew that this might result in a not-so-generic HTTP transport, but we were willing to pay that price at the time.
So after some very interesting brainstorming sessions, an exciting 1-week long hackathon followed by several months of testing, bug-fixing and refactoring we came up with what’s today known as the HTTP pass-through transport. This transport was based on a single buffer model and completely bypassed the Axiom layer. The resulting architecture is illustrated below.
The HTTP pass-through transport was first released in June 2011 along with WSO2 ESB 4.0. Back then it was disabled by default and the user had to enable it by uncommenting a few entries in the axis2.xml file. The performance numbers we were seeing with the new transport were simply remarkable. WSO2 also published some of these benchmarking results in a March 2012 article. However at this point the 2 main limitations in the new transport were starting to give us headaches.
  1. Configuration overhead (Users had to explicitly enable the transport depending on their target use cases)
  2. Cannot support any integration scenario that requires HTTP content manipulation (because Axiom was bypassed, any mediator attempting to access the message payload would not get anything useful to work with)
In addition to these technical issues there were other process related issues that we had to deal with. For instance maintaining two separate HTTP transports was twice as much work for the developers and testers. We found that because the pass-through transport was not used as the default, it often lagged behind the default NHTTP transport in terms of features and stability. So after a few brainstorming sessions we decided to try and make the pass-through transport the default HTTP transport in Apache Synapse/WSO2 ESB. But this required making the content manipulation use cases (content aware use cases) work with the new transport. This implied bringing Axiom back into the picture, the very thing we wanted to avoid in our initial implementation. So in order to balance out our performance and heterogeneous integration requirements we came up with the idea of “on-demand message parsing in the mediation engine”.
In this new model, each mediator instance belongs to one of two classes.
  1. Content-unaware mediators – Mediators that never access the message content in any way (eg: drop mediator)
  2. Content-aware mediators – Mediators that always access the message content (eg: xslt mediator)
We also identified a third class known as conditionally content-aware mediators. These mediators could be either content-aware or content-unaware depending on their exact instance configuration. For example, a simple log mediator instance configured as <log/> is content-unaware. However a log mediator configured as <log level="full"/> would be content-aware since it’s expected to log the message payload. Similarly a simple property mediator instance such as <property name="foo" value="bar"/> is content-unaware but <property name="foo" expression="/some/xpath"/> could be content-aware depending on what the XPath expression does. In order to capture this content-awareness characteristic of mediator instances at runtime, we introduced a new method (isContentAware) to the top level Mediator interface of Synapse. The default implementation in the AbstractMediator class returns true so as to maintain backward compatibility.
With this change in place we modified the mediation engine to check the content-awareness property of each mediator at runtime before submitting a message to it. List mediators such as the SequenceMediator would run the check recursively on their child mediators to obtain the final value. Assuming that messages are always received through the pass-through HTTP transport, the mediation engine would invoke a special message parsing routine whenever a mediator is detected to be content-aware. It is in this special routine that we bring Axiom into the picture. Therefore if none of the mediators in a given flow or a service is content-aware, the pass-through transport works as it usually does without ever engaging Axiom. But whenever a content-aware mediator is involved, we bring Axiom in. This way we can reap the performance benefits of the pass-through transport while supporting all integration scenarios of the ESB. Since we engage Axiom on-demand we get the best possible outcome for all scenarios. For instance a simple pass through proxy would always work without any Axiom interactions. An XSLT proxy that transforms requests would engage Axiom only in the request flow. Response flow would operate without parsing the messages.
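The content-awareness check can be pictured with a small Python model. The real thing lives in the Java code of Synapse, so everything here (class names, the is_content_aware method, the Message type and the dispatch function) is a hypothetical stand-in, and decoding the raw bytes merely plays the role of building the Axiom tree.

```python
class Mediator:
    def is_content_aware(self):
        # Default to True so that existing mediators keep working unchanged.
        return True


class DropMediator(Mediator):
    def is_content_aware(self):
        return False  # never touches the payload


class LogMediator(Mediator):
    def __init__(self, level="simple"):
        self.level = level

    def is_content_aware(self):
        # Conditionally content-aware: only the "full" level logs the payload.
        return self.level == "full"


class SequenceMediator(Mediator):
    def __init__(self, children):
        self.children = list(children)

    def is_content_aware(self):
        # A list mediator is content-aware if any of its children is.
        return any(child.is_content_aware() for child in self.children)


class Message:
    def __init__(self, raw):
        self.raw = raw        # bytes as received by the transport
        self.parsed = None    # filled in only when somebody needs it


def dispatch(mediator, message):
    # Parse on demand: only pay the parsing cost for content-aware flows.
    if mediator.is_content_aware() and message.parsed is None:
        message.parsed = message.raw.decode("utf-8")
    return message
```

A sequence holding a plain log mediator and a drop mediator leaves the message unparsed, while switching the log mediator to the "full" level triggers the parse.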
Another tricky problem we encountered was dealing with message parsing itself. For instance how do we parse a message and then send it out when there is only one buffer provided by the underlying pass-through transport? Ideally we need two buffers to read the incoming message from and write the outgoing message to. Also the fact that the Axis2 message builder framework can only handle streams posed a few problems. The buffer we maintained in the pass-through transport was a Java NIO ByteBuffer instance. So we needed to adapt the buffer into a stream implementation whenever the mediation engine engages Axiom. We solved the first problem by implementing our message builder routine to create a second output buffer whenever Axiom is dragged into the picture. The outgoing messages are serialized into this second buffer and the pass-through transport was modified to pick the outgoing content from the second buffer when it’s available. Writing an InputStream implementation that can wrap a ByteBuffer instance solved the second problem.
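The buffer-to-stream adaptation is easy to picture with a Python analogue. The transport itself is Java (an InputStream over a NIO ByteBuffer), so this is just a sketch of the same idea using Python's io module; the BufferInputStream name is made up for illustration.

```python
import io


class BufferInputStream(io.RawIOBase):
    """Read-only stream view over an existing in-memory buffer (hypothetical)."""

    def __init__(self, buffer):
        super().__init__()
        self._view = memoryview(buffer)
        self._pos = 0

    def readable(self):
        return True

    def readinto(self, target):
        # Copy as many bytes as fit into the caller's buffer and advance.
        remaining = len(self._view) - self._pos
        count = min(len(target), remaining)
        target[:count] = self._view[self._pos:self._pos + count]
        self._pos += count
        return count  # 0 signals end of stream
```

Any code that expects a file-like object can now consume the buffer without the contents being copied into a separate stream object first.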
One last problem that needed to be solved was handling security. In Synapse/WSO2 ESB, security is handled by Apache Rampart, which runs as an Axis2 module that intercepts the messages before they hit the mediation engine. So on-demand parsing at the mediation engine doesn’t work in this scenario. We need to parse the messages before Rampart intercepts them. We solved this issue by introducing a new smart handler to the Axis2 handler chain, which intercepts every message and performs an early parse if security is engaged on the flow. The same solution can be extended to support other modules that require parsing message payload in the Axis2 handler chain.
I decided to write this post because the WSO2 folks just released WSO2 ESB 4.6, and this release is based on the new model I’ve described here. Pass-through transport is what the users now get by default. The WSO2 team has also published some performance figures that clearly indicate what the new design is capable of. It turns out the latest release of WSO2 ESB outperforms all the major open source ESBs by a significant margin. This release also comes with a new XSLT mediator (Fast XSLT) that operates on top of the pass-through model of the underlying transport and a new streaming XPath implementation based on Antlr.
The next step of this effort would be to get these improvements integrated into the Apache Synapse code base. This work is already underway and you can monitor its progress through SYNAPSE-913 and SYNAPSE-920.

Monday, January 28, 2013

Introducing AppsCake: Makes Deploying AppScale a Piece of Cake

One of my very first contributions to AppScale was a component named AppsCake. AppsCake is a dynamic web component, which provides a web frontend for the command-line AppScale Tools. It enables the users to deploy and start AppScale over several different types of infrastructure. This greatly reduces the overhead of starting and managing a PaaS as most of the heavy lifting operations can be performed easily by a click of a button. Users do not need to learn the AppScale Tools commands, nor do they have to be familiar with any command-line interface. With AppsCake, a regular web browser is all you need to initialize AppScale and start deploying applications in the cloud.
As of now AppsCake supports deploying AppScale over virtualized clusters (eg: Xen), Amazon EC2 and Eucalyptus. Users can select the environment in which AppScale should be deployed and provide the required credentials and other metadata for the target environment through the web interface. AppsCake takes care of invoking the proper command sequences with the appropriate arguments to initialize AppScale. The web frontend also allows the users to view deployment logs and monitor the deployment progress in near real-time. 
This component can be further extended and be offered as a service of its own if needed. That way, users can access AppsCake through a well-known URL and set up an AppScale deployment remotely for the purpose of executing a specific task or an application. As an example consider a group of scientists who want to run various scientific computations in the cloud (say as MPI or MapReduce jobs). The group can use a private Eucalyptus cluster or a shared EC2 account as their computing infrastructure. The group can be provided with a single well-known AppsCake instance as the entry point for AppScale. Then whenever a member of the team wants to run a computation on the target shared environment, he or she can use the AppsCake service to initiate his or her own AppScale instance and run the required computation in the cloud. This scheme maximizes resource sharing while providing sufficient isolation between applications/jobs initiated by individual users.
AppsCake is implemented using Ruby and Sinatra. To try this out, simply check out the source from GitHub, and execute the bin/debian_setup.sh script (the build script only supports Debian/Ubuntu environments as of now). Then execute bin/appscake to start the AppsCake web service. Now you can point your browser to https://localhost:28443 and start interacting with the service.
Chris has posted a neat little screencast that explains how to use AppsCake to deploy AppScale on Virtual Box. Don’t forget to check that out too.

Sunday, January 20, 2013

On Premise API Management for Services in the Cloud

In some of my recent posts I explained how to install and start AppScale. I showed how to use AppScale command-line tools to manage an AppScale PaaS on virtualized environments such as Xen and IaaS environments such as EC2 and Eucalyptus. Then we also looked at how to deploy Google App Engine (GAE) apps over AppScale. In this post we are going to try something different.
Here I’m going to describe a possible hybrid architecture for deploying RESTful services in the cloud and exposing those services through an on-premise API management platform. This type of an architecture is most suitable for B2B integration scenarios where one organization provides a range of services and several other organizations consume them with their own custom use cases and SLAs. Both service providers and service consumers can greatly benefit from the proposed hybrid architecture. It enables the API providers to reap the benefits of the cloud with reduced deployment cost, reduced long-term maintenance overhead and reduced time-to-market. API consumers can use their own on-premise API management platform as a local proxy, which provides powerful access control, rate control, analytics and community features on top of the services already deployed in the cloud. 
To try this out, first spin up an AppScale PaaS in a desired cloud environment. You can refer to my previous posts or go through the AppScale wiki to learn how to do this. Then we can deploy a simple RESTful web service in our AppScale cloud. Here I’m posting the source code for a simple web service called “starbucks” written in Python using the GAE APIs. The “starbucks” service can be used to submit and manage simple drink orders. It uses the GAE datastore API to store all the application data and exposes all the fundamental CRUD operations as REST calls (Create = POST, Update = PUT, Read = GET, Delete = DELETE).
try:
  import json
except ImportError:
  import simplejson as json

import random
import uuid
from google.appengine.ext import db, webapp
from google.appengine.ext.webapp.util import run_wsgi_app

PRICE_CHART = {}

class Order(db.Model):
  order_id = db.StringProperty(required=True)
  drink = db.StringProperty(required=True)
  additions = db.StringListProperty()
  cost = db.FloatProperty()

def get_price(order):
  if PRICE_CHART.has_key(order.drink):
    price = PRICE_CHART[order.drink]
  else:
    price = random.randint(2, 6) - 0.01
    PRICE_CHART[order.drink] = price
  if order.additions is not None:
    price += 0.50 * len(order.additions)
  return price

def send_json_response(response, payload, status=200):
  response.headers['Content-Type'] = 'application/json'
  response.set_status(status)
  if isinstance(payload, Order):
    payload = {
      'id' : payload.order_id,
      'drink' : payload.drink,
      'cost' : payload.cost,
      'additions' : payload.additions
    }
  response.out.write(json.dumps(payload))

class OrderSubmissionHandler(webapp.RequestHandler):
  def post(self):
    order_info = json.loads(self.request.body)
    order_id = str(uuid.uuid1())
    drink = order_info['drink']
    order = Order(order_id=order_id, drink=drink, key_name=order_id)
    if order_info.has_key('additions'):
      additions = order_info['additions']
      if isinstance(additions, list):
        order.additions = additions
      else:
        order.additions = [ additions ]
    else:
      order.additions = []
    order.cost = get_price(order)
    order.put()
    self.response.headers['Location'] = self.request.url + '/' + order_id
    send_json_response(self.response, order, 201)

class OrderManagementHandler(webapp.RequestHandler):
    def get(self, order_id):
      order = Order.get_by_key_name(order_id)
      if order is not None:
        send_json_response(self.response, order)
      else:
        self.send_order_not_found(order_id)

    def put(self, order_id):
      order = Order.get_by_key_name(order_id)
      if order is not None:
        order_info = json.loads(self.request.body)
        drink = order_info['drink']
        order.drink = drink
        if order_info.has_key('additions'):
          additions = order_info['additions']
          if isinstance(additions, list):
            order.additions = additions
          else:
            order.additions = [ additions ]
        else:
          order.additions = []
        order.cost = get_price(order)
        order.put()
        send_json_response(self.response, order)
      else:
        self.send_order_not_found(order_id)

    def delete(self, order_id):
      order = Order.get_by_key_name(order_id)
      if order is not None:
        order.delete()
        send_json_response(self.response, order)
      else:
        self.send_order_not_found(order_id)

    def send_order_not_found(self, order_id):
      info = {
        'error' : 'Not Found',
        'message' : 'No order exists by the ID: %s' % order_id,
      }
      send_json_response(self.response, info, 404)

app = webapp.WSGIApplication([
    ('/order', OrderSubmissionHandler),
    ('/order/(.*)', OrderManagementHandler)
], debug=True)

if __name__ == '__main__':
  run_wsgi_app(app)
Before we go any further let’s take a few seconds and appreciate how simple and concise this piece of code is. With just about 100 lines of Python code we have developed a comprehensive webapp, which uses JSON as the data exchange format and also does database access and provides decent error handling. Imagine doing the same thing in a language like Java in a traditional servlet container environment. We would have to write a lot more code and also bundle a ridiculous amount of additional dependencies to parse and construct JSON and perform database queries. But as seen here, GAE APIs make it absolutely trivial to develop powerful web APIs for the cloud with a minimum amount of code.
You can download the complete “starbucks” application from here. Simply extract the downloaded tarball and you’re good to go. The webapp consists of just 2 files. The main.py contains all the source code of the app and app.yaml is the GAE webapp descriptor. No additional libraries or files are needed to make this work. Use AppScale-Tools to deploy the app in your AppScale cloud.
appscale-upload-app --file /path/to/starbucks --keyname my_key_name
To try out the app, put the following JSON string into a file named order.json:
{
  "drink" : "Caramel Frapaccino",
  "additions" : [ "Whip Cream" ]
}
Now execute the following Curl request on your App:
curl -v -d @order.json -H "Content-type: application/json" http://host:port/order
Replace 'host' and 'port' with the appropriate values for your AppScale PaaS. This request should return an HTTP 201 Created response with a Location header.
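If you would rather script this than use Curl, the same request can be put together with the Python 3 standard library. This is only a sketch: build_order_request is a helper name I made up, and the optional api_key parameter is there purely so that the same helper can also attach an Authorization header when a gateway sits in front of the service.

```python
import json
import urllib.request


def build_order_request(base_url, order, api_key=None):
    """Build (but do not send) a POST request that submits an order."""
    headers = {"Content-Type": "application/json"}
    if api_key is not None:
        # Assumed Bearer-token scheme, only needed when a gateway requires it.
        headers["Authorization"] = "Bearer " + api_key
    body = json.dumps(order).encode("utf-8")
    return urllib.request.Request(
        base_url + "/order", data=body, headers=headers, method="POST")
```

Passing the returned object to urllib.request.urlopen() sends the request; on success the service should answer with the 201 Created response described above.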
And now for the API management part. For this I’m going to use the open source API management solution from WSO2, a project that I was a part of a while ago. Download the latest WSO2 API Manager and install it on your local computer by extracting the zip archive. Go into the bin directory and execute wso2server.sh (or wso2server.bat for Windows) to start the API Manager. You need to have JDK 1.6 or higher installed to be able to do this.
Once the server is up and running, navigate to http://localhost:9763/publisher and sign in to the console using “admin” as both the username and the password. Go ahead and create an API for our “starbucks” service in the cloud. You can use http://host:port as the service URL where 'host' and 'port' should point to the AppScale PaaS. The API creation process should be pretty straightforward. If you need any help, you can refer to my past blog posts on WSO2 API Manager or go through the WSO2 documentation. Once the API is created and published, head over to the API Store at http://localhost:9763/store.
Now you can sign up at the API Store as an API consumer, generate an API key for the Starbucks API and start using it.
Submit Order:
curl -v -d @order.json -H "Content-type: application/json" -H "Authorization: Bearer api_key" http://localhost:8280/starbucks/1.0.0/order
Review Order:
curl -v -H "Authorization: Bearer api_key" http://localhost:8280/starbucks/1.0.0/order/order_id
Delete Order:
curl -v -X DELETE -H "Authorization: Bearer api_key" http://localhost:8280/starbucks/1.0.0/order/order_id
Replace 'api_key' with the API key generated by the API Store. Replace the 'order_id' with the unique identifier sent in the response for the submit order request.
There you have it. On-premise API management for services in the cloud. This looks pretty simple at first glance, but it is actually quite a powerful architecture. Note that all the critical components (service runtime, registry and consumer) are very well separated from each other, which allows maximum flexibility. The portions in the cloud can benefit from cloud specific features such as autoscaling to deliver the maximum throughput with optimal resource utilization. Since the API management platform is controlled by the individual consumer organizations, they can easily enforce their own custom policies and SLAs and optimize for their common access patterns.

Wednesday, January 9, 2013

How to Get Your Third Party APIs to Shutup?

When programming with 3rd party libraries, sometimes we need to suppress or redirect the standard output generated by the 3rd party libraries. A very common scenario is that a third party library we use in an application generates a very verbose output which clutters up the output of our program. With most programming languages we can write a simple suppress/redirect procedure to fix this problem. Such functions are sometimes colloquially known as STFU functions. Here I'm describing a couple of STFU functions I implemented in some of my recent work.

1. AppsCake (Web interface for AppScale-Tools)
This is a Ruby based dynamic web component which uses some of the core AppScale-Tools libraries. For this project I wanted to capture the standard output of the AppScale-Tools libraries and display it on a web page. As the first step I wanted to redirect the standard output of AppScale-Tools to a separate text file. Here's what I did.
def redirect_standard_io(timestamp)
  begin
    orig_stderr = $stderr.clone
    orig_stdout = $stdout.clone
    log_path = File.join(File.expand_path(File.dirname(__FILE__)), "..", "logs")
    # Open the log file once and share it, so that stdout and stderr do not
    # overwrite each other's output through two separate file handles.
    log_file = File.new(File.join(log_path, "deploy-#{timestamp}.log"), "w")
    $stderr.reopen log_file
    $stderr.sync = true
    $stdout.reopen log_file
    $stdout.sync = true
    retval = yield
  rescue Exception => e
    # Output is still redirected here, so the error marker lands in the log file.
    puts "[__ERROR__] Runtime error in deployment process: #{e.message}"
    raise e
  ensure
    # Always restore the original streams, whether or not the block failed.
    $stdout.reopen orig_stdout
    $stderr.reopen orig_stderr
    log_file.close if log_file
  end
  retval
end
Now whenever I want to redirect the standard output and invoke the AppScale-Tools API I can do this.
redirect_standard_io(timestamp) do
   # Call AppScale-Tools API
end
2. Hawkeye (API fidelity test suite for AppScale)
This is a Python based framework which makes a lot of RESTful invocations using the standard Python httplib API. I wanted to trace the HTTP requests and responses that are being exchanged during the execution of the framework and log them to a separate log file. Python httplib has a verbose mode which can be enabled by passing a special flag to the HTTPConnection class and it turns out this mode logs almost all the information I need. But unfortunately it logs all this information to the standard output of the program thus messing up the output I wanted to present to users. Therefore I needed a way to redirect the standard output for all httplib API calls. Here's how that problem was solved.
import sys

http_log = open('logs/http.log', 'a')
original = sys.stdout
sys.stdout = http_log
try:
  pass  # Invoke httplib here
finally:
  sys.stdout = original
  http_log.close()
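On Python 3.4 and later the same pattern is available in the standard library as contextlib.redirect_stdout, so the bookkeeping above can be wrapped in a small helper. This is a sketch rather than the code Hawkeye uses; call_with_stdout_redirected is a name I made up.

```python
import contextlib
import io


def call_with_stdout_redirected(func, log_stream):
    """Run func() while everything it prints goes to log_stream."""
    with contextlib.redirect_stdout(log_stream):
        return func()
```

The stream can be an open log file or an io.StringIO instance if you want to inspect the captured output afterwards. Standard output is restored automatically when the with block exits, even if func raises an exception.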