Friday, March 20, 2015

QBETS: A Time Series Analysis and Forecasting Method

Today I’m going to share some details on an analytics technology I’ve been using for my research.
QBETS (Queue Bounds Estimation from Time Series) is a non-parametric time series analysis method. The basic idea behind QBETS is to analyze a time series and predict its p-th percentile, where p is a user-specified parameter. QBETS learns from the existing data points in the input time series, and estimates a p-th percentile value such that the next data point in the series has a probability of 0.01p of being less than or equal to the estimate.
For example, suppose we have the following input time series, and we wish to predict the 95th percentile of it:

A0, A1, A2, …, An

If QBETS predicts the value Q as the 95th percentile, we can say that An+1 (the next data point that the generating process will add to the time series) has a 95% chance of being less than or equal to Q.

P(An+1 <= Q) = 0.01p = 0.95 (for p = 95)

Since QBETS cannot determine percentile values exactly, but must estimate them, it uses another parameter c (0 < c < 1) as an upper confidence bound on the estimated values. That is, if QBETS is used to estimate the p-th percentile of a time series with upper confidence c, it will overestimate the p-th percentile with probability 1 – c. For instance, if c = 0.05, QBETS will generate predictions that overestimate the true p-th percentile 95% of the time. We primarily use the parameter c to control how conservative QBETS should be when predicting percentiles.
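To make the roles of p and c concrete, here is a small Go sketch of the order-statistic idea that this family of methods is built on: pick the smallest order statistic of the sample that exceeds the true p-th percentile with probability at least c, which can be computed from the binomial distribution. This is a deliberate simplification for illustration (it assumes i.i.d. data and ignores change points and the sequential updating QBETS performs), not the published QBETS algorithm:

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// logChoose returns log(C(n, k)) via the log-gamma function.
func logChoose(n, k int) float64 {
	a, _ := math.Lgamma(float64(n + 1))
	b, _ := math.Lgamma(float64(k + 1))
	c, _ := math.Lgamma(float64(n - k + 1))
	return a - b - c
}

// binomCDF returns P(X <= k) for X ~ Binomial(n, p).
func binomCDF(k, n int, p float64) float64 {
	sum := 0.0
	for i := 0; i <= k; i++ {
		sum += math.Exp(logChoose(n, i) +
			float64(i)*math.Log(p) + float64(n-i)*math.Log(1-p))
	}
	return sum
}

// upperBound returns a value that overestimates the true p-th quantile
// with probability at least c, using binomial order statistics. It falls
// back to the sample maximum when there is too little data.
func upperBound(data []float64, p, c float64) float64 {
	sorted := append([]float64(nil), data...)
	sort.Float64s(sorted)
	n := len(sorted)
	// The k-th smallest observation exceeds the true p-th quantile with
	// probability P(Binomial(n, p) < k); pick the smallest k for which
	// that probability reaches the confidence level c.
	for k := 1; k <= n; k++ {
		if binomCDF(k-1, n, p) >= c {
			return sorted[k-1]
		}
	}
	return sorted[n-1]
}

func main() {
	data := make([]float64, 100)
	for i := range data {
		data[i] = float64(i + 1)
	}
	fmt.Println(upperBound(data, 0.95, 0.95)) // prints 99
}
```

Note that with only a handful of data points no order statistic can meet the confidence requirement (the sketch falls back to the maximum), which is one reason methods like this need a reasonable amount of history before their predictions become meaningful.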
QBETS also supports a technique known as change point detection. To understand what this means, let’s look at the following input time series.

7, 8, 7, 7, 9, 8, 7, 7, 15, 15, 16, 14, 16, 17, 15

Here we see a sudden shift in the values after the first 8 data points: the individual values have increased from the 7-9 range to the 14-17 range. QBETS detects such change points in the time series, and discards the data points that precede them. This is necessary to make sure that the predictions are not influenced by old historical values that are no longer relevant to the time series generating process.
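QBETS's actual change-point test is more principled than this, but a naive Go sketch conveys the general idea: compare the mean of a trailing window against the mean of the history before it, and cut the series where they diverge. The window size and threshold below are arbitrary illustration values, not anything from the paper:

```go
package main

import "fmt"

// mean returns the arithmetic mean of xs.
func mean(xs []float64) float64 {
	sum := 0.0
	for _, x := range xs {
		sum += x
	}
	return sum / float64(len(xs))
}

// detectShift returns the start index of a trailing window whose mean
// differs from the mean of everything before it by more than threshold,
// or -1 if no such shift is found.
func detectShift(series []float64, window int, threshold float64) int {
	for i := window; i+window <= len(series); i++ {
		diff := mean(series[i:i+window]) - mean(series[:i])
		if diff > threshold || -diff > threshold {
			return i
		}
	}
	return -1
}

func main() {
	series := []float64{7, 8, 7, 7, 9, 8, 7, 7, 15, 15, 16, 14, 16, 17, 15}
	if cp := detectShift(series, 4, 5); cp >= 0 {
		series = series[cp:] // keep only post-change-point history
	}
	fmt.Println(series)
}
```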
The paper that originally introduced QBETS used it to predict scheduling delays in batch queuing systems for supercomputers and other HPC systems. Over the years researchers have used QBETS with a wide range of datasets, and it has produced positive results in almost all cases. Lately, I have been using QBETS to predict API response times by analyzing historical API performance data. Again, the results have been quite promising.

To learn more about QBETS, go through the paper or contact the authors.

Sunday, January 11, 2015

Creating Eucalyptus Machine Images from a Running VM

I often use the Eucalyptus private cloud platform for my research, and very often I need to start Linux VMs in Eucalyptus and install a whole stack of software on them. This involves a lot of repetitive work, so to save time I prefer creating machine images (EMIs) from fully configured VMs. This post outlines the steps one should follow to create an EMI from a VM running in Eucalyptus (tested on Ubuntu Lucid and Precise VMs).

Step 1: SSH into the VM running in Eucalyptus, if you haven't already.

Step 2: Run the euca-bundle-vol command to create an image file (snapshot) from the VM's root file system.
euca-bundle-vol -p root -d /mnt -s 10240
Here "-p" is the name you wish to give to the image file. "-s" is the size of the image in megabytes. In the above example it is set to 10GB, which also happens to be the largest acceptable value for the "-s" argument. "-d" is the directory in which the image file should be placed. Make sure this directory has enough free space to accommodate the image size specified in "-s".
This command may take several minutes to execute. For a 10GB image, it may take around 3 to 8 minutes. When completed, check the contents of the directory specified in argument "-d". You will see an XML manifest file and a number of image part files in there.

Step 3: Upload the image file to the Eucalyptus cloud using the euca-upload-bundle command.
euca-upload-bundle -b my-test-image -m /mnt/root.manifest.xml
Here "-b" is the name of the bucket (in Walrus key-value store) to which the image file should be uploaded. You don't have to create the bucket beforehand. This command will create the bucket if it doesn't already exist. "-m" should point to the XML manifest file generated in the previous step.
This command requires certain environment variables to be exported (primarily access keys and certificate paths). The easiest way to do that is to copy your eucarc file and the associated keys into the VM and source the eucarc file into the environment.
This command may also take several minutes to complete. At the end, it will output a string of the form "bucket-name/manifest-file-name".

Step 4: Register the newly uploaded image file with Eucalyptus.
euca-register my-test-image/root.manifest.xml
The only parameter required here is the "bucket-name/manifest-file-name" string returned from the previous step. I've noticed that in some cases, running this command from the VM in Eucalyptus doesn't work (you will get a 404 Not Found error). In that case you can simply run the command from somewhere outside the Eucalyptus cloud. If all goes well, the command will return with an EMI ID. At this point you can launch instances of your image using the euca-run-instances command.

Friday, January 2, 2015

Developing Web Services with Go

Golang facilitates implementing powerful web applications and services with a very small amount of code. It can be used to implement both HTML-rendering webapps and XML/JSON web APIs. In this post, I'm going to demonstrate how easy it is to implement a simple JSON-based web service in Go. We are going to implement a simple addition service that takes two integers as input, and returns their sum as output.
package main

import (
        "encoding/json"
        "net/http"
)

type addReq struct {
        Arg1, Arg2 int
}

type addResp struct {
        Sum int
}

func addHandler(w http.ResponseWriter, r *http.Request) {
        decoder := json.NewDecoder(r.Body)
        var req addReq
        if err := decoder.Decode(&req); err != nil {
                http.Error(w, err.Error(), http.StatusBadRequest)
                return
        }
        jsonString, err := json.Marshal(addResp{Sum: req.Arg1 + req.Arg2})
        if err != nil {
                http.Error(w, err.Error(), http.StatusInternalServerError)
                return
        }
        w.Header().Set("Content-Type", "application/json")
        w.Write(jsonString)
}

func main() {
        http.HandleFunc("/add", addHandler)
        http.ListenAndServe(":8080", nil)
}
Let's review the code from top to bottom. First we need to import the JSON and HTTP packages into our code. The JSON package provides the functions for parsing and marshaling JSON messages. The HTTP package enables processing HTTP requests. Then we define two data types (addReq and addResp) to represent the incoming JSON request and the outgoing JSON response. Note how addReq contains two integers (Arg1, Arg2) for the two input values, and addResp contains only one integer (Sum) for holding the total.
Next we define what is called an HTTP handler function, which implements the logic of our web service. This function simply parses the incoming request, and populates an instance of the addReq struct. Then it creates an instance of the addResp struct, and serializes it into JSON. The resulting JSON string is then written out using the http.ResponseWriter object.
Finally, we have a main function that ties everything together, and starts executing the web service. This main function simply registers our HTTP handler with the "/add" URL context, and starts an HTTP server on port 8080. This means any requests sent to the "/add" URL will be dispatched to the addHandler function for processing.
That's all there is to it. You may compile and run the program to try it out. Use curl as follows to send a test request.
curl -v -X POST -d '{"Arg1":5, "Arg2":4}' http://localhost:8080/add
You will get a JSON response back with the total.

Wednesday, December 3, 2014

Controlled Concurrency with Golang

Lately I have been doing a lot of programming in Golang. It is one of those languages that is somewhat difficult to fully grasp at the beginning. But a few hundred lines of code later, you feel like you cannot get enough of it -- very simple syntax, brilliant performance, and very clean and precise API semantics. This language has got it all.
Concurrent programming is one area where Golang really excels. The goroutines that make it trivial to start concurrent threads of execution, channels as a first-class programming construct, and a plethora of built-in utilities and packages (e.g. sync) make the developer's life a lot easier. In this post I'm going to give a brief overview of how to instantiate new threads of execution in Golang. Let's start with a piece of sequential code:
for i := 0; i < len(array); i++ {
  doSomething(array[i])
}
The above code iterates over an array, and calls the function doSomething for each element in the array. But this code is sequential, which means doSomething(n) won't be called until doSomething(n-1) returns. Suppose you want to speed things up a little by running multiple invocations of doSomething in parallel (assuming it is safe to do so -- both control- and data-wise). In Golang this is all you have to do:
for i := 0; i < len(array); i++ {
  go doSomething(array[i])
}
The go keyword will start the doSomething function as a separate concurrent goroutine. But this code change causes an uncontrolled concurrency situation. In other words, the only thing that's limiting the number of parallel goroutines spawned by the program is the length of the array, which is not a good idea if the array has thousands of entries. Ideally, we need to put some kind of a fixed cap on how many goroutines are spawned by the loop. This can be easily achieved by using a channel with a fixed capacity.
c := make(chan bool, 8)
for i := 0; i < len(array); i++ {
  c <- true
  go func(index int) {
    doSomething(array[index])
    <- c
  }(i)
}
We start by creating a channel that can hold at most 8 boolean values. Then inside the loop, whenever we spawn a goroutine, we first send a boolean value (true) into the channel. This operation blocks if the channel is already full (i.e. it already holds 8 elements). Then in the goroutine, we remove an element from the channel just before we return. This little trick makes sure that at most 8 parallel goroutines are active in the program at any given time. If you need to change this limit, you simply change the capacity of the channel. You can set it to a fixed number, or write some code to figure out the optimal value based on the number of CPU cores available in the system.

Saturday, November 22, 2014

Running Python from Python

It has been pointed out to me that I don't blog as often as I used to. So here's a first step towards rectifying that.
In this post, I'm going to briefly describe the support that Python provides for processing, well, Python. If you're using Python for simple scripting and automation tasks, you might often have to load, parse and execute other Python files from your code. While you can always "import" some Python code as a module and execute it, in many situations it is impossible to determine precisely at development time which Python files your code needs to import. Also, some Python scripts are written as simple executable files, which are not ideal for inclusion via import. To deal with such cases, Python provides several built-in features that allow referring to and executing other Python files.
One of the easiest ways to execute an external Python file is the built-in execfile function (available in Python 2 only; Python 3 removed it in favor of exec). This function takes the path to another Python file as its only mandatory argument. Optionally, we can also provide a global and a local namespace. If provided, the external code will be executed within those namespace contexts. This is a great way to exert some control over how certain names mentioned in the external code are resolved (more on this later).
Another way to include some external code in your script is by using the built-in __import__ function. This is the same function that gets called when we use the usual "import" keyword to include some module. But unlike the keyword, the __import__ function gives you lot more control over certain matters like namespaces.
Another way to run some external Python code from your Python script is to first read the external file's contents into memory (as a string), and then use the exec keyword on it. The exec keyword can be used as a function call or as a keyword statement.
code_string = load_file_content('/path/to/')
exec(code_string)
Similar to the execfile function, you have the option of passing custom global and local namespaces. Here's some code I've written for a project that uses the exec keyword:
globals_map = globals().copy()
globals_map['app'] = app
globals_map['assert_app_dependency'] = assert_app_dependency
globals_map['assert_not_app_dependency'] = assert_not_app_dependency
globals_map['assert_app_dependency_in_range'] = assert_app_dependency_in_range
globals_map['assert_true'] = assert_true
globals_map['assert_false'] = assert_false
globals_map['compare_versions'] = compare_versions
try:
    exec(self.source_code, globals_map, {})
except Exception as ex:
    utils.log('Unexpected policy exception: {0}'.format(ex))
Here I first create a clone of the current global namespace and pass it as an argument to the exec function. The clone is discarded at the end of the execution, which makes sure that the code in the external file does not pollute my existing global namespace. I also add some of my own variables and functions (e.g. assert_true, assert_false, etc.) into the global namespace clone, which allows the external code to refer to them as built-in constructs. In other words, the external script can be written in a slightly extended version of Python.
There are other neat little tricks you can do using the constructs like exec and execfile. Go through the official documentation for more details.

Wednesday, May 14, 2014

Java Code Analysis and Optimization with Soot

This is a quick shout out about the project Soot. If you're doing anything even remotely related to static analysis in Java, Soot is the way to go. It's simple, open source, well documented and extremely powerful. Soot can analyze any Java program (source or bytecode), and provide you with the control flow graph (CFG). Here's an example that shows how to construct the CFG for the main method of a class named MyClass.
import soot.*;
import soot.toolkits.graph.*;

SootClass c = Scene.v().loadClassAndSupport("MyClass");
SootMethod m = c.getMethodByName("main");
Body b = m.retrieveActiveBody();
UnitGraph g = new BriefUnitGraph(b);
Once you get your hands on the CFG, you can walk it, search it and do anything else you would normally do with a graph data structure. 
Soot converts Java code into one of four intermediate representations (Jimple, Baf, Shimple and Grimp). These representations are designed to make it easier to analyze programs written in Java. For example, Jimple maps Java code from its typical stack-based model to a three-address model. You can also make modifications/optimizations to the code and try out new ideas for compiler and runtime optimizations. Alternatively, you can "tag" instructions with metadata, which can be helpful in building new development tools with powerful code visualization capabilities.
Soot also provides a set of APIs for performing data flow analysis. These APIs can help you implement anything from live variable analysis to very busy expression analysis and more. And finally, Soot can also be invoked from the command line without having to write any extension code.
So if you have any cool new ideas related to program analysis or optimization, grab the latest version of Soot. Whatever it is that you're trying to do, I'm sure Soot can help you implement it.