Saturday, June 20, 2015

Expose Any Shell Command or Script as a Web API

I implemented a tool that can expose any shell command or script as a simple web API. All you have to specify is the binary (command/script) that needs to be exposed, and optionally a port number for the HTTP server. The source code of the tool is shown below in its entirety. In addition to exposing simple web APIs, this code also demonstrates Golang's built-in logging package, slice-to-varargs conversion, and a couple of other neat tricks.
// This tool exposes any binary (shell command/script) as an HTTP service.
// A remote client can trigger the execution of the command by sending
// a simple HTTP request. The output of the command execution is sent
// back to the client in plain text format.
package main

import (
 "flag"
 "fmt"
 "io/ioutil"
 "log"
 "net/http"
 "os"
 "os/exec"
 "strings"
)

func main() {
 binary := flag.String("b", "", "Path to the executable binary")
 port := flag.Int("p", 8080, "HTTP port to listen on")
 flag.Parse()

 if *binary == "" {
  fmt.Println("Path to binary not specified.")
  return
 }

 l := log.New(os.Stdout, "", log.Ldate|log.Ltime)
 http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
  var argString string
  if r.Body != nil {
   data, err := ioutil.ReadAll(r.Body)
   if err != nil {
    l.Print(err)
    http.Error(w, err.Error(), http.StatusInternalServerError)
    return
   }
   argString = string(data)
  }

  fields := strings.Fields(*binary)
  args := append(fields[1:], strings.Fields(argString)...)
  l.Printf("Command: [%s %s]", fields[0], strings.Join(args, " "))

  output, err := exec.Command(fields[0], args...).Output()
  if err != nil {
   http.Error(w, err.Error(), http.StatusInternalServerError)
   return
  }
  w.Header().Set("Content-Type", "text/plain")
  w.Write(output)
 })

 l.Printf("Listening on port %d...", *port)
 l.Printf("Exposed binary: %s", *binary)
 l.Fatal(http.ListenAndServe(fmt.Sprintf(":%d", *port), nil))
}
Clients invoke the web API by sending HTTP GET or POST requests. They can also pass additional flags and arguments into the command/script wrapped by the web API. The result of the command/script execution is sent back to the client as a plain text payload.
As an example, assume you need to expose the "date" command as a web API. You can simply run the tool as follows:
./bash2http -b date
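Clients can now invoke the API by sending an HTTP request to http://host:8080. For example, using curl:
curl http://host:8080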
The tool will run the "date" command on the server, and send the resulting text back to the client. Similarly, to expose the "ls" command with the "-l" flag (i.e. the long output format), we can execute the tool as follows:
./bash2http -b "ls -l"
Users sending an HTTP request to http://host:8080 will now get a file listing (in the long output format, of course) of the server's current directory. Alternatively, users can POST additional flags and a file path to the web API to get more specific output. For instance:
curl -v -X POST -d "-h /usr/local" http://host:8080
This will return a file listing of the server's /usr/local directory, with human-readable file size information.
You can also use this tool to expose custom shell scripts and other command-line programs. For example, if you have a Python script foo.py which you wish to expose as a web API, all you have to do is:
./bash2http -b "python foo.py"

Monday, June 8, 2015

Exposing a Local Directory Through a Web Server

Have you ever encountered a situation where you had to serve the contents of a directory in the local file system through a web server? Usually this scenario occurs when you want to quickly try out some HTML+JS+CSS combo, or when you want to temporarily share the directory with a remote user. How would you go about doing this? Setting up Apache HTTP server or something similar could take time, and you definitely don't want to write any new code for such a simple goal. Ideally, what you want is a single command that, when executed, starts serving the current directory through a web server.
The good news is, if you have Python installed on your machine, you already have access to such a command:
python -m SimpleHTTPServer 8000
The last argument (8000) is the port number for the HTTP server. This will spawn a lightweight HTTP server, using the current directory as the document root. Hit ctrl+c to kill the server process when you're done with it.
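If you're on Python 3, the same functionality is available through the http.server module:
python3 -m http.server 8000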
Alternatively, you can write your own solution and install it permanently on the system, so you can reuse it in the future. Here's a working solution written in Go:
package main

import (
 "log"
 "net/http"
)

func main() {
 log.Fatal(http.ListenAndServe(":8080", http.FileServer(http.Dir("."))))
}
The port number (8080) is hardcoded into the solution, but it's not that hard to change it.
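For instance, here's a minimal sketch along the same lines that reads the port from a command-line flag instead:
package main

import (
 "flag"
 "fmt"
 "log"
 "net/http"
)

func main() {
 port := flag.Int("p", 8080, "HTTP port to listen on")
 flag.Parse()
 // Serve the current working directory on the specified port.
 log.Fatal(http.ListenAndServe(fmt.Sprintf(":%d", *port), http.FileServer(http.Dir("."))))
}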

Wednesday, May 13, 2015

Using Java Thread Pools

Here's a quick (and somewhat dirty) solution in Java to process a set of tasks in parallel. It does not require any third party libraries. Users can specify the tasks to be executed by implementing the Task interface. Then, a collection of Task instances can be passed to the TaskFarm.processInParallel method. This method will farm out the tasks to a thread pool and wait for them to finish. When all tasks have finished, it will gather their outputs, put them in another collection, and return it as the final outcome of the method invocation.
This solution also provides some control over the number of threads that will be employed to process the tasks. If a positive value is provided as the max argument, it will use a fixed thread pool with an unbounded queue to ensure that no more than 'max' tasks will be executed in parallel at any time. By specifying a non-positive value for the max argument, the caller can request the TaskFarm to use as many threads as needed.
If any of the Task instances throws an exception, the processInParallel method will also throw an exception (Future.get surfaces the task's failure as an ExecutionException).
package edu.ucsb.cs.eager;

import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.concurrent.*;

/**
 * The interface implemented by user-defined tasks. Only the minimal
 * assumed shape is shown here: a single process() method that computes
 * and returns the task's result.
 */
interface Task<T> {
    T process() throws Exception;
}

public class TaskFarm<T> {

    /**
     * Process a collection of tasks in parallel. Wait for all tasks to finish, and then
     * return all the results as a collection.
     *
     * @param tasks The collection of tasks to be processed
     * @param max Maximum number of parallel threads to employ (non-positive values
     *            indicate no upper limit on the thread count)
     * @return A collection of results
     * @throws Exception If at least one of the tasks fails to complete normally
     */
    public Collection<T> processInParallel(Collection<Task<T>> tasks, int max) throws Exception {
        ExecutorService exec;
        if (max <= 0) {
            exec = Executors.newCachedThreadPool();
        } else {
            exec = Executors.newFixedThreadPool(max);
        }

        try {
            List<Future<T>> futures = new ArrayList<>();

            // farm it out...
            for (final Task<T> task : tasks) {
                Future<T> f = exec.submit(new Callable<T>() {
                    @Override
                    public T call() throws Exception {
                        return task.process();
                    }
                });
                futures.add(f);
            }

            List<T> results = new ArrayList<>();

            // wait for the results
            for (Future<T> f : futures) {
                results.add(f.get());
            }
            return results;
        } finally {
            exec.shutdownNow();
        }
    }

}

Tuesday, May 5, 2015

Parsing Line-Oriented Text Files Using Go

The following example demonstrates several features of Golang, such as reading a file line by line (with error handling), deferred statements, and higher-order functions.
package main

import (
 "bufio"
 "fmt"
 "os"
)

func ParseLines(filePath string, parse func(string) (string, bool)) ([]string, error) {
  inputFile, err := os.Open(filePath)
  if err != nil {
    return nil, err
  }
  defer inputFile.Close()

  scanner := bufio.NewScanner(inputFile)
  var results []string
  for scanner.Scan() {
    if output, add := parse(scanner.Text()); add {
      results = append(results, output)
    }
  }
  if err := scanner.Err(); err != nil {
    return nil, err
  }
  return results, nil
}

func main() {
  if len(os.Args) != 2 {
    fmt.Println("Usage: line_parser ")
    return
  }

  lines, err := ParseLines(os.Args[1], func(s string) (string, bool) {
    return s, true
  })
  if err != nil {
    fmt.Println("Error while parsing file", err)
    return
  }

  for _, l := range lines {
    fmt.Println(l)
  }
}
The ParseLines function takes a path (filePath) to an input file, and a function (parse) that is applied to each line read from the file. The parse function should return a (string, bool) pair, where the boolean value indicates whether the string should be added to the final result of ParseLines. The example simply reads and prints all the lines of the input file.
The caller can inject more sophisticated transformation and filtering logic into ParseLines via the parse function. The following invocation filters out all the lines that do not begin with the prefix "[valid]", and extracts the 3rd field from each matching line (assuming a simple whitespace-separated line format; it also requires adding "strings" to the import list).
lines, err := ParseLines(os.Args[1], func(s string) (string, bool) {
   if strings.HasPrefix(s, "[valid] ") {
     return strings.Fields(s)[2], true
   }
   return "", false
})
A function like ParseLines is suitable for parsing small to moderately large files. However, if the input file is very large, ParseLines may run into trouble, since it accumulates all the results in memory.
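For such cases, one option is to push the per-line processing into a callback, so that nothing accumulates in memory. Here's a sketch of that approach (ProcessLines is a hypothetical variant, not part of the code above):
package main

import (
  "bufio"
  "fmt"
  "os"
)

// ProcessLines applies handle to each line as it is read, instead of
// collecting results into a slice, so memory usage stays constant
// regardless of the input file size.
func ProcessLines(filePath string, handle func(string)) error {
  inputFile, err := os.Open(filePath)
  if err != nil {
    return err
  }
  defer inputFile.Close()

  scanner := bufio.NewScanner(inputFile)
  for scanner.Scan() {
    handle(scanner.Text())
  }
  return scanner.Err()
}

func main() {
  if len(os.Args) != 2 {
    fmt.Println("Usage: line_processor <file_path>")
    return
  }
  if err := ProcessLines(os.Args[1], func(s string) {
    fmt.Println(s)
  }); err != nil {
    fmt.Println("Error while processing file", err)
  }
}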

Friday, March 20, 2015

QBETS: A Time Series Analysis and Forecasting Method

Today I’m going to share some details on an analytics technology I’ve been using for my research.
QBETS (Queue Bounds Estimation from Time Series) is a non-parametric time series analysis method. The basic idea behind QBETS is to analyze a time series and predict its p-th percentile, where p is a user-specified parameter. QBETS learns from the existing data points in the input time series, and estimates a p-th percentile value such that the next data point in the time series has a probability of 0.01p of being less than or equal to the estimated value.
For example, suppose we have the following input time series, and we wish to predict the 95th percentile of it:

A_0, A_1, A_2, ..., A_n

If QBETS predicts the value Q as the 95th percentile, we can say that A_{n+1} (the next data point that will be added to the time series by the generating process) has a 95% chance of being less than or equal to Q.

P(A_{n+1} <= Q) = 0.01p (= 0.95 in this example)

Since QBETS cannot determine percentile values exactly, but must estimate them, it uses another parameter c (0 < c < 1) as an upper confidence bound on the estimated values. That is, if QBETS were used to estimate the p-th percentile of a time series with upper confidence c, it would overestimate the true p-th percentile with a probability of 1 - c. For instance, if c = 0.05, then QBETS will generate predictions that overestimate the true p-th percentile 95% of the time. We primarily use the parameter c as a means of controlling how conservative we want QBETS to be when predicting percentiles.
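To make the notion of a percentile prediction concrete, here's a naive empirical percentile calculation sketched in Go. This is only the baseline idea, not QBETS itself, which layers the confidence bound and change point detection described here on top of such estimates:

package main

import (
  "fmt"
  "math"
  "sort"
)

// percentile returns the naive empirical p-th percentile of the given
// observations: the smallest observed value v such that at least p% of
// the observations are less than or equal to v.
func percentile(observations []float64, p float64) float64 {
  if len(observations) == 0 {
    return math.NaN()
  }
  sorted := append([]float64(nil), observations...)
  sort.Float64s(sorted)
  idx := int(math.Ceil(p/100*float64(len(sorted)))) - 1
  if idx < 0 {
    idx = 0
  }
  return sorted[idx]
}

func main() {
  series := []float64{7, 8, 7, 7, 9, 8, 7, 7}
  fmt.Println(percentile(series, 95)) // prints 9
}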
QBETS also supports a technique known as change point detection. To understand what this means, let’s look at the following input time series.

7, 8, 7, 7, 9, 8, 7, 7, 15, 15, 16, 14, 16, 17, 15

Here we see a sudden shift in the values after the first 8 data points: the individual values have jumped from the 7-9 range to the 14-17 range. QBETS detects such change points in the time series, and discards the data points that precede them. This is necessary to make sure that the predictions are not influenced by old historical values that are no longer relevant to the time series generating process.
The paper that originally introduced QBETS used it to predict scheduling delays in batch queuing systems for supercomputers and other HPC systems. Over the years, researchers have used QBETS with a wide range of datasets, and it has produced positive results in almost all cases. Lately, I have been using QBETS to predict API response times by analyzing historical API performance data. Again, the results have been quite promising.

To learn more about QBETS, go through the paper or contact the authors.

Sunday, January 11, 2015

Creating Eucalyptus Machine Images from a Running VM

I often use the Eucalyptus private cloud platform for my research. Very often I need to start Linux VMs in Eucalyptus and install a whole stack of software on them. This involves a lot of repetitive work, so to save time I prefer creating machine images (EMIs) from fully configured VMs. This post outlines the steps one should follow to create an EMI from a VM running in Eucalyptus (tested on Ubuntu Lucid and Precise VMs).

Step 1: SSH into the VM running in Eucalyptus, if you haven't already.

Step 2: Run euca-bundle-vol command to create an image file (snapshot) from the VM's root file system.
euca-bundle-vol -p root -d /mnt -s 10240
Here "-p" is the name you wish to give to the image file. "-s" is the size of the image in megabytes. In the above example, this is set to 10GB, which also happens to be the largest acceptable value for "-s" argument. "-d" is the directory in which the image file should be placed. Make sure this directory has enough free space to accommodate the image size specified in "-s". 
This command may take several minutes to execute. For a 10GB image, it may take around 3 to 8 minutes. When completed, check the contents of the directory specified in argument "-d". You will see an XML manifest file and a number of image part files in there.

Step 3: Upload the image file to the Eucalyptus cloud using the euca-upload-bundle command.
euca-upload-bundle -b my-test-image -m /mnt/root.manifest.xml
Here "-b" is the name of the bucket (in Walrus key-value store) to which the image file should be uploaded. You don't have to create the bucket beforehand. This command will create the bucket if it doesn't already exist. "-m" should point to the XML manifest file generated in the previous step.
This command requires certain environment variables to be exported (primarily access keys and certificate paths). The easiest way to do that is to copy your eucarc file and the associated keys into the VM and source the eucarc file into the environment.
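For example, assuming the credentials were copied into a directory named creds on the VM (the directory name is arbitrary):
cd creds
source eucarc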
The euca-upload-bundle command may also take several minutes to complete. At the end, it will output a string of the form "bucket-name/manifest-file-name".

Step 4: Register the newly uploaded image file with Eucalyptus.
euca-register my-test-image/root.manifest.xml
The only parameter required here is the "bucket-name/manifest-file-name" string returned from the previous step. I've noticed that in some cases, running this command from the VM in Eucalyptus doesn't work (you will get an error saying 404 not found). In that case you can simply run the command from somewhere else -- somewhere outside the Eucalyptus cloud. If all goes well, the command will return with an EMI ID. At this point you can launch instances of your image using the euca-run-instances command.