Tuesday, May 5, 2015

Parsing Line-Oriented Text Files Using Go

The following example demonstrates several features of Go: reading a file line by line (with error handling), deferred statements, and higher-order functions.
package main

import (
 "bufio"
 "fmt"
 "os"
)

func ParseLines(filePath string, parse func(string) (string, bool)) ([]string, error) {
  inputFile, err := os.Open(filePath)
  if err != nil {
    return nil, err
  }
  defer inputFile.Close()

  scanner := bufio.NewScanner(inputFile)
  var results []string
  for scanner.Scan() {
    if output, add := parse(scanner.Text()); add {
      results = append(results, output)
    }
  }
  if err := scanner.Err(); err != nil {
    return nil, err
  }
  return results, nil
}

func main() {
  if len(os.Args) != 2 {
    fmt.Println("Usage: line_parser <input_file>")
    return
  }

  lines, err := ParseLines(os.Args[1], func(s string) (string, bool) {
    return s, true
  })
  if err != nil {
    fmt.Println("Error while parsing file", err)
    return
  }

  for _, l := range lines {
    fmt.Println(l)
  }
}
The ParseLines function takes a path (filePath) to an input file and a function (parse) that is applied to each line read from that file. The parse function returns a (string, bool) pair, where the bool indicates whether the string should be added to the final result of ParseLines. The example above passes a parse function that accepts every line unchanged, so the program simply reads and prints all the lines of the input file.
The caller can inject more sophisticated transformation and filtering logic into ParseLines via the parse function. The following example invocation drops every line that does not begin with the prefix "[valid] " and extracts the third field from each remaining line (assuming a simple whitespace-separated line format).
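// Requires "strings" in the import list.
lines, err := ParseLines(os.Args[1], func(s string) (string, bool) {
   if !strings.HasPrefix(s, "[valid] ") {
     return "", false
   }
   fields := strings.Fields(s)
   if len(fields) < 3 {
     return "", false // skip short lines rather than panic on fields[2]
   }
   return fields[2], true
})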
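Because the parse callback is an ordinary function value, it can also be given a name and exercised without touching a file. The following is a minimal sketch of that idea. The name thirdFieldOfValid and the sample lines are made up for illustration, and the length check is an addition in this sketch that skips lines with fewer than three fields.

```go
package main

import (
	"fmt"
	"strings"
)

// thirdFieldOfValid is a hypothetical named version of the parse callback.
// It accepts only lines that start with "[valid] " and returns their third
// whitespace-separated field. Lines with fewer than three fields are skipped.
func thirdFieldOfValid(s string) (string, bool) {
	if !strings.HasPrefix(s, "[valid] ") {
		return "", false
	}
	fields := strings.Fields(s)
	if len(fields) < 3 {
		return "", false
	}
	return fields[2], true
}

func main() {
	// Made-up sample lines, not taken from any real input file.
	samples := []string{
		"[valid] alice 42 extra",
		"[invalid] bob 17",
		"[valid] short",
	}
	for _, s := range samples {
		if out, ok := thirdFieldOfValid(s); ok {
			fmt.Println(out) // prints 42 for the first sample only
		}
	}
}
```

A named function like this can be passed directly as the second argument of ParseLines in place of the inline function literal.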
A function like ParseLines is suitable for parsing small to moderately large files. For very large inputs, however, it can use a lot of memory, because it accumulates every accepted line in the results slice before returning.
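One way around the memory concern is to hand each line to a callback as soon as it is read, instead of collecting the results. The sketch below illustrates that approach and is not part of the original example. EachLine and countValid are hypothetical names, and the input comes from a string only so that the sketch runs without an input file.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// EachLine is a hypothetical streaming counterpart to ParseLines. It calls
// handle once per line and stores no results, so memory use does not grow
// with the number of lines. A non-nil error from handle stops the scan.
func EachLine(r io.Reader, handle func(string) error) error {
	scanner := bufio.NewScanner(r)
	for scanner.Scan() {
		if err := handle(scanner.Text()); err != nil {
			return err
		}
	}
	return scanner.Err()
}

// countValid counts the lines in text that start with "[valid] ".
// It reads from a string only to keep this sketch self-contained.
func countValid(text string) (int, error) {
	count := 0
	err := EachLine(strings.NewReader(text), func(s string) error {
		if strings.HasPrefix(s, "[valid] ") {
			count++
		}
		return nil
	})
	return count, err
}

func main() {
	n, err := countValid("[valid] a 1\n[invalid] b 2\n[valid] c 3\n")
	if err != nil {
		fmt.Println("Error while reading input", err)
		return
	}
	fmt.Println(n) // prints 2
}
```

Since EachLine takes an io.Reader, the file returned by os.Open can be passed to it directly, with the same deferred Close as in ParseLines.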
