POLYGLOT PROGRAMMING

Better poly than sorry!

Merging ZSH history files from different hosts with Scala

Learning Scala - along with Groovy, Java and JVM ecosystem - was fun!

Last updated on:

NOTE: As usual, all the code is in a GitHub repository.

Motivation

For the past two years I've been dabbling in DIY home automation. I ended up with a bunch of Raspberry PIs[1] all over the house, all running under Linux and connected via WiFi. They service some sensors and cameras, and show current date and time (among other dashboard-y things, like calendar or weather forecast) on a few displays around the house.

I have a few Ansible[2] playbooks set up for running things like package- and system updates, but I still often ssh to the PIs for one-off tasks (in the beginning they all look like one-offs...) To do this comfortably, I installed ZSH[3] - my shell of choice - on the PIs, using the same .zshrc config everywhere. Just a handful of ifs was enough to make my config portable, which was a pleasant surprise.

One of the CLI tools I have set up in my .zshrc is fzf[4], which I use as a history search mechanism, replacing the default one under Control + r. It's incredibly useful, as it lets you narrow down the search incrementally and interactively, with fuzzy matching, which makes arriving at the intended command much faster than with the built-in search. You could say I got addicted to the ease of both browsing and searching through the shell history in an interactive manner. However, no matter what search method you use, if something's not there, you won't find it! That sounds trivial, but it connects to the topic of this post: how to merge ZSH history files.

The use case should be clear by this point: very often, whatever I did on one PI, I would need to do, sooner or later, on one or two others. In such cases I, reflexively, tried to search for the needed command in the history of the current shell, which never saw that command... Obviously, I couldn't find it.

Going through a few minutes of frustration a couple times was enough for me to start thinking about an automatic solution. How nice would it be if something gathered the histories from all the hosts and merged them into a single one, I thought.

Why Scala? Groovy? JVM?

As usual: by accident. It just so happened that I was learning Scala (and JVM in general) at the time for work, so I decided to use the urge to automate as a material for practice. As many of us do all the time, I thought, how hard could it be? and started setting up a project, which you can see on GitHub. In this post, I want to highlight nice things about Scala that I learned in the process.

Project setup using Gradle

I'm not against the use of Scala build tool, sbt. I chose Gradle[5] simply because I had some prior experience with it, from my previous JVM project. That project had parts written in Java, Groovy, and Scala, and Gradle was the first tool we found which handled this case without tons of boilerplate code. Gradle has plugins for nearly everything, lightweight config syntax, and good performance. When starting the current project, I simply copy-pasted the build.gradle from the previous one, which was the fastest way to get started. The file looks roughly like this (click to unfold):

plugins {
  id 'application'
  id 'scala'
}

application { mainClassName 'zsh.history.Main' }

sourceSets {
  main {
    scala { srcDirs = ['src/scala/'] }
  }
}

repositories { jcenter() }

dependencies {
  implementation 'com.github.pathikrit:better-files_2.13:3.8.0'
  implementation 'com.github.nscala-time:nscala-time_2.13:2.22.0'
  implementation 'org.scala-lang:scala-dist:2.13.1'
  implementation 'org.scala-lang.modules:scala-parser-combinators_2.13:1.1.2'
  implementation 'org.scala-lang.modules:scala-parallel-collections_2.13:0.2.0'
}

You execute tasks configured in build.gradle with gradle command, which works kind of like make (or gulp, or mix, etc.). For example, the plugin called 'application' defines a task called 'installDist' which compiles and packages the project, giving you a script which starts your application. To compile and run the project in one go, I ended up with a following script, run.sh:

1
2
3
4
5
6
#! /usr/bin/env bash

set -e

gradle installDist
./build/install/zsh-merge-hist/bin/zsh-merge-hist

It's worth noting that you can define your own tasks in build.gradle, using the full Groovy[6] language. Here's an example which creates scripts for running REPLs with the whole project on the CLASSPATH:

File mkfile(obj) { obj instanceof File ? obj : new File(obj.toString()) }
task makeScripts(type: DefaultTask) {
    String shebang = "#! /usr/bin/env sh\n"
    String root = projectDir.absolutePath
    String pre = ""
    File outFile

    String classpath = sourceSets.main.runtimeClasspath.toList()*.toString().join(":")
    try {
        String opts = mkfile(root + "/.java").text.split("\n").join(" ")
        pre += "\nexport JAVA_OPTS='$opts'\n\n"
    }
    catch(_){}

    outFile = mkfile("-scala")
    outFile.write shebang + pre + "scala -classpath ${classpath}\n"
    outFile.setExecutable(true, true)

    outFile = mkfile("-amm")
    outFile.write shebang + pre + "java -cp ${classpath}:/usr/local/bin/amm ammonite.Main\n"
    outFile.setExecutable(true, true)

    ArrayList<String> jshell_cp = classpath.split(":").findAll({
        // TODO: filter out all nonexistent dirs, not just the blacklisted ones
        !(it.contains("build/classes/java/main") ||
          it.contains("build/resources/main"))
    })
    outFile = mkfile("-jshell")
    outFile.write shebang + pre + "jshell --class-path ${jshell_cp.join(':')}\n"
    outFile.setExecutable(true, true)
}

Interactive shell is always nice to have, and one of the three, called Ammonite[7], is a Scala equivalent of IPython[8], and is a pleasure to work with - I used it extensively to experiment with unfamiliar libraries, for example.

At this point the setup is done. Let's start examining the implementation.

Fetching history files with SCP & Scala external commands

Recounting the problem: I have a number of hosts on the network, where I tend to log in via ssh - which means I have authorized_keys properly configured already. I would like the shell history, which is a plain text file, to be synchronized among all the hosts. To do this, I need to fetch the history files from the network, process them somehow and put the merged file back on the hosts.

In this situation, using scp[9] command seemed the easiest way of copying files across my network. It's not Scala-specific, but there's no need to reinvent the wheel and write custom networking code for something this simple. Fortunately, Scala has a simple DSL for running external programs built-in, in the sys.process namespace (see the docs.)

While the DSL is powerful - even chaining I/O of a sequence of programs (like what the | operator does in most shells) is possible - what I really needed was just a return code. If it's equal to 0, we know the transfer finished successfully; anything else signifies an error. To execute a command and get the exit code, you use ! method, implicitly added to Strings and Seqs:

object Transfer {
  import scala.sys.process._
  import better.files.File
  import Configuration.config

  type DownloadInfo = (String, File)
  type DownloadResult = Either[DownloadInfo, DownloadInfo]

  def downloadSingleFileFrom(host: String): DownloadResult = {
    val (from, path) = config.connectionData(host)
    val localHistFile = File(path)

    println(s"Transfer $from to $path")
    val returnCode = s"scp $from $path".!  // could be: Seq("scp", from, path).!
    if (returnCode != 0 || !localHistFile.exists) {
      // clear the empty or partial file, don't throw if it doesn't exist at all
      localHistFile.delete(swallowIOExceptions=true)
      Left((from, localHistFile))
    }
    else {
      Right((from, localHistFile))
    }
  }
  // ...
}

Downloading multiple files at a time with parallel collections

Most of my PIs are connected via WiFi, some of the older ones even use the 2.5Ghz network. The transfer is not slow enough to be irritating, but that's only if you need something from a single host. If you try downloading many files in sequence, the latency will add up to the level where it's noticeable.

Scala provides a wonderful, high-level construct for parallelizing processing: parallel collections. Basically, whenever you have code which maps (or folds) a function over a collection, you can transform said collection into a parallel one. Parallel collections implement map method which executes given function on a pool of threads. The result is another parallel collection, which you can transform back into a normal one, which will have the results of computations, in the correct order. The toList method which does this blocks and waits for all the items in all the threads to finish being processed. It looks like this:

  // ...
  def downloadHistoryFiles(allowMissing: Switch = Switch.T): List[File] = {
    require(config.hosts.length > 0)
    config.createDownloadDirectory()

    val downloads = config.hosts.par.map(downloadSingleFileFrom _).toList
    val (failures, successes) = downloads.partition(_.isLeft)
    if (!allowMissing && failures.nonEmpty) {
      throw new TransferError(failures map unwrapFailure)
    }
    successes.map(_.toOption.map(_._2).get)
  }
  // ...

The interesting part is in this expression:

    hosts.par.map(downloadSingleFileFrom _).toList

Which is roughly equivalent to the following Python code:

import os
from concurrent.futures import ThreadPoolExecutor

hosts = [ ... ]
futures = []
results = []

with ThreadPoolExecutor(max_workers=os.cpu_count()) as executor:
    for host in hosts:
        futures.append(executor.submit(downloadSingleFileFrom, host))
    for future in futures:
        results.append(future.result())

Parallel collections take care of all the bookkeeping behind the scenes, so that you can have a "scatter & gather"[10] concurrency and parallelism without having to do any scattering or gathering yourself.

One thing worth considering is if our use of parallel collections is safe. The question arises because calling external processes is very clearly a side-effect, and side-effects are generally bad fit for concurrency. In this case, however, we know that each application of downloadSingleFileFrom gets a file from a different server and saves that file under a path different from all the other calls. As there is nothing shared between calls, they are safe to execute concurrently.

Parsing ZSH history file format

ZSH has two formats of history files: simple and extended. The extended one, which I use, looks like this:

1
2
3
: 1538336195:0;history 1
: 1538336321:0;cat mix.exs
: 1538336321:0;find . -iname rye

Each history entry (note: not "each line", because entries can span multiple lines) starts with a colon and space - : - which is followed by the timestamp in typical UNIX format, then by time taken by the command to execute, and finally the command itself.

Commands spanning multiple lines are problematic - we can't just split the file contents on newlines to get the list of commands. Other than that, the format is not very complex and it can be parsed with a few regexes.

Personally, though, whenever I think about a few regexes - not just one or two - I tend to go for a more powerful and structured parsing tool. Scala provides a parser combinator library[11] with a DSL for defining the structure of the text to be parsed. Parser combinators are a functional way of encoding recursive descent parsing, and so are able to parse both contex-free and context-sensitive grammars; we don't need its full power here, but it's good to know we can deal with more complex formats with the same tool.

The parser is defined as an object which extends Parsers subclass, RegexParsers:

  import scala.util.parsing.combinator.RegexParsers

  type ResultData = List[(Long, Int, String)]
  type ParseResult = SimpleParser.ParseResult[ResultData]

  object SimpleParser extends RegexParsers {
    override def skipWhitespace = false

    def nl = "\n"
    def semi = ";"
    def colon = ":"
    def digits = "[0-9]+".r
    def noNewLine = "[^\n]+".r

    def start = nl ~ colon
    def elapsed = digits <~ semi
    def timestamp = start ~> " " ~> digits <~ colon
    def command = ( noNewLine | (not(start) ~> nl) ).+ <~ guard(start)

    def line = timestamp ~ elapsed ~ command ^^ {
      case ts ~ el ~ cmd => (ts.toLong, el.toInt, cmd.mkString.trim)
    }

    def lines = line.+
  }

The parser is an object which defines a series of methods returning Parser instances. There are implicit conversions defined for Strings and Regexex that convert them into Parsers - which is why the nl to noNewLine definitions work. The parsers themselves are defined as functions which take a stream as input and return a result along with the rest of the stream (to be parsed by following parsers.)

After defining the basic building blocks of the syntax, we define the grammar using parser combinators: methods ~>, <~, and ~. They all express sequencing, ie. the parser created by parser1 ~ parser2 will match whatever parser1 matches, then will try to match parser2. The difference between the three methods has to do with the results of parsing: the basic ~ operator means combine the results of both parsers into one, while ~> means discard results of the first parser, return whatever the second returns. The <~ operator discards the results of the second parser instead.

There is an alternative operator, spelled parser1 | parser2 - the same way as in regexps - which parses whatever either parser1 parses, or whatever parser2 parses. Another important combinator is parser.+, which parses repetitions of whatever parser parses. It puts the results in a List. Finally, we have guard(parser) combinator, which parses whatever parser parses, but doesn't advance the position in the input stream. This is known as lookahead assertion in regexes.

The last operator, parser ^^ fun, applies fun to the results of parser. Whatever fun returns becomes the new parsing result. Quoting the docs: If you've done yacc parsing, the left hand side of the ^^ corresponds to the grammar rule and the right hand side corresponds to the code generated by the rule. This is used for cleaning up and structuring parsing results, as you can see in the example, where the combined results of three parsers are pattern matched on and transformed into a single 3-tuple (lines 20-22).

To actually use the parser, you do something like this:

1
2
3
4
5
6
7
8
  def parseHistory(input: java.io.InputStream): Either[String, ResultData] = {
    val reader = new java.io.InputStreamReader(input, "UTF-8")
    SimpleParser.parse(SimpleParser.lines, reader) match {
      case SimpleParser.Success(lines, _) => Right(lines)
      case SimpleParser.Failure(msg, _) => Left(msg)
      case SimpleParser.Error(msg, _) => Left(msg)
    }
  }

As you can see, there are two failure modes: Failure and Error. Because I don't care about the distinction, I transform the result into Either, with Right meaning success. The ignored part of the patterns, _, is the mentioned rest of the input, which should be empty.

This would normally work, but the parseHistory, as it is, fails on some history files. What is happenning, and why?

(Un)Metafying ZSH history

Turns out ZSH escapes some characters in a special way when writing and reading history files. As far as I can tell, this mechanism is there to make sure characters special to the shell, like $, !, ~, etc., are not interpreted/executed by accident. It unfortunately causes a problem with string encoding; in Java this means you get the following exception when you try reading the file:

    jshell> var p = java.nio.file.FileSystems.getDefault().getPath("/home/cji/mgmnt/zsh_history")
    p ==> /home/cji/mgmnt/zsh_history
    jshell> var cs = java.nio.charset.StandardCharsets.UTF_8
    cs ==> UTF-8
    jshell> java.nio.file.Files.readAllLines(p,cs)
    |  Exception java.nio.charset.MalformedInputException: Input length = 1
    |        at CoderResult.throwException (CoderResult.java:274)
    |        at StreamDecoder.implRead (StreamDecoder.java:339)
    |        at StreamDecoder.read (StreamDecoder.java:178)
    |        at InputStreamReader.read (InputStreamReader.java:185)
    |        at BufferedReader.fill (BufferedReader.java:161)
    |        at BufferedReader.readLine (BufferedReader.java:326)
    |        at BufferedReader.readLine (BufferedReader.java:392)
    |        at Files.readAllLines (Files.java:3330)
    |        at (#3:1)

Now, ZSH mostly can handle UTF-8 text. Or rather, UTF is defined in such a way that it's mostly backward compatible: as long as a program doesn't do anything special with character codes above 127, it should be able to handle UTF-8 transparently. ZSH escaping, however, uses codes above 0x83 (that is 131 in decimal, 0b10000011 in binary) to encode its special characters. It does this without accounting for a variable-width encoding of UTF-8, breaking the encoding in the process.

The history file is written in the metafied - meaning escaped - format. To make it UTF8-clean, we need to reverse the escaping. In ZSH source, there's a function called unmetafy, which looks like this:

#define Meta 0x83

mod_export char *
unmetafy(char *s, int *len)
{
    char *p, *t;

    for (p = s; *p && *p != Meta; p++);
    for (t = p; (*t = *p++);)
	if (*t++ == Meta && *p)
	    t[-1] = *p++ ^ 32;
    if (len)
	*len = t - s;
    return s;
}

The metafication inserts a special character, Meta, before special character, which is then XORed with 32. As XOR is its own inverse, unmetafying simply removes every Meta character encountered, and XORs the following character with 32 again. The code is terse and efficient, as it does this in-place - no new string is allocated. My reimplementation in Scala is less efficient, as it creates a copy of the input string, but it's not really an issue, given the amount of RAM available vs. the lenght of the history file. For reference, currently my history file is 2.4Mb, while my computer has 32Gb of RAM... Anyway, in Scala it looks like this:

object Unmetafy {
  val Meta = 0x83.toByte  // On the JVM bytes can only be signed, so numbers
                          // above 127 need to be be converted
  def unmetafy(file: File): String = unmetafy(file.byteArray)

  def unmetafy(bytes: Array[Byte]): String = {
    val it = bytes.iterator
    val out = new ArrayBuffer[Byte](bytes.length)
    while (it.hasNext) {
      val byte = it.next()
      if (byte == Meta)
        out.addOne((it.next() ^ 32).toByte)
      else
        out.addOne(byte)
    }
    new String(out.toArray, "UTF-8")
  }
}

One curious thing I learned, which is visible in this example, is that on the JVM there is no unsigned byte type. What it means is that values above 127 are interpreted as negative; this is known as two's complement, where the highest bit encodes a sign. What's important to note is that the value - ie. the bits set - stays the same, it's just the interpretation that changes. This means we can write the Array of Bytes back into a file without doing anything special - we only need to convert values over 127 to Byte to satisfy the type checker.

Dumping merged history back into a file

After parsing, we have a number of Lists of tuples. Many - almost all, actually - of the tuples are duplicates, which we need to remove. I do this by joining all the lists, then sorting by timestamp (first element of a tuple), then removing duplicates, and finally removing repeating commands (third element of tuples) from the list. It looks like this:

object Transform {
  import better.files.File
  import Parsing.ResultData
  import Configuration.config

  def removeDuplicates(lines: ResultData): ResultData =
    lines.sortBy(_._1).distinct

  def removeRepeatedCommands(lines: ResultData): ResultData =
    lines.reverse.distinctBy(_._3).reverse

  def processHistoryFiles(): ResultData =
    processHistoryPaths(config.hosts.map(config.getPathForHost))

  def processHistoryFiles(files: List[File]): ResultData = {
    val parsedFiles =
      files.par.map({ dest =>
        println(s"Parsing ${dest}...")
        val inputString = Unmetafy.unmetafy(dest)
        val Right(parsed) = Parsing.parseHistory(inputString): @unchecked
        parsed
      })
    removeRepeatedCommands(removeDuplicates(List.concat(parsedFiles.toList: _*)))
  }
}

As you can see, parsing also happens in parallel, thanks to parallel collections. The main function, processHistoryFiles, returns Parsing.ResultData, which is an alias for a list of triples: List[(Long, Int, String)].

What's left is just dumping the results back into a file:

object Dumping {
  import better.files._
  import Parsing.ResultData
  import Configuration.config

  def renderLine(ts: Long, rt: Int, cmd: String) =
    s": $ts:$rt;$cmd\n"

  def dumpResultsToFile(res: ResultData): Unit =
    dumpResultsToFile(res, config.getPathForHost("merged"))

  def dumpResultsToFile(res: ResultData, path: String): Unit = {
    println(s"Dumping parsed data to ${path}")
    dumpResults(res, File(path).newOutputStream)
  }

  def dumpResults(res: ResultData, out: java.io.OutputStream = System.out): Unit = {
    Using.resource(out) { out =>
      for( (ts, elapsed, command) <- res ) {
        out.write(renderLine(ts, elapsed, command).getBytes("UTF-8"))
      }
    }
  }
}

The code here is straightforward, one thing worth noting is line 18 - the Using.resource() construct[12]. Scala doesn't have try-with-resources[13] which is used to always free, or close, a resource, when control exits from a block - both normally and via exception. In Python, you would use with statement; Scala defines a higher-order function which does the same.

Configuration and reading JSON data

Instead of hardcoding a list of hosts, I decided to create a configuration file, .mergerc. It is a JSON file - mostly because I wanted to try using JSON with Scala. The configuration format looks like this:

1
2
3
4
5
{
  "hosts": ["vps1", "kuro", "shiro", "midori", "sora", "..."],
  "tempDir": "./tmp",
  "sourcePath": "mgmnt/zsh_history"
}

To read the configuration, I first created a case class, imaginatively named Configuration, which has the fields corresponding to JSON attributes:

1
2
3
4
5
6
7
case class Configuration(
  val hosts: List[String],
  val tempDir: String = "./tmp/",
  val sourcePath: String = "~/.zsh_history",
) {
  // ...
}

The class defines some helper methods which are not essential, so I elided them here. Then, there's a companion object, also called Configuration, which handles loading the JSON data and mapping them into Configuration instance. It looks like this:

import com.fasterxml.jackson.databind._
import com.fasterxml.jackson.module.scala.DefaultScalaModule

object Configuration {
  import better.files.File
  import org.joda.time.Instant

  val settingsFile = (File.home / ".mergerc")

  val mapper = new ObjectMapper()
  mapper.registerModule(DefaultScalaModule)

  def loadConfig = mapper.readValue(
    settingsFile.toJava,
    classOf[Configuration]
  )

  if (!settingsFile.exists)
    throw new RuntimeException("No configuration file found!")

  val config = loadConfig()
}

Well, it's still longer and more verbose than a simple JSON.parse() in JavaScript would be, but I have to say - it's not bad, given the statically typed nature of Scala. The magic apparently happens in the ObjectMapper object, which is customized for Scala with the .registerModule(DefaultScalaModule) call. I have no idea how that works - probably via reflection, but I don't know Scala meta-programming well enough to say for sure. Still, all you have to do is define a class with fields of correct types, and you get an instance of that class with correctly parsed values automatically. Not bad.

That's it!

That's basically all the code in the project - there are some additional utilities here and there, but I covered all the essential parts of the project. I use it regularly to merge ZSH history on a number of hosts, currently 8, and it works without a hitch. A single run takes around 30 seconds, while subsequent runs take around 7-8 seconds. Most of the time is spent downloading and uploading files, with parsing, sorting, and dumping merged history taking less than a second. This is for 9 hosts and a history file(s) of around 2.5Mb - not bad I think.

Above all else, though, I learned a lot, about Scala and the JVM, and string encodings on top of that. There were frustrating moments, but overall it was a pleasant and fun journey. I hope you enjoyed reading about it - let me know what you think about the post in the comments.

That's it - thanks for reading!

Comments