Merging ZSH history files from different hosts with Scala
Learning Scala - along with Groovy, Java and the JVM ecosystem - was fun!
Motivation
For the past two years I've been dabbling in DIY home automation. I ended up with a bunch of Raspberry PIs[1] all over the house, all running under Linux and connected via WiFi. They service some sensors and cameras, and show current date and time (among other dashboard-y things, like calendar or weather forecast) on a few displays around the house.
I have a few Ansible[2] playbooks set up for running things like package- and system updates, but I still often ssh to the PIs for one-off tasks (in the beginning they all look like one-offs...). To do this comfortably, I installed ZSH[3] - my shell of choice - on the PIs, using the same .zshrc config everywhere. Just a handful of ifs was enough to make my config portable, which was a pleasant surprise.
One of the CLI tools I have set up in my .zshrc is fzf[4], which I use as a
history search mechanism, replacing the default one under Control + r.
It's incredibly useful, as it lets you
narrow down the search incrementally and interactively, with fuzzy matching,
which makes arriving at the intended command much faster than with the
built-in search. You could say I got addicted to the ease of both browsing
and searching through the shell history in an interactive manner. However,
no matter what search method you use, if something's not there, you won't
find it! That sounds trivial, but it connects to the topic of this
post: how to merge ZSH history files.
The use case should be clear by this point: very often, whatever I did on one PI, I would need to do, sooner or later, on one or two others. In such cases I, reflexively, tried to search for the needed command in the history of the current shell, which never saw that command... Obviously, I couldn't find it.
Going through a few minutes of frustration a couple times was enough for me to start thinking about an automatic solution. How nice would it be if something gathered the histories from all the hosts and merged them into a single one, I thought.
Why Scala? Groovy? JVM?
As usual: by accident. It just so happened that I was learning Scala (and
the JVM in general) at the time for work, so I decided to use the urge to
automate as material for practice. As many of us do all the time, I
thought "how hard could it be?" and started setting up a project,
which you can see on GitHub. In this post, I want to highlight nice things
about Scala that I learned in the process.
Project setup using Gradle
I'm not against the use of the Scala build tool, sbt. I chose Gradle[5] simply because I had some prior experience with it from my previous JVM project. That project had parts written in Java, Groovy, and Scala, and Gradle was the first tool we found which handled this case without tons of boilerplate code. Gradle has plugins for nearly everything, lightweight config syntax, and good performance. When starting the current project, I simply copy-pasted the build.gradle from the previous one, which was the fastest way to get started. The file looks roughly like this:
You execute tasks configured in build.gradle with the gradle command, which works kind of like make (or gulp, or mix, etc.). For example, the plugin called 'application' defines a task called 'installDist' which compiles and packages the project, giving you a script which starts your application. To compile and run the project in one go, I ended up with the following script, run.sh:
It's worth noting that you can define your own tasks in build.gradle, using the full Groovy[6] language. Here's an example which creates scripts for running REPLs with the whole project on the CLASSPATH:
An interactive shell is always nice to have, and one of the three, called Ammonite[7], is a Scala equivalent of IPython[8] and is a pleasure to work with - I used it extensively to experiment with unfamiliar libraries, for example.
At this point the setup is done. Let's start examining the implementation.
Fetching history files with SCP & Scala external commands
Recounting the problem: I have a number of hosts on the network, where I
tend to log in via ssh - which means I have authorized_keys properly
configured already. I would like the shell history, which is a plain text
file, to be synchronized among all the hosts. To do this, I need to fetch
the history files from the network, process them somehow and put the merged
file back on the hosts.
In this situation, using the scp[9] command seemed the easiest way of copying files across my network. It's not Scala-specific, but there's no need to reinvent the wheel and write custom networking code for something this simple. Fortunately, Scala has a simple DSL for running external programs built-in, in the sys.process namespace (see the docs).
While the DSL is powerful - even chaining I/O of a sequence of programs (like what the | operator does in most shells) is possible - what I really needed was just a return code. If it's equal to 0, we know the transfer finished successfully; anything else signifies an error. To execute a command and get the exit code, you use the ! method, implicitly added to Strings and Seqs:
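As a minimal sketch of the ! method (not the project's actual code), the snippet below builds an scp command as a Seq and shows how an exit code is obtained. The names scpCommand and exitCodeOf, the host name and the paths are hypothetical, and the demo assumes a Unix-like system where the true and false commands exist.

```scala
import scala.sys.process._

object ExitCodeSketch {
  // Hypothetical helper: builds the argument list for copying one remote file.
  // Passing a Seq avoids any shell quoting issues, as no shell is involved.
  def scpCommand(host: String, remotePath: String, localPath: String): Seq[String] =
    Seq("scp", s"$host:$remotePath", localPath)

  // `!` runs the command, blocks until it exits, and returns its exit code.
  def exitCodeOf(command: Seq[String]): Int = command.!

  def main(args: Array[String]): Unit = {
    // `true` and `false` are stand-ins for a succeeding and a failing transfer.
    println(s"true  -> ${exitCodeOf(Seq("true"))}")
    println(s"false -> ${exitCodeOf(Seq("false"))}")
    println(scpCommand("pi1", ".zsh_history", "/tmp/pi1.history").mkString(" "))
  }
}
```

A real transfer would then be something like exitCodeOf(scpCommand(...)) == 0, with any non-zero value treated as a failed download.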
Downloading multiple files at a time with parallel collections
Most of my PIs are connected via WiFi, some of the older ones even use the 2.4GHz network. The transfer is not slow enough to be irritating, but that's only if you need something from a single host. If you try downloading many files in sequence, the latency will add up to the level where it's noticeable.
Scala provides a wonderful, high-level construct for parallelizing processing: parallel collections. Basically, whenever you have code which maps (or folds) a function over a collection, you can transform said collection into a parallel one. Parallel collections implement a map method which executes the given function on a pool of threads. The result is another parallel collection, which you can transform back into a normal one, which will have the results of computations, in the correct order. The toList method which does this blocks and waits for all the items in all the threads to finish being processed. It looks like this:
The interesting part is in this expression:
Which is roughly equivalent to the following Python code:
Parallel collections take care of all the bookkeeping behind the scenes, so that you can have a "scatter & gather"[10] concurrency and parallelism without having to do any scattering or gathering yourself.
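For comparison, here is a sketch of the scattering and gathering done by hand, using Futures from the standard library instead of parallel collections. This is roughly the bookkeeping that a parallel map hides. It is not the project's code: fakeDownload is a hypothetical stand-in for the real per-host download, and the 30-second timeout is an arbitrary assumption.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ScatterGatherSketch {
  // Hypothetical stand-in for a download: sleeps briefly, then returns an
  // exit code (0 for a non-empty host name, 1 otherwise).
  def fakeDownload(host: String): Int = {
    Thread.sleep(50)
    if (host.nonEmpty) 0 else 1
  }

  def downloadAll(hosts: List[String]): List[(String, Int)] = {
    // Scatter: start one Future per host on the global thread pool.
    val pending = hosts.map(host => Future(host -> fakeDownload(host)))
    // Gather: wait for all of them; results keep the order of `hosts`.
    Await.result(Future.sequence(pending), 30.seconds)
  }
}
```

With parallel collections the two steps collapse into a single map call on the parallel version of the list, followed by toList.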
One thing worth considering is if our use of parallel collections
is safe. The question arises because calling external processes
is very clearly a side-effect, and side-effects are generally a bad fit for
concurrency. In this case, however, we know that each application of
downloadSingleFileFrom gets a file from a different server and saves
that file under a path different from all the other calls. As there is
nothing shared between calls, they are safe to execute concurrently.
Parsing ZSH history file format
ZSH has two formats of history files: simple and extended. The extended one, which I use, looks like this:
Each history entry (note: not "each line", because entries can span multiple lines) starts with a colon and space - ': ' - which is followed by the timestamp in typical UNIX format, then by the time taken by the command to execute, and finally the command itself.
Commands spanning multiple lines are problematic - we can't just split the file contents on newlines to get the list of commands. Other than that, the format is not very complex and it can be parsed with a few regexes.
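To make the format concrete, here is a minimal regex-based sketch (not the approach the project ends up using). It assumes the extended format ': <start timestamp>:<elapsed seconds>;<command>' and treats any line that does not begin with such a header as a continuation of the previous command. That is a heuristic: a continuation line that happens to look like a header would be misread. The object and method names are hypothetical.

```scala
object HistoryFormatSketch {
  // (?s) lets `.` match newlines, so a multi-line command stays in one group.
  private val Entry = """(?s): (\d+):(\d+);(.*)""".r

  def parse(text: String): List[(Long, Int, String)] =
    // Split only at newlines that are directly followed by a new entry header.
    text.split("""\n(?=: \d+:\d+;)""").toList.collect {
      case Entry(timestamp, elapsed, command) =>
        (timestamp.toLong, elapsed.toInt, command.stripSuffix("\n"))
    }
}
```

For example, a file containing ': 1600000000:0;ls -la' on one line, followed by a second entry whose command continues onto the next line, yields two tuples, the second one with the newline preserved inside the command.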
Personally, though, whenever I think about a few regexes - not just one or two - I tend to go for a more powerful and structured parsing tool. Scala provides a parser combinator library[11] with a DSL for defining the structure of the text to be parsed. Parser combinators are a functional way of encoding recursive descent parsing, and so are able to parse both context-free and context-sensitive grammars; we don't need their full power here, but it's good to know we can deal with more complex formats with the same tool.
The parser is defined as an object which extends a Parsers subclass, RegexParsers:
The parser is an object which defines a series of methods returning Parser instances. There are implicit conversions defined for Strings and Regexes that convert them into Parsers - which is why the nl to noNewLine definitions work. The parsers themselves are defined as functions which take a stream as input and return a result along with the rest of the stream (to be parsed by following parsers).
After defining the basic building blocks of the syntax, we define the grammar using parser combinators: methods ~>, <~, and ~. They all express sequencing, i.e. the parser created by parser1 ~ parser2 will match whatever parser1 matches, then will try to match parser2. The difference between the three methods has to do with the results of parsing: the basic ~ operator means "combine the results of both parsers into one", while ~> means "discard results of the first parser, return whatever the second returns". The <~ operator discards the results of the second parser instead.
There is an alternative operator, spelled parser1 | parser2 - the same way as in regexps - which parses whatever either parser1 parses, or whatever parser2 parses. Another important combinator is parser.+, which parses one or more repetitions of whatever parser parses. It puts the results in a List. Finally, we have the guard(parser) combinator, which parses whatever parser parses, but doesn't advance the position in the input stream. This is known as a lookahead assertion in regexes.
The last operator, parser ^^ fun, applies fun to the results of parser. Whatever fun returns becomes the new parsing result. Quoting the docs: "If you've done yacc parsing, the left hand side of the ^^ corresponds to the grammar rule and the right hand side corresponds to the code generated by the rule." This is used for cleaning up and structuring parsing results, as you can see in the example, where the combined results of three parsers are pattern matched on and transformed into a single 3-tuple (lines 20-22).
To actually use the parser, you do something like this:
As you can see, there are two failure modes: Failure and Error. Because I don't care about the distinction, I transform the result into Either, with Right meaning success. The ignored part of the patterns, _, is the mentioned "rest of the input", which should be empty.
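The following is a simplified sketch of such a parser and its use, not the project's grammar. It assumes the scala-parser-combinators library[11] is on the classpath and handles single-line entries only; the names number, command, entry and entries are hypothetical. It does show ~>, <~, ~, ^^ and the conversion of the result into Either.

```scala
import scala.util.parsing.combinator.RegexParsers

object EntryParserSketch extends RegexParsers {
  // Newlines separate entries here, so whitespace must not be skipped silently.
  override def skipWhitespace: Boolean = false

  // Implicit conversions turn these regexes and strings into Parser[String].
  def number: Parser[String] = """\d+""".r
  def command: Parser[String] = """[^\n]*""".r
  def nl: Parser[String] = "\n"

  // `~>` drops the literal prefixes, `~` keeps both sides, `^^` reshapes.
  def entry: Parser[(Long, Int, String)] =
    (": " ~> number) ~ (":" ~> number) ~ (";" ~> command) ^^ {
      case timestamp ~ elapsed ~ cmd => (timestamp.toLong, elapsed.toInt, cmd)
    }

  // `<~` drops the trailing newline; rep1(p) is what p.+ expands to.
  def entries: Parser[List[(Long, Int, String)]] = rep1(entry <~ nl)

  def parseHistory(input: String): Either[String, List[(Long, Int, String)]] =
    parseAll(entries, input) match {
      case Success(result, _) => Right(result)
      case Failure(message, _) => Left(message)
      case Error(message, _)   => Left(message)
    }
}
```

Calling EntryParserSketch.parseHistory on well-formed input gives a Right with the list of tuples, while anything that doesn't match the grammar ends up as a Left carrying the parser's error message.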
This would normally work, but parseHistory, as it is, fails on some history files. What is happening, and why?
(Un)Metafying ZSH history
Turns out ZSH escapes some characters in a special way when writing and reading history files. As far as I can tell, this mechanism is there to make sure characters special to the shell, like $, !, ~, etc., are not interpreted/executed by accident. It unfortunately causes a problem with string encoding; in Java this means you get the following exception when you try reading the file:
Now, ZSH mostly can handle UTF-8 text. Or rather, UTF-8 is defined in such a way that it's mostly backward compatible: as long as a program doesn't do anything special with character codes above 127, it should be able to handle UTF-8 transparently. ZSH escaping, however, uses codes above 0x83 (that is 131 in decimal, 0b10000011 in binary) to encode its special characters. It does this without accounting for the variable-width encoding of UTF-8, breaking the encoding in the process.
The history file is written in the "metafied" - meaning escaped - format. To make it UTF8-clean, we need to reverse the escaping. In ZSH source, there's a function called unmetafy, which looks like this:
The metafication inserts a special character, Meta, before each special character, which is then XORed with 32. As XOR is its own inverse, unmetafying simply removes every Meta character encountered, and XORs the following character with 32 again. The code is terse and efficient, as it does this in-place - no new string is allocated. My reimplementation in Scala is less efficient, as it creates a copy of the input string, but it's not really an issue, given the amount of RAM available vs. the length of the history file. For reference, currently my history file is 2.4MB, while my computer has 32GB of RAM... Anyway, in Scala it looks like this:
One curious thing I learned, which is visible in this example, is that on the JVM there is no unsigned byte type. What it means is that values above 127 are interpreted as negative; this is known as two's complement, where the highest bit encodes a sign. What's important to note is that the value - i.e. the bits set - stays the same, it's just the interpretation that changes. This means we can write the Array of Bytes back into a file without doing anything special - we only need to convert values over 127 to Byte to satisfy the type checker.
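A minimal sketch of the unmetafying step described above, written from the description rather than taken from the project: it copies the input, drops every Meta byte and XORs the byte that follows with 32. The Meta value of 0x83 follows the ZSH source; the object and method names are hypothetical.

```scala
object UnmetafySketch {
  // 0x83 does not fit into a signed Byte literal, hence the explicit toByte.
  // The bits stay the same; the JVM just reads them as -125.
  val Meta: Byte = 0x83.toByte

  def unmetafy(bytes: Array[Byte]): Array[Byte] = {
    val out = Array.newBuilder[Byte]
    var i = 0
    while (i < bytes.length) {
      if (bytes(i) == Meta && i + 1 < bytes.length) {
        // Skip the Meta marker and restore the escaped byte.
        out += (bytes(i + 1) ^ 32).toByte
        i += 2
      } else {
        out += bytes(i)
        i += 1
      }
    }
    out.result()
  }
}
```

For instance, the byte sequence 0x61 0x83 0xA5 0x62 becomes 0x61 0x85 0x62, since 0xA5 XOR 0x20 is 0x85. The result can then be decoded as UTF-8 with new String(bytes, "UTF-8").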
Dumping merged history back into a file
After parsing, we have a number of Lists of tuples. Many - almost all, actually - of the tuples are duplicates, which we need to remove. I do this by joining all the lists, then sorting by timestamp (first element of a tuple), then removing duplicates, and finally removing repeating commands (third element of tuples) from the list. It looks like this:
As you can see, parsing also happens in parallel, thanks to parallel collections. The main function, processHistoryFiles, returns Parsing.ResultData, which is an alias for a list of triples: List[(Long, Int, String)].
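The merging steps can be sketched as follows. This is an illustration rather than the project's processHistoryFiles: it runs sequentially, and it assumes that "removing repeating commands" means dropping an entry whose command equals that of the entry kept just before it. MergeSketch and merge are hypothetical names.

```scala
object MergeSketch {
  type ResultData = List[(Long, Int, String)]

  def merge(histories: List[ResultData]): ResultData = {
    // Join all lists, sort by timestamp, drop exact duplicate tuples.
    val sorted = histories.flatten.sortBy(_._1).distinct
    // Assumption: only consecutive repeats of the same command are dropped.
    sorted.foldLeft(List.empty[(Long, Int, String)]) {
      case (acc @ (previous :: _), entry) if previous._3 == entry._3 => acc
      case (acc, entry) => entry :: acc
    }.reverse
  }
}
```

The fold builds the result in reverse, so the head of the accumulator is always the most recently kept entry, which makes the comparison with the previous command cheap.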
What's left is just dumping the results back into a file:
The code here is straightforward; one thing worth noting is line 18 - the Using.resource() construct[12]. Scala doesn't have try-with-resources[13], which is used to always free, or close, a resource when control exits from a block - both normally and via exception. In Python, you would use the with statement; Scala defines a higher-order function which does the same.
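A small sketch of Using.resource in this role, assuming Scala 2.13 or newer (where scala.util.Using is part of the standard library). It writes entries in the extended history format as plain UTF-8 text and does not re-apply ZSH's metafication, so it is an illustration of the construct and not a drop-in replacement for the project's code. DumpSketch and dump are hypothetical names.

```scala
import java.io.PrintWriter
import java.nio.file.{Files, Path}
import scala.util.Using

object DumpSketch {
  def dump(entries: List[(Long, Int, String)], target: Path): Unit =
    // The writer is closed when the block exits, normally or via exception.
    Using.resource(new PrintWriter(Files.newBufferedWriter(target))) { out =>
      entries.foreach { case (timestamp, elapsed, command) =>
        out.print(s": $timestamp:$elapsed;$command\n")
      }
    }
}
```

Using.resource rethrows any exception from the block after closing the resource; the sibling Using(...) { ... } form wraps the outcome in a Try instead.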
Configuration and reading JSON data
Instead of hardcoding a list of hosts, I decided to create a configuration file, .mergerc. It is a JSON file - mostly because I wanted to try using JSON with Scala. The configuration format looks like this:
To read the configuration, I first created a case class, imaginatively named Configuration, which has the fields corresponding to JSON attributes:
The class defines some helper methods which are not essential, so I elided them here. Then, there's a companion object, also called Configuration, which handles loading the JSON data and mapping them into a Configuration instance. It looks like this:
Well, it's still longer and more verbose than a simple JSON.parse() in JavaScript would be, but I have to say - it's not bad, given the statically typed nature of Scala. The magic apparently happens in the ObjectMapper object, which is customized for Scala with the .registerModule(DefaultScalaModule) call. I have no idea how that works - probably via reflection, but I don't know Scala meta-programming well enough to say for sure. Still, all you have to do is define a class with fields of correct types, and you get an instance of that class with correctly parsed values automatically. Not bad.
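For illustration, here is a minimal sketch of this pattern, assuming the Jackson databind and jackson-module-scala libraries are on the classpath. The class name SketchConfig and its fields (hosts, historyPath) are hypothetical; the real .mergerc schema may well differ.

```scala
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule

// Hypothetical fields, chosen only to have something to map onto.
case class SketchConfig(hosts: List[String], historyPath: String)

object SketchConfig {
  // DefaultScalaModule teaches Jackson about case classes and Scala collections.
  private val mapper = new ObjectMapper().registerModule(DefaultScalaModule)

  def fromJson(json: String): SketchConfig =
    mapper.readValue(json, classOf[SketchConfig])
}
```

Given a JSON object with a "hosts" array of strings and a "historyPath" string, fromJson returns a populated SketchConfig; reading from a file would only require passing a java.io.File to readValue instead of a String.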
That's it!
That's basically all the code in the project - there are some additional utilities here and there, but I covered all the essential parts of the project. I use it regularly to merge ZSH history on a number of hosts, currently 8, and it works without a hitch. A single run takes around 30 seconds, while subsequent runs take around 7-8 seconds. Most of the time is spent downloading and uploading files, with parsing, sorting, and dumping merged history taking less than a second. This is for 9 hosts and a history file(s) of around 2.5MB - not bad I think.
Above all else, though, I learned a lot, about Scala and the JVM, and string encodings on top of that. There were frustrating moments, but overall it was a pleasant and fun journey. I hope you enjoyed reading about it - let me know what you think about the post in the comments.
That's it - thanks for reading!
[1] https://www.raspberrypi.org/products/
[2] https://www.ansible.com/
[3] https://www.zsh.org/
[4] https://github.com/junegunn/fzf
[5] https://gradle.org/
[6] https://groovy-lang.org/
[7] https://ammonite.io/
[8] https://ipython.readthedocs.io/en/stable/
[9] https://linux.die.net/man/1/scp
[10] https://medium.com/@guy.grigsby/go-scatter-gather-pattern-a439c70afe16
[11] https://github.com/scala/scala-parser-combinators
[12] https://www.baeldung.com/scala/try-with-resources
[13] https://docs.oracle.com/javase/tutorial/essential/exceptions/tryResourceClose.html