Parsing JSON in Forty Lines of Awk

125 points by thefilmore 3 days ago

latexr 2 days ago

> Another option is to use Python, which is ubiquitous enough that it can be expected to be installed on virtually every machine

Not on macOS. You can do it easily by just invoking /usr/bin/python3, but you’ll get a (short and non-threatening) dialog to install the Xcode Command-Line Developer Tools. For macOS environments, JavaScript for Automation (i.e. JXA, i.e /usr/bin/osascript -l JavaScript) is a better choice.

However, since Sequoia, jq is now installed by default on macOS.

ameliaquining 2 days ago

I feel like the takeaway here is that jq should probably be considered an indispensable part of the modern shell environment. Because without it, you can't sensibly deal with JSON in a shell script, and JSON is everywhere now. It's cool that you can write a parser in awk but the point of this kind of scripting language is to not have to do things like that.

HappMacDonald 2 days ago

One day I wanted to use a TAP parser for the Test Anything Protocol.

But I didn't want to be bogged down by dependencies.. so I didn't want to go nowhere near Python (and pyenv.. and anaconda.. and then probably having to dockerize that for some reason too..) nor nodeJS nor any of that.

Found a bash shell script to parse TAP written by ESR of all people. That sounds fine, I thought. Most everywhere has bash, and there are no other dependencies.

But it was slow. I mean.. painfully, ridiculously slow. Parsing like 400 lines of TAP took almost a minute.

That's when I did some digging and learned about what awk really is. A scripting language that's baked into the POSIX suite. I had heard vaguely about it beforehand, but never realized it had more power than it's sed/ed/cut/etc brethren in the suite.

So I coded up a TAP parser in awk and it went swimmingly, matched the ESR parser's feature set, and ran millions of lines in less than a second. Score! :D

theamk 2 days ago

For the record, "python-without-extra-dependencies" is a thing and a very nice one too. I always prefer it over awk.
Highly recommend to everyone - plenty of "batteries" included, like json parser, basic http client and even XML parser, and no venv/conda required. Very good forward compatibility. Fast (compared to bash).
- HappMacDonald a day ago
  
  That does sound compelling, but I frequently enough have to work on embedded systems (often running busybox) where python isn't installed and available R/W storage space is measured in tens or hundreds of Kb.
  I find that that is one more environment where awk scripting can get the job done, python/perl/php/etc just can't be introduced, bash can _sometimes_ get the job done if it doesn't have to spawn too many subprocesses, and C/other-compiled-options _might_ be able to help if I had some kind of build environment targeting the platform(s) in question and enough patience.
  I'll keep an eye out for python with no extra dependency options on the platforms that can handle that though.
  
  theamk a day ago
  
  Lua is pretty popular in the embedded devices I've been working on.. of course it's stdlib is tiny compared to python's though.
- wintergrasp 2 days ago
  
  Can you explain more or provide more information/links?
  
  stephenlf 2 days ago
  
  The Python Standard Library
  https://docs.python.org/3/library/index.html

twoodfin 2 days ago

Contemplating this, it’s too bad the Unix scripting ecosystem never evolved a tripartite symbiosis of ‘file‘, ‘lex‘, and ‘yacc‘, or similar tools.

That is, one tool to magically identify a file type, one to tokenize it based on that identification, one to correspondingly parse it. All in a streaming/pipe-friendly mode.

Would fit right in, other than the Unix prejudice (nonsensical from Day 0) for LF-separated text records as the “one true format”.

kristopolous 2 days ago

I guess ... I don't think it really gets you much unless it's 1-pass streaming otherwise we're dealing with entire input buffers and then we're just back to files.
You could argue using the vertical separator | is more syntactically graceful but then it's just a shell argument. There's quite a few radically different shells out there these days like xonsh, murex, and nushell so if simply arranging logic on the screen in a different syntax is what you're looking for then that's probably the way.
- twoodfin 2 days ago
  
  What I meant was something like the SAX streaming parse model disaggregated and broken into sniff / lex / parse phases all mediated by a modestly structured stream of the kind a tool like awk could process naturally.
  
  kristopolous 2 days ago
  
  Right okay I ran into a very similar limitation in Unix recently.
  It's really about multiplexed semantic routing.
  There's no great tools, abstractions, jargon or syscalls for it.
  I mean there's plenty for things like the networking stack, dbus, Wayland, and various audio stacks but they're all a pain to deal with.
  Nobody has really figured it out. The list inputs, list outputs, create bridges and taps, assign properties, which is what everybody does is complicated.
  I'm sure there's brilliant people that find this intuitive but for me it requires too much orchestration and feels brittle.
  People see this in microservices as well.
  There needs to be something as intuitive as the mouse drag (Raskin), Unix pipe (McIlroy) or drop down menu (Atkinson) to deal with this stuff.
  All these "obvious" things had to be invented. There's something dumb and obvious here nobody's cracked yet
  The primary problem is the concept of discovering such things has ossified and few people are experimenting on introducing new ones

chaps 3 days ago

Awk is great and this is a great post. But dang, awk really shoots itself so much with its lack of features that it so desperately needs!

Like: printing all but one column somewhere in the middle. It turns into long, long commands that really pull away from the spirit of fast fabrication unix experimentation.

jq and sql both have the same problem :)

thrwwy9234 3 days ago
```
  $ echo "one two three four five" | awk '{$3="";print}'
  one two  four five
```
- chaps 3 days ago
  
  Oh dang, that's good.
  
  shawn_w 2 days ago
  
  As long as you don't mind the extra space in the middle.
  
  chaps 2 days ago
  
  Often times I don't! Entirely depends on what I'm doing. #1 thing off the top of my head is to remove That One Column that's a bajillion characters long that makes exploratory analysis difficult.
  
  saghm 2 days ago
  
  I wonder if it's possible to encode the backspace character in the replacement string?
  
  ZoomZoomZoom 2 days ago
  
  No problem, there's \b.
  echo "one two three four five" | awk '{$3="\b"; print}'
  Inserts the backspace character (^H), which you can then remove with [global] substitution:
  awk '{$3="\b"; sub(/\32\b/, ""); print}'
  You can, of course, use an arbitrary sentinel value for the field to be deleted. Should work in gawk and BWK awk.
  
  fragmede 2 days ago
  
  if you do:
  sed 's/ / /g'
  
  lucb1e 2 days ago
  
  Or sticking with awk, I have this bash alias to remove excess whitespace that is just:
  awk '{$1=$1};1'
  
  fuzztester 2 days ago
  
  what does the 1 at the end do? make awk print all lines? I'm a bit rusty with my awk.
  
  thrwwy9234 2 days ago
  
  It’s a condition. 1 is true-ish, so it’s a condition that is always true. The default action in awk is print, so it’s the same as:
  '{$1=$1}1{print}'
  Or the same as
  '{$1=$1}{print}'
  Since the default condition is true. But 1 is shorter than {print}.
SoftTalker 3 days ago

> awk really shoots itself so much with its lack of features that it so desperately needs
Whence perl.
- librasteve 2 days ago
  
  or raku
  https://github.com/moritz/json/blob/master/lib/JSON/Tiny/Gra...
  
  saghm 2 days ago
  
  I suspect the rationale for Perl is that most Linux systems will probably have it installed already. Installing something you're familiar with is great when you can, but I'm guessing the awk script linked to here was picked more for its ubiquity than elegance.
  
  chaps 2 days ago
  
  Kinda, but not really. Of the infrastructures I've worked on, not a single one has been consistent in installing perl on 100% of hosts. The ones that get close are usually like that because one high up person really, really likes perl. And they send a lot of angry emails about perl not being installed.
  Within infrastructures where perl is installed on 95% of hosts, that 5% really bites you in the ass and leads to infrastructure rot very quickly. You're kinda stuck writing and maintaining two separate scripts to do the same thing.
  
  SoftTalker 2 days ago
  
  Same with Python. It's mostly available, but sometimes not.
  With Perl, I find that a base installation is almost always available, but many packages might not be.
  
  chaps 2 days ago
  
  I dunno about that. IME, python is much, much more universally installed on the hosts I've worked on. Sure, usually it's 2.7, but it's there! I've tended to work on rhel and debian hosts, with some fedora in the mix.
  (Once had a coworker reject a PR I wrote because I included a bash builtin in a deployment script. He said that python is more likely to be installed than bash, so we should not use bash. These debates are funny sometimes.)
  
  saghm 16 hours ago
  
  Interesting, in my experience perl ends up pulled in as a dependency for one thing or another most of the time, but I don't have that perception about Python. Maybe there's just something I use that pulls in perl without me realizing and it's biased my experience.
- heresie-dabord 2 days ago
  
  Using the Perl module JSON::PP, the programme json_pp is installed.
  The following command is handy for grepping the output:
  cat mydata.json | json_pp
toddm 2 days ago
Us old UNIX guys would likely go for cut for this sort of task:
```
     cut -d " " -f1-2,4-5 file.txt
```
where file.txt is:
```
     one two three four five
```
and the return is:
```
     one two four five
```
jcynix 3 days ago

>awk really shoots itself so much with its lack of features that it so desperately needs!
That's why I use Perl instead (besides some short one liners in awk, which in some cases are even shorter than the Perl version) and do my JSON parsing in Perl.
This
diff -rs a/ b/ | ask '/identical/ {print $4}' | xargs rm
is one of my often used awk one liners. Unless some filenames contain e.g. whitespace, then it's Perl again
- 8n4vidtmkvmk 2 days ago
  
  I've been using perl instead of sed because PCRE is just better and it's the same regex that PHP uses which I've been coding in for nearly 20 years. I still don't actually know perl, but apparently Gemini does. It wrote a particularly crazy find and replace for me. Never got around to using or learning awk. Only time I see it come up is when you want to parse some tab delimited output
- ptspts 3 days ago
  
  This is much safer: xargs -d '\n' rm -f --
  
  jcynix 3 days ago
  
  Sure, but my example was just that and I actually use /identical$/ as the pattern. Sorry for the typo.
  And I use this "historic" one liner only when I know about the contents of both directories. As soon as I need a "safer" solution I use a Perl script and pattern matching, as I said.
  
  poke646 2 days ago
  
  In this case the Perl one-liner would be conceptually identical, the same length, but more performant (no calling out to rm):
  diff -rs a/ b/ | perl -ane '/identical$/ && unlink $F[3]'
mauvehaus 3 days ago

...And once you get away from the most basic, standard set of features, the several awks in existence have diverging sets of additional features.
- chaps 3 days ago
  
  Things are already like that, friend! We have mawk, gawk and nawk. But it's fun to think about how we could improve our ideal tooling if we had a time machine.

chubot 3 days ago

JSON is not a friendly format to the Unix shell — it’s hierarchical, and cannot be reasonably split on any character

Yes, shell is definitely too weak to parse JSON!

(One reason I started https://oils.pub is because I saw that bash completion scripts try to parse bash in bash, which is an even worse idea than trying to parse JSON in bash)

I'd argue that Awk is ALSO too weak to parse JSON

The following code assumes that it will be fed valid JSON. It has some basic validation as a function of the parsing and will most likely throw an error if it encounters something strange, but there are no guarantees beyond that.

Yeah I don't like that! If you don't reject invalid input, you're not really parsing

---

OSH and YSH both have JSON built-in, and they have the hierarchical/recursive data structures you need for the common Python/JS-like API:

    osh-0.33$ var d = { date: $(date --iso-8601) }

    osh-0.33$ json write (d) | tee tmp.txt
    {
      "date": "2025-06-28"
    }

Parse, then pretty print the data structure you got:

    $ cat tmp.txt | json read (&x)

    osh-0.33$ = x
    (Dict)  {date: '2025-06-28'}

Create a JSON syntax error on purpose:

    osh-0.33$ sed 's/"/bad/"' tmp.txt | json read (&x)
    sed: -e expression #1, char 9: unknown option to `s'
      sed 's/"/bad/"' tmp.txt | json read (&x)
                                     ^~~~
    [ interactive ]:20: json read: Unexpected EOF while parsing JSON (line 1, offset 0-0: '')

(now I see the error message could be better)

Another example from wezm yesterday: https://mastodon.decentralised.social/@wezm/1147586026608361...

YSH has JSON natively, but for anyone interested, it would be fun to test out the language by writing a JSON parser in YSH

It's fundamentally more powerful than shell and awk because it has garbage-collected data structures - https://www.oilshell.org/blog/2024/09/gc.html

Also, OSH is now FASTER than bash, in both computation and I/O. This is despite garbage collection, and despite being written in typed Python! I hope to publish a post about these recent improvements

alganet 3 days ago

> Yes, shell is definitely too weak to parse JSON!
Parsing is a trivial, rejecting invalid input is trivial, the problem is representing the parsed content in a meaningful way.
> bash completion scripts try to parse bash in bash
You're talking about ble.sh, right? I investigated it as well.
I think they made some choices that eventually led to the parser being too complex, largely due to the problem of representing what was parsed.
> Also, OSH is now FASTER than bash, in both computation and I/O.
According to my tests, this is true. Congratulations!
- akinomyoga 2 days ago
  
  > I think they made some choices that eventually led to the parser being too complex, largely due to the problem of representing what was parsed.
  No, the complexity of the parser can be attributed to the incremental parsing. ble.sh implements an incremental parser where one can update only the necessary parts of the previous syntax tree when a part of the command line is modified. I'd probably use the same data structure (but better abstracted using classes) even if I could implement the parser in C or in higher-level languages.
  
  alganet 2 days ago
  
  That makes sense, thanks for clarifying it!
- chubot 2 days ago
  
  I was referring to the bash-completion project, the default on Debian/Ubuntu - https://github.com/scop/bash-completion/
  But yes, ble.sh also has a shell parser in shell, although it uses a state machine style that's more principled than bash regex / sed crap.
  ---
  Also, distro build systems like Alpine Linux and others tend to parse shell in shell (or with sed).
  They often need package metadata without executing package builds, so they do that by trying to parse shell.
  In YSH, you will be able to do that with reflection, basically like Lisp/Python/Ruby, rather than ad hoc parsing.
  ---
  I'm glad to hear you can see the effect of the optimizations ! That took a long time :-)
  Some more benchmarks here, which I'll write about: https://oils.pub/release/0.33.0/benchmarks.wwz/osh-runtime/
  
  alganet 2 days ago
  
  > uses a state machine style
  That's the way to go. I don't even consider other shallow and ad-hoc approaches as actually parsing it.
  I've been working on a state-machine based parser of my own. It's hard, I'm targetting very barebones interpreters such as posh and dash. Here's what it looks like
  https://gist.github.com/alganet/23df53c567b8a0bf959ecbc7b689...
  (not fully working example, but it gives an idea of what pure POSIX shell parsing looks like, ignore the aliases, they'll not be in the final version).
  > I'm glad to hear you can see the effect of the optimizations ! That took a long time :-)
  Yep, been testing osh since 0.9! Still a long way to go to catch up with ksh93 though, it's the fastest of all shells (even dash) by a wide margin.
  By beating bash, you also have beaten zsh (it's one of the slowest shells around).
  
  chubot a day ago
  
  Yes thanks for testing it. I hope we can get to "awk/Python speed" eventually, but I'm happy to have exceeded "bash and zsh speed"!
  And I did notice that zsh can be incredibly slow -- just the parser itself is many times slower than other shells
  A few years ago a zsh dev came on Zulip and wished us luck, probably because they know the zsh codebase has a bunch of technical debt
  i.e. these codebases are so old that the maintainers only have so much knowledge/time to improve things ... Every once in awhile I think you have to start from scratch :)
  
  cb321 2 days ago
  
  You may well already be aware, but just in case you aren't, your bin-true benchmark mostly measures dynamic loader overhead, not fork-exec (e.g., I got 5.2X faster using a musl-gcc statically linked true vs. glibc dynamic coreutils). { Kind of a distro/cultural thing what you want to measure (static linking is common on Alpine Linux, BSDs, less so on most Linux), but good to know about the effect. }
  
  chubot 2 days ago
  
  Yup, I added an osh-static column there, because I know dynamic linking slows things down. (With the latest release, we have a documented build script to make osh-static, which I tested with GNU libc and musl: https://oils.pub/release/latest/doc/help-mirror.html)
  Although I think the CALLING process (the shell) being dynamically linked affects the speed too, not just the CALLED process (/bin/true)
  I'd like to read an analysis of why that is! And some deeper measurements
  
  cb321 2 days ago
  
  The calling process being dynamically linked might impact fork() a lot to copy the various page table setups and then a tiny bit more in exec*() to tear them down. Not sure something like a shell has vfork() available as an option, but I saw major speed-ups for Python launching using vfork vs. fork. Of course, a typical Python instance has many more .so's linked in than osh probably has.
  One could probably set up a simple linear regression to get a good estimate of added cost-per-loaded .so on various OS-CPU combos, but I am unaware of a write up of such. It'd be a good assignment for an OS class, though.
packetlost 3 days ago

I don't really buy that shell / awk is "too weak" to deal with JSON, the ecosystem of tools is just fairly immature as most of the shells common tools predate JSON by at least a decade. `jq` being a pretty reasonable addition to the standard set of tools included in environments by default.
IMO the real problem is that JSON doesn't work very well at as a because it's core abstraction is objects. It's a pain to deal with in pretty much every statically typed non-object oriented language unless you parse it into native, predefined data structures (think annotated Go structs, Rust, etc.).
- comex 2 days ago
  
  I'd say that awk really is too weak. Awk has a grand total of 2 data types: strings, and associative arrays mapping strings to strings. There is no support for arbitrarily nested data structures. You can simulate them with arrays if you really want to, or you could shell out to jq, but it's definitely swimming upstream.
  Most languages aren't quite that bad. Even if they can't handle JSON very ergonomically, almost every language has at least some concept of nesting objects inside other objects.
  What about shell? Just like awk, bash and zsh have a limited number of data types (the same two as awk plus non-associative arrays). So arguably it has the same problem. On the other hand, as you say, in shell it's perfectly idiomatic to use external tools, and jq is one such tool, available on an increasing number of systems. So you may as well store JSON data in your string variables and use jq to access it as needed. Probably won't be any slower than the calls to sed or awk or cut that fill out most shell scripts.
  Now, personally, I've gotten into the habit of writing shell scripts with minimal use of external tools. If you stick to shell builtins, your script will run much faster. And both bash and zsh have a pretty decent suite of string manipulation tools, including some regex support, so you often don't actually need sed or awk or cut. However, this also rules out jq, and neither shell has any remotely comparable builtin.
  But you might reasonably object that if I care about speed, I would be better off using a real programming language!
Brian_K_White 2 days ago

The same author already had made the more thorough jawk. They explicitly said they wanted a cut down version. It's not illegal to want a cut down version of something.

izabera 2 days ago

did something along those lines a while ago https://github.com/izabera/j trying to keep an interface similar to jq

wutwutwat 3 days ago

Your scientists were so preoccupied with whether or not they could, they didn't stop to think if they should.

teddyh 3 days ago

“except Unicode escape sequences”

Brian_K_White 2 days ago

The same author also made jawk which is more thorough. This is explicitly stated to be a cut down version.

cAtte_ 2 days ago

just use nushell man

1vuio0pswjnm7 2 days ago

I use flex. Faster than awk or python. It produces relatively seems to be available "everywhere" because it is a build requirement for so many software programs. For exampole, NetBSD toolchain includes it. It is a build requirement for Linux kernel.^1 I have even used it with Interix SFU on Windows before WSL existed.

I do not use jq. Too complicated for me. Overkill. I created statically-linked program less than half the size of official statically-linked jq that is adequate for own needs. flex is a build requirement for jq.

1. https://www.kernel.org/doc/Documentation/admin-guide/quickly...

1vuio0pswjnm7 2 days ago

Using C program generated with flex is faster than AWK or Python. Flex is required to compile the awk interpreter, as well as the jq interpreter. I use a custom scanner generated by flex to process JSON.