Show HN: I wrote a book on GNU grep and ripgrep

182 points by asicsp 5 years ago

My book on "GNU grep and ripgrep" is free to download today and tomorrow [1][2]

Code snippets, example files and sample chapters are available on GitHub [3]

The book uses plenty of examples, and regular expressions are covered from scratch as well. It is suitable for beginners and also serves as a reference. I hope you find it useful; I would be grateful for your feedback and suggestions.

I used pandoc+xelatex [4] to generate the pdf.
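
In case you're curious, the basic shape of the command was roughly this (the exact options and file names are covered in [4]; the ones below are just illustrative):

    # illustrative only; the real input files and options are described in [4]
    pandoc chapters/*.md -o gnugrep_ripgrep.pdf --pdf-engine=xelatex --toc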

[1] https://gumroad.com/l/gnugrep_ripgrep

[2] https://leanpub.com/gnugrep_ripgrep

[3] https://github.com/learnbyexample/learn_gnugrep_ripgrep

[4] https://learnbyexample.github.io/tutorial/ebook-generation/c...

etaioinshrdlu 5 years ago

For fun casual reading, try comparing the source code of any BSD utility vs. GNU.

BSD tail: https://searchcode.com/codesearch/view/457515/

GNU tail: https://github.com/coreutils/coreutils/blob/master/src/tail....

Usually the BSD one is short and sweet and the GNU one is a bit complex.

GNU utilities may be high performance but they tend to be hard to understand.

  • acuozzo 5 years ago

    > but they tend to be hard to understand

    It's not just about performance. It's also about portability. GNU utilities are intended to work on various nixen rather than just GNU/Linux.

    I mean, hell, the GNU tail you link to includes an explicit workaround for odd select() behavior on AIX.

iamnotacrook 5 years ago

This should be run past an editor. The text is missing articles (a, an, the) all over; something obviously got lost in translation. It's fine for a blog or a comment on HN or whatever, but if you're asking for money...

  • kirke 5 years ago

    You know, you really don't need to tear down his work just because you don't like the way it reads.

    Lots of the missing articles are around command line arguments, where the correct usage is ambiguous anyway.

    I don't find it distracting. The content of these books is an excellent resource on an immeasurably useful topic.

    As for "... but if you're asking for money...," this post is literally giving away his work. The consumer is given the option to pay nothing, or whatever is reasonable according to their own valuation.

    If you, as the consumer, value the work less because it fails to conform to language rules, then you are free to reduce what you pay the author accordingly.

  • sa46 5 years ago

    Somewhat off topic: how would one go about getting good technical editing for something like a blog post, or for a book like the OP's? Is it expensive?

    • timClicks 5 years ago

      Post an ad for a "development editor" who specializes in technical writing on Upwork or similar. You get what you pay for.

  • m463 5 years ago

    I've always wondered why book reader software doesn't have a "highlight and submit typo feedback" function right up front.

    Of course, that might lead to "crowdsourced" editing, which might be a horrible thing to create.

rmbryan 5 years ago

Please please please hire an editor. You have good content that needs editing.

  • asicsp 5 years ago

    Yep, this is consistent feedback I've gotten for my books. I'll try to get them edited, at least for the obvious mistakes like missing articles.

    • apkallum 5 years ago

      Hello, I'm happy to edit this for free or lunch in Eastern Canada - email in profile.

    • dingo_bat 5 years ago

      I feel like many of these mistakes could be picked up just by running the text through MS Word and looking at the squiggly lines.

dewey 5 years ago

Slightly related: I only recently found out that the grep (BSD) included in macOS is way slower than the GNU equivalent that you can install from Homebrew. If you've ever had to work with big files and were confused about why it was so slow, give it a try.

http://jlebar.com/2012/11/28/GNU_grep_is_10x_faster_than_Mac...
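
For example (assuming Homebrew still installs GNU grep with a "g" prefix by default, as it did when I last checked; the file and pattern names here are made up):

    brew install grep                                    # GNU grep, available as "ggrep"
    time ggrep -c 'some_pattern' big_file.json           # GNU grep
    time /usr/bin/grep -c 'some_pattern' big_file.json   # stock BSD grep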

  • opencl 5 years ago

    And ripgrep is quite a lot faster than GNU grep[1], especially if you need unicode support.

    [1] https://blog.burntsushi.net/ripgrep/

    • dewey 5 years ago

      It depends. I benchmarked all of the usual suspects (fgrep, GNU grep, BSD grep, ripgrep, sift) on the same 30GB line-separated JSON file last week, and GNU grep was the fastest one.

      I didn't have a scientific testing setup though. I just ran them one by one like this:

      time cat title_complete_json.txt | grep tt9184994 >> result

      • opencl 5 years ago

        Yeah, rg is definitely not always the fastest. Test for your own use case and all.

        In pretty much any case involving more complicated queries and/or Unicode, I have found rg to be the fastest by a very wide margin. Also, if you're searching multiple files, rg is multithreaded by default, without having to go through something like xargs or parallel (rough sketch below). If you're just searching for an ASCII string literal in a single file, rg probably gets slowed down by its UTF-8 validation.
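
        A rough sketch of that multi-file difference (paths and pattern are made up):

            # plain grep: parallelism has to be wired up by hand
            find . -type f -print0 | xargs -0 -P "$(nproc)" grep -n 'pattern'
            # ripgrep: recursive and multithreaded by default
            rg -n 'pattern'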

        • burntsushi 5 years ago

          Thanks for noticing ripgrep's performance on Unicode. :-)

          > If you're just searching for an ASCII string literal in a single file rg probably gets slowed down by its UTF-8 validation.

          This is most definitely wrong. Firstly, ripgrep doesn't do any UTF-8 validation. That would make it much much slower. Secondly, searching a simple literal is one of the cases where ripgrep should actually do better than GNU grep in many cases. This is because of the slightly smarter way in which ripgrep keeps its vectorized loop active more often than GNU grep does by trying to search rarer bytes. I talk about this quite a bit here, which even includes a benchmark where I measured ripgrep at 2x the speed of GNU grep for a simple literal search: https://blog.burntsushi.net/ripgrep/#subtitles-literal

          Anyone can try this for themselves:

              $ cd /tmp
              $ curl 'https://object.pouta.csc.fi/OPUS-OpenSubtitles/v2016/mono/en.txt.gz' | gzip -cd > subtitles.en.txt
              $ pv < subtitles.en.txt > /dev/null
              9.28GiB 0:00:01 [4.89GiB/s] [======================================================================================================================================================>] 100%
              $ time grep ZQZQZQZQ < subtitles.en.txt
          
              real    8.713
              user    7.495
              sys 1.212
              maxmem  9 MB
              faults  0
              $ time rg ZQZQZQZQ < subtitles.en.txt
          
              real    1.857
              user    0.573
              sys 1.282
              maxmem  9 MB
              faults  0
          
          On my system, /tmp is a ramdisk, so this doesn't factor in I/O. If you look up this thread, you'll note that someone was trying to search a 30GB file. Unless that person could fit that file into memory, it's very likely that they were just measuring the I/O bandwidth of their machine.

          • opencl 5 years ago

            Thanks for the detailed explanation and your work on rg. I use it all the time, and I really appreciate the performance with Unicode, because I do a lot of searching in large directories of files that are largely in Japanese.

            • burntsushi 5 years ago

              No problem! If it's not too much trouble and you can share your data (and queries), I'm always happy to see more details about use cases that take advantage of ripgrep's Unicode support and its associated performance characteristics. At some point, I'm going to have to redo my previously published benchmarks, and it would be great to get some real use cases like yours in there. :-)

              • opencl 5 years ago

                A lot of the data I work on is unfortunately NDAed, but I'll try to find something that can be shared; no guarantees though. What's a good way to contact you with this stuff?

                • burntsushi 5 years ago

                  Thank you! Just file an issue and that should be good enough!

      • davemp 5 years ago

        AFAIK ripgrep always uses regular expressions. So for straight substring matching grep will usually be faster.

        • burntsushi 5 years ago

          It does not. Could you please say where you got this impression so I could fix it? See my other comments in this very thread. ripgrep actually has more sophisticated literal extraction & searching than GNU grep does. Moreover, many of these optimizations are inside the regex engine itself, so even if it looks like you're using a regex, literal optimizations may still be applied.
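
          For example, both of these should hit the literal fast paths on the subtitles file from my other comment (no timings here; try it yourself):

              # a plain literal
              rg 'Sherlock Holmes' subtitles.en.txt
              # written as a regex, but it's an alternation of literals, so the
              # engine can still run it as a multi-literal search
              rg 'Sherlock Holmes|John Watson' subtitles.en.txt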

          • davemp 5 years ago

            My mistake. I read the first line of the ripgrep readme and assumed that, because utils like `find` run slower with regex matching for the same searches, ripgrep would be similar.

            > ripgrep is a line-oriented search tool that recursively searches your current directory for a regex pattern

            I'm not sure if many other people would jump to similar conclusions. So the readme might not really need a revision.

      • Someone 5 years ago

        What’s the rationale for using >> ?

        I would try to get as small an output as possible, and not hit the file system at all, for example by using

          time cat title_complete_json.txt | grep tt9184994 | grep -v tt9184994

      • Scarbutt 5 years ago

        The infamous "useless use of cat"

        • hyper_reality 5 years ago

          It annoys me whenever somebody brings this up. People do this because it's easier to modify commands at the end of your prompt than near the beginning. Especially in this case, where the grandparent wants to quickly switch between benchmarking grep, ripgrep etc.

          Not useless at all.

          • naniwaduni 5 years ago

            Though a neat thing to know is that redirections can appear anywhere on the line in Bourneish shells:

                time < title_complete_json.txt grep tt9184994
            
            As it happens, when benchmarking grep-like tools on large files, this is actually somewhat likely to make a significant difference since the "useless" use of cat actually forces the file to be read sequentially as a stream (because it's piped through), while a process receiving a file on standard input can treat it as a file and e.g. mmap it.

            • hyper_reality 5 years ago

              I found a benchmark that suggests the performance gap between a pipe and input redirection is insignificant for most workloads, but there would indeed start to be a difference for the exact case discussed here (grepping a large file): http://oletange.blogspot.com/2013/10/useless-use-of-cat.html

              Thanks for your tip, it's clearly the best of both worlds for this purpose.

              • burntsushi 5 years ago

                Incidentally, ripgrep will actually be faster in this case generally (on Linux at least) when given an explicit file path, since that lets it use memory maps. For example:

                    $ time rg QZQZQZQZ < subtitles.en.txt
                
                    real    1.842
                    user    0.552
                    sys     1.287
                    maxmem  9 MB
                    faults  0
                
                    $ time rg QZQZQZQZ subtitles.en.txt
                
                    real    1.210
                    user    0.776
                    sys     0.433
                    maxmem  9510 MB
                    faults  0
                
                (See my other comments in this thread for where to get subtitles.en.txt.)

        • nsajko 5 years ago

          To clarify, it would have been prettier (conceptually less clunky and shorter) to run this:

          time grep tt9184994 < title_complete_json.txt >> result

          • catucia 5 years ago

            grep takes a file directly, so this could simply be:

            time grep tt9184994 title_complete_json.txt >> result

        • adrusi 5 years ago

          That's relevant in some scripts, where unnecessary forks might actually add up to some meaningful resource usage. Not relevant when typed at a tty.

        • mnbvkhgvmj 5 years ago

          Surely using cat like this is more in line with the unix philosophy?

otterpro 5 years ago

I just finished reading the book, and I found it very helpful, especially for beginner grep/ripgrep users. Thank you for making it available. I love the Exercise section.

Are there plans to make the solutions to the exercises available? I'm stuck on this question:

    Ripgrep
    a) For sample.md input file, match all lines containing ruby irrespective of case, but not if 
    it is part of code blocks that are bounded by triple backticks.

  • asicsp 5 years ago

    Thanks for the wonderful feedback, makes me happy :)

    I'd prefer to help readers with hints rather than making the solutions public. Give it your best shot, and if you are stuck with nowhere to go, you can contact me via the email/twitter mentioned in the Preface chapter.

    Hint for this question: there are multiple possible solutions. I used the -P option for its skipping feature (see the 'Perl Compatible Regular Expressions' chapter) to avoid the code blocks.
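
    To illustrate the skipping idea in a different context (so as not to give away the answer; the file name below is made up): one alternative matches the part you want to ignore and is then forced to fail, so only the other alternative can produce matches.

        # match 'ruby' case-insensitively, but not when it's inside double quotes
        # (*SKIP)(*F) are PCRE verbs, hence the -P option
        rg -i -P '"[^"]*"(*SKIP)(*F)|ruby' sample.txt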

cryo 5 years ago

Enjoyed the example chapter very much and just ordered via Gumroad.

Thanks, I've already learned a few new things which will be handy in the future.

jrumbut 5 years ago

Very nice work!

What was your writing process, and did you make any arrangements before getting started?

One of my forever back-burner projects is a deep(ish) dive like this into one of the tools that a lot of us kind of know how to use. I'm glad to see you went and did it!

  • asicsp 5 years ago

    Thanks.

    My case is kind of a weird one. I left my job after just 6 years in the corporate world, due to various reasons. To cut a long story short, I went through depression, tried various things which didn't work out, etc. I conduct basic cli/scripting workshops at college, so I tried improving my materials - got active on stackoverflow, reddit, etc. and started maintaining repos on github [1]. It grew, and finally last year I got the courage to try and self-publish the tutorials I had collected over 2+ years in the form of books.

    'Deep dive' is a nice way to sum it up; it is amazing that there's so much to write about a tool. I knew probably about 5-10 grep options when I was at my job, and didn't know regular expressions that well. So being able to write a book with confidence is such a satisfying experience. I still have plenty to improve (as can be observed from the feedback in this thread), and I look forward to writing many more books.

    One simple piece of advice I can give you is to start maintaining notes on the tool you are interested in. Maybe a blog post or a repo; that way you can share it with others and get feedback. Keep adding one command at a time, and in a few months you'll have plenty to write about. Good luck!

    [1] https://github.com/learnbyexample

chris_wot 5 years ago

I purchased it for a guy I know who is new to Unix overall. I spent some money (not a lot, sorry).

dingo_bat 5 years ago

Why are you giving it away for free? Just curious, not trying to troll.

  • asicsp 5 years ago

    Many reasons. Mainly to get feedback and reviews, as mentioned by another user. I don't have a prominent blog or social media account, so getting readers is quite difficult for me. I'm inspired by people like Al Sweigart and Allen Downey, who provide excellent content on Python for free. I'd also like to open source my books sometime in the future.

  • sachdevap 5 years ago

    It's a common model these days: give the book away to get some word-of-mouth publicity and reviews, and get it off the ground.

    • chris_wot 5 years ago

      I paid, mostly because you were actually giving it away for free.

rambojazz 5 years ago

When someone writes about free software, using only free tools and free documentation, I'm disappointed to see that the end product is not free. Just saying.

  • asicsp 5 years ago

    Maybe others can afford to do so; right now books are my main source of income, and I am still burning through my savings.

    Also, do you mean that one shouldn't make money using free software, which gives you the freedom to make money?

    • rambojazz 5 years ago

      No, absolutely not. I wasn't talking about money; I was talking about free licenses. You used software released under a free license, and documentation available under a free license, to write about software released under a free license, but your end product is a work without a free license. All I'm saying is that you took all the benefits that come with free licenses, but didn't contribute back with a freely licensed work. You can do that, of course, and again I'm not talking about money. It's just disappointing in a sense.

      • asicsp 5 years ago

        Sorry, we'll have to agree to disagree, and money does play a huge role when your living depends on what you create. A free license gives you the freedom to choose, but if you are disappointed when people exercise that choice, then what is the point of the free license you talk about?

        Programming languages like Python and Ruby are also free software that come with documentation. Should books written for those languages also come with a free license?

        • rambojazz 5 years ago

          I don't understand why you are so defensive. I repeat myself: I'm not talking about money; you can sell for however much you want, and I never suggested that you should work for free. It's almost like you're not reading what I'm writing. Anyway, it sounds like you have a very opportunistic idea of what a free license is, like "all that matters to me is that I can use this material without paying". You are entitled to that; I said this in my previous comment as well. What is disappointing is the thought that somebody would consider these free resources as something merely to be exploited at no cost. If somebody has worked to prepare a free buffet, you don't just take everything for yourself because it's free anyway and then tell the other people that they have no reason to complain because you did nothing wrong. You can, but what kind of person does that? I don't know if this is the best example, but that's the feeling.