Find, Filter, and Count Text: How to become a badass with Grep

Last updated: 2021-12-22


grep is a command-line tool that searches for lines matching a regular expression pattern. In Unix-like systems, like Linux or macOS, everything is a file, or more precisely, everything is a stream of bytes. A file is a collection of bytes that you can read and/or write. A reference to an open file is called a file descriptor (fd). This approach allows us to use the same set of system calls to access any such resource, and therefore the same set of tools, like grep.

Write programs to handle text streams, because that is a universal interface.
— Doug McIlroy, Unix pioneer

If you know how to handle text streams, you can make your Linux life a lot easier.
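
As a small illustration of the "stream of bytes" idea, the sketch below feeds the same two lines to grep in three ways: as a named file, as redirected standard input, and through a pipe. The file name and its contents are made up for the example.

```shell
#!/bin/bash
# Hypothetical sample data, created only for this illustration.
tmp=$(mktemp -d)
printf 'apple\nbanana\n' > "$tmp/fruit.txt"

from_file=$(grep "an" "$tmp/fruit.txt")             # grep opens the file itself
from_stdin=$(grep "an" < "$tmp/fruit.txt")          # the shell connects the file to stdin (fd 0)
from_pipe=$(printf 'apple\nbanana\n' | grep "an")   # another program's output arrives on stdin

# All three variables hold the same matching line.
printf '%s\n' "$from_file" "$from_stdin" "$from_pipe"
rm -r "$tmp"
```

In each case grep just reads bytes and prints the lines that match, which is why it slots into so many different workflows.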

Of these great text-processing tools, grep is one of the oldest. Everything it does is based on finding regular patterns in lines of text, and printing them. Yet, despite its age and its simplicity, grep provides you a large amount of power and flexibility.

As you learn to use grep, you’ll find more and more problems that have easy grep solutions.

Here are some typical uses of grep:

Finding where an expression occurs in a file or directory
Counting how many times an expression occurs
Filtering output from a different program.

grep combines very well with other programs. The last part of this how-to section gives you an example of how to use grep to make a simple reporting tool.

Tutorials: searching for text patterns in Romeo and Juliet

Finding text patterns is probably the most common use of grep.

First, we need a source file to work with. Any text that has multiple lines is a good place to start. How about a play by Shakespeare? (Server logs are a more traditional example).

If you want to follow along, you can use Project Gutenberg’s public-domain version of Romeo and Juliet. You can download the file from your shell using curl, like this:

curl -o romeo-and-juliet.txt https://www.gutenberg.org/files/1513/1513-h/1513-h.htm

Now, let’s define a goal. In the play, “Nurse” is an important person. Our task is to find the answer to these questions:

How many times does the word “nurse” get spoken in the play?
How many times does the person, “Nurse,” speak?

Tutorial 1: Find all lines where the word “nurse” is spoken

Inspect the file for patterns.
To make precise queries, it helps to know something about the structure of the file. A quick look at romeo-and-juliet.txt tells us that:
- A person’s name is introduced in all capitals
- A person’s speech begins with their name in all-caps followed by a period, e.g. ROMEO.
- All dialogue is separated by a blank line
- Stage direction is written in brackets with underscores, e.g. [_Exit Nurse._]
Find a simple expression.
Let’s look for the expression nurse:
```
grep "nurse" romeo-and-juliet.txt
```
Surprisingly, this yields only two results, a very small number for a major character!
But, the results make sense. In this play, “Nurse” is a proper name. So, the character’s name will start with a capital N.
Search again with proper capitalization:
```
grep "Nurse" romeo-and-juliet.txt
```
Now we’ve got many more results.
Search for multiple case patterns
If you want to find all times that the word “Nurse” appears in the play’s dialogue, the first obvious thing to do would be a case-insensitive search.
grep has an -i flag, which lets you search for all cases.
```
grep -i "Nurse" romeo-and-juliet.txt
```
But wait: now we have too many results! Many of the output lines just say NURSE. These lines indicate who’s speaking the dialogue.
For now, we want to find only lines where the word “nurse” is spoken inside the text. That is, we want to print lines with the expressions “Nurse” and “nurse”.
Use bracket expressions to find multiple combinations
To search for Nurse and nurse together, the simplest way is with a bracket expression. Bracket expressions match any character in the brackets.
```
grep "[nN]urse" romeo-and-juliet.txt
```
Almost there! But there are still the stage directions, like this: [_Exit Nurse._]
Filter output using regex, pipes, and the -v option.
We already have most of the output we need. We just need to filter out lines where Nurse appears between the expressions [_ and _]. To do this, we can use a regular expression: \[_.*Nurse.*_]. The leading [ is escaped so that grep does not read it as the start of a bracket expression.
.* means zero or more of any character.
A simple way to narrow further is to pipe our output into another grep command. The -v flag inverts matches: it prints only lines that do not contain the expression.
```
grep "[nN]urse" romeo-and-juliet.txt | grep -v "\[_.*Nurse.*_]"
```
Great! We have successfully…oh, wait.
It seems that Gutenberg uses a special format for stage directions when a person enters. So, we also need to filter out lines that look like this:
```
 Re-enter Nurse.
 Enter Nurse.
```

Use the -e flag for multiple expressions

grep "[Nn]urse" romeo-and-juliet.txt \
  | grep -v -e " Re-enter\| Enter" -e "\[_.*Nurse.*_]"

👉 We must escape the | character as \|. In grep’s default basic regular expressions, an unescaped | is matched literally, while \| means “or” (a GNU extension).

Congrats! Now we’ve really found every line where the word “nurse” is spoken in Romeo and Juliet. To do this, we used a combination of search, simple regex, and filters. This script is not very efficient or robust, but, for the purpose of our task, it’s acceptable.
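
If you don't have the play at hand, here is a minimal sketch of the same pipeline on a few made-up lines that imitate the layout of the text file (they are not quotations). It also shows a possible one-pattern variant using -E, assuming GNU grep.

```shell
#!/bin/bash
# Made-up sample lines that imitate the layout of the play's text file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
NURSE.
A made-up line of dialogue.
JULIET.
Sweet nurse, tell me the news.
Good Nurse, speak!
 Enter Nurse.
 Re-enter Nurse.
[_Exit Nurse._]
EOF

# Two-step version from the tutorial: search, then filter with -v.
piped=$(grep "[Nn]urse" "$sample" \
  | grep -v -e " Re-enter\| Enter" -e "\[_.*Nurse.*_]")

# Same filter as one extended regex: with -E, | and () need no backslashes.
extended=$(grep "[Nn]urse" "$sample" \
  | grep -vE " (Re-enter|Enter)|\[_.*Nurse.*_]")

# Only the two spoken lines remain.
printf '%s\n' "$piped"
rm "$sample"
```

Both variants keep the two lines of dialogue and drop the speaker label and the stage directions.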

The text is not going to change, and it’s not very long. Once our search prints what we need, we probably don’t need to run it many more times.

If the text were long and dynamically changing, and the script needed to be run often, we’d probably need a more robust solution. See When grep is not so great.

Tutorial 2: Find all times Nurse speaks

This task is simpler. We’ve already discovered that the text introduces dialogue by printing the speaker’s name in all caps, followed by a period.

Use grep to print all, broad matches
```
grep "NURSE" romeo-and-juliet.txt
```
Inspect output for false positives
Everything looks good, except the first printed line.
```
NURSE to Juliet.
```
We need to include the period after the character’s name.
Use grep to print more specific matches
In this case, escape the . character. If you don’t, grep will match any character that follows the expression NURSE.
```
grep "NURSE\." romeo-and-juliet.txt
```
Use the -c option to count lines.
```
grep -c "NURSE\." romeo-and-juliet.txt
```

Now you know: the Nurse speaks 90 times in the play.
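
One thing to keep in mind: -c counts matching lines, not individual matches. That is fine here, because each speaker label sits on its own line. When a pattern can occur several times on one line, a common workaround is to combine -o with wc -l. A minimal sketch on made-up sample text:

```shell
#!/bin/bash
# Made-up sample: the word appears three times, spread over two lines.
sample=$(mktemp)
printf '%s\n' "nurse, nurse, come back" "good nurse" "no match here" > "$sample"

lines=$(grep -c "nurse" "$sample")                 # lines with at least one match
occurrences=$(grep -o "nurse" "$sample" | wc -l)   # -o prints each match on its own line

echo "matching lines: $lines"
echo "occurrences: $occurrences"
rm "$sample"
```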

How to grep

To become a grep master, there are two things you need to memorize:

The command options
The regex patterns

Learning these is mostly a matter of memorization and practice.

After you’ve memorized these, you’ll develop your own method for using grep effectively. The steps in the second tutorial demonstrate a common process for problem-solving with grep (and all regex).

First, make a general search. Don’t get too complicated!
Inspect your output for false positives and for missing lines.
Make a more precise query, inspect again.
Repeat until grep prints exactly what you need

Now that you’ve used grep to solve some specific problems, it’s time to look at how to use grep in general cases.

How to `grep` over multiple files

There are a few ways to grep over multiple files:

With multiple arguments
With file globbing
With recursive search.

To grep with multiple files, just pass the files as arguments.

grep "bash" file1.sh file2.sh

You can also grep using file globs. This command checks for the expression bash across all shell files in the directory.

grep "bash" *.sh

The -r option searches recursively through directories. It is one of the handiest options.

This command searches your scripts directory for lines that invoke a shell through a bin/ path:

grep -r "bin/.*sh" ~/scripts

This should match all shells invoked, including /bin/bash, /bin/dash, bin/env bash, etc.

To add a little context to the output, use the -A option. Here, -A 2 prints two lines after each match.

grep -rA 2 "bin/.*sh" ~/scripts
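
To experiment with recursive search without touching your real scripts, you can build a throwaway directory first. The sketch below does that and adds two options not used above: -l (print only file names) and -n (print line numbers). All file names are made up.

```shell
#!/bin/bash
# Throwaway directory tree standing in for ~/scripts (all names are made up).
dir=$(mktemp -d)
mkdir -p "$dir/tools"
printf '%s\n' '#!/bin/bash' 'echo hello' > "$dir/hello.sh"
printf '%s\n' '#!/usr/bin/env bash' 'echo deploy' > "$dir/tools/deploy.sh"
printf '%s\n' 'just some notes' > "$dir/notes.txt"

# -r descends into subdirectories; -l prints only the names of matching files.
files=$(grep -rl "bin/.*sh" "$dir" | sort)
printf '%s\n' "$files"

# -n adds line numbers, which helps when you open the file afterwards.
numbered=$(grep -rn "bin/.*sh" "$dir" | sort)
printf '%s\n' "$numbered"
rm -r "$dir"
```

The output is piped through sort because the order in which grep walks a directory is not guaranteed.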

How to exclude results

There are multiple ways to exclude searches.

You might want to use the -v option to exclude lines. This was demonstrated in the first tutorial.

In a recursive search, you might want to skip files with a certain extension. In these cases, you can use the --exclude option. For example, to exclude yaml files, do something like this:

grep -rA 2 --exclude="*.yaml" "bin/.*sh" ~/scripts

Or, to exclude all .git directories, use --exclude-dir:

grep -rA 2 --exclude-dir=".git" "bin/.*sh" ~/scripts
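
Here is a small sketch that shows the effect of both options on a throwaway directory, assuming GNU grep. It also uses --include, the counterpart of --exclude, which limits the search to files matching a glob. The directory layout is invented for the example.

```shell
#!/bin/bash
# Throwaway tree (made-up names): one script, one yaml file, one .git hook.
dir=$(mktemp -d)
mkdir -p "$dir/.git/hooks"
printf '%s\n' '#!/bin/bash' > "$dir/run.sh"
printf '%s\n' 'shell: /bin/bash' > "$dir/config.yaml"
printf '%s\n' '#!/bin/sh' > "$dir/.git/hooks/pre-commit"

# Without filters, all three files match.
all=$(grep -rl "bin/.*sh" "$dir" | wc -l)

# Both exclusions together leave only the script.
filtered=$(grep -rl --exclude="*.yaml" --exclude-dir=".git" "bin/.*sh" "$dir")

# --include is the counterpart: search only files whose name matches the glob.
only_sh=$(grep -rl --include="*.sh" "bin/.*sh" "$dir")

echo "unfiltered matches: $all"
echo "filtered: $filtered"
rm -r "$dir"
```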

How to combine `grep` with other programs

grep can read from standard input, so it’s often handy to use grep to filter output from another program. For example, to print all Firefox processes, you could run this command.

ps -ax | grep "firefox"

You can also pipe grep output to another program. For example, if a command’s output is too large to fit on your screen, you might want to pipe grep to less.

grep -rA 2 "bin/.*sh" ~/scripts | less
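
Pipes also let you use grep as a test inside a script. The sketch below uses a made-up function, list_processes, as a stand-in for a real program's output. It first filters and counts lines, then uses -q, which prints nothing and reports only through its exit status.

```shell
#!/bin/bash
# Stand-in for the output of some other program (a made-up process list).
list_processes() {
  printf '%s\n' \
    "101 /usr/bin/firefox" \
    "102 /usr/bin/vim notes.txt" \
    "103 /usr/lib/firefox/firefox -contentproc"
}

# Filter: keep only lines mentioning firefox, then count them with wc.
firefox_count=$(list_processes | grep "firefox" | wc -l)
echo "firefox lines: $firefox_count"

# -q prints nothing; the exit status alone says whether a match was found.
if list_processes | grep -q "vim"; then
  editor_running=yes
else
  editor_running=no
fi
echo "vim running: $editor_running"
```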

How to use `grep` with regex

Knowing how to use regex patterns can be really useful for grepping.

Here’s the last Shakespeare example: how would you search for every derivative form of the word love in Romeo and Juliet?

A match should print lines with forms like “lovers” or “loving,” but not lines with only the base word, “love.”

First, you can make your search case-insensitive.
```
grep -i "love" romeo-and-juliet.txt
```
However, this prints false positives, like “glove”.
Add the word boundary character, \b.
```
grep -i "\blove\b" romeo-and-juliet.txt
```
Now you’ve got only lines that contain the whole word love. But the task requires only derivatives of the word.
Extend the base pattern with the . character.
The base expression is lov. This expression is in all derivatives, like “loving” and “lover”. To match any single character, use the . character.
```
grep -i "\blov.\b" romeo-and-juliet.txt
```
Unfortunately, this command prints exactly what we don’t want: only lines with the word “love”.
Use the \w word character to search for other word characters.
\w matches any word character, i.e. letters, digits, and the underscore.
```
grep -i "\blov.\w\b" romeo-and-juliet.txt
```
Much better! But this matches only 5-letter derivations, like “lover”.
Expand the pattern with \+
The \+ operator matches one or more instances of the preceding item. In this instance, it matches one or more \w characters (i.e., any word characters).
```
grep -i "\blov.\w\+\b" romeo-and-juliet.txt
```
Use the -E flag to avoid escaping special characters, like +.

Bingo! Let’s look at the first five results:

lovers
loving
lovers
lovers
lov’d

Besides predictable forms, like “loving”, there are also surprising antiquated forms, like “lov’d”. A good grep can be pleasantly surprising.
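
To see the basic and extended forms of this pattern side by side, here is a minimal sketch on made-up sample lines (not quotations from the play), assuming GNU grep for \b and \w. It adds the -o option, which prints only the matched text instead of the whole line.

```shell
#!/bin/bash
# Made-up sample lines, not quotations from the play.
sample=$(mktemp)
cat > "$sample" <<'EOF'
My love is here
Two lovers met
A glove was lost
She was loving and lov'd
EOF

# Basic regex (the default): + must be written as \+.
basic=$(grep -io "\blov.\w\+\b" "$sample")

# Extended regex (-E): + works without a backslash.
extended=$(grep -ioE "\blov.\w+\b" "$sample")

# -o prints only the matched text, one match per line.
printf '%s\n' "$extended"
rm "$sample"
```

The lines with "love" and "glove" produce no output, while "lovers", "loving", and "lov'd" are each printed on their own line.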

How to use `grep` with command substitution

Maybe you want to search for a regular expression that changes dynamically. In these cases, you can use command substitution and variables to pass expressions to grep.

For example, consider a log file where each line starts with a date, in the format YYYY-MM-DD. Something like this:

2021-12-02 <more-text>
2021-12-02 <more-text>
2021-12-01 <more-text>
2021-11-29 <more-text>
...
1970-01-01 <more-text>

Perhaps you want to know how many times an event was logged in the current month. With command substitution, you could use grep to create a dynamic report:

#!/bin/bash
# Print out how many times an event occurred this month

month=$(date +%B) # gets name of month
search=$(date +%Y-%m) # makes a search term from date, in YYYY-MM format
file=long-text.log
count=$(grep -c "^$search" "$file") # uses expression to count events in $file
echo "This month, $month, has logged $count events."
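
To try the idea without a real log, the sketch below first writes a small throwaway log file and then counts this month's entries. The log contents are invented. The ^ anchor ties the match to the start of the line, so a date that appears later in a line is not counted.

```shell
#!/bin/bash
# Build a throwaway log: two entries dated this month, one from 1970.
file=$(mktemp)
this_month=$(date +%Y-%m)   # current year and month, in YYYY-MM format
printf '%s\n' \
  "$this_month-02 disk check" \
  "$this_month-01 backup" \
  "1970-01-01 epoch" > "$file"

# The $(date ...) output above becomes part of the pattern.
# ^ anchors the match to the start of the line.
count=$(grep -c "^$this_month" "$file")
echo "This month has logged $count events."
rm "$file"
```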

When is grep not so great?

The beauty of grep is its simplicity. Don’t get too complicated.

In the examples of the tutorial and how-to, the data is well-structured, and the queries are relatively simple.

For example, the Romeo and Juliet text has a very precise way of defining when dialogue happens, and how stage direction happens. The log file might be a very long file, but it also has very regular patterns. Every line begins with a date in one format.

For simple searches, or for matches across a large set of files, grep is very powerful.

However, if you want to manipulate text, or work with specific fields of a file, you’ll probably want to use a more specific tool, like sed or awk.
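
As a rough idea of what "working with fields" looks like, here is a small awk sketch that counts log events per month, something that would take one grep call per month otherwise. The log lines are made up and follow the YYYY-MM-DD format from the previous section.

```shell
#!/bin/bash
# Made-up log lines in the "YYYY-MM-DD <text>" format.
log=$(printf '%s\n' "2021-12-02 login" "2021-12-01 logout" "2021-11-29 login")

# awk splits each line into fields; $1 is the date and substr keeps YYYY-MM.
# The array n counts events per month; sort makes the output order stable.
per_month=$(printf '%s\n' "$log" \
  | awk '{ n[substr($1, 1, 7)]++ } END { for (m in n) print m, n[m] }' \
  | sort)
printf '%s\n' "$per_month"
```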

For advanced text searching, like with natural language processing, it’s probably time to use a language with dedicated libraries to help you achieve your task.

Supplementary links

Want more grep? Here are some grep-related links:

Video: Brian Kernighan talks about the origins of grep. One night in 1971.
The GNU grep manual. Everything that’s possible with GNU’s implementation of grep.
Why is GNU grep so fast? A technical discussion of an implementation.

The whole point with “everything is a file” is not that you have some random filename (indeed, sockets and pipes show that “file” and “filename” have nothing to do with each other), but the fact that you can use common tools to operate on different things.
[…]
The UNIX philosophy is often quoted as “everything is a file”, but that really means “everything is a stream of bytes”
— Linus Torvalds, Newsgroups: fa.linux.kernel
