6 Dealing with Regular Expressions

A regular expression (aka regex) is a sequence of characters that defines a search pattern, mainly for use in pattern matching with text strings. Typically, regex patterns consist of a combination of alphanumeric characters and special characters. A pattern can be as simple as a single character, or it can be more complex and include several characters.

Regex are not a specific data type; however, to work efficiently with character strings you will likely need to understand some basic regex.

To understand how to work with regular expressions in R, we need to consider two primary features of regular expressions. One has to do with the syntax, or the way regex patterns are expressed in R. The other has to do with the functions used for regex matching in R. In this chapter, we will cover both of these aspects. First, I cover the syntax that allows you to perform pattern matching functions with meta characters, character and POSIX classes, and quantifiers. This will provide you with the basic understanding of the syntax required to establish the pattern to find. Then I cover the functions provided in base R and in the stringr package you can apply to identify, extract, replace, and split parts of character strings based on the regex pattern specified.

6.1 Regex Syntax

At first glance (and second, third,…) the regex syntax can appear quite confusing. This section will provide you with the basic foundation of regex syntax; however, realize that there is a plethora of resources available that will give you far more detailed, and advanced, knowledge of regex syntax. To read more about the specifications and technicalities of regex in R you can find help at help(regex) or help(regexp).

6.1.1 Metacharacters

Metacharacters consist of non-alphanumeric symbols such as . \ | ( ) [ { $ * + ?. To match a metacharacter literally in R you need to escape it with a double backslash “\\”. The following displays the general escape syntax for the most common metacharacters:

Figure 6.1: Escape syntax for common metacharacters.

The following provides examples to show how to use the escape syntax to find and replace metacharacters. For information on the sub and gsub functions used in this example visit the main regex functions section.
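As a minimal sketch of the escape syntax, the snippet below uses sub() and gsub() on a made-up string (the variable x and its contents are illustrative only, not taken from the original examples). It also shows what happens when a metacharacter is left unescaped.

```r
# Illustrative string containing several metacharacters
x <- "I paid $15 for 3.5 lb (on sale)."

# sub() replaces only the first match; "$" must be escaped as "\\$"
first_dollar <- sub("\\$", "USD ", x)
first_dollar
# "I paid USD 15 for 3.5 lb (on sale)."

# gsub() replaces every match; "\\." matches a literal period
all_periods <- gsub("\\.", "!", x)
all_periods
# "I paid $15 for 3!5 lb (on sale)!"

# Escape both parentheses and combine them with the alternation operator |
no_parens <- gsub("\\(|\\)", "", x)
no_parens
# "I paid $15 for 3.5 lb on sale."

# Unescaped, "." matches any character, so everything is replaced
unescaped <- gsub(".", "!", "a.b")
unescaped
# "!!!"

# fixed = TRUE treats the pattern as a literal string, so no escaping is needed
literal <- gsub(".", "!", "a.b", fixed = TRUE)
literal
# "a!b"
```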

6.1.2 Sequences

To match a sequence of characters we can apply short-hand notation which captures the fundamental types of sequences. The following displays the general syntax for these common sequences:

Figure 6.2: Anchors for common sequences.

The following provides examples to show how to use the anchor syntax to find and replace sequences. For information on the gsub function used in this example visit the main regex functions section.
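The sketch below illustrates a few of the common short-hand sequences (\\d, \\D, \\s, and \\W) with gsub(). The string x is a made-up example, so treat it as an assumption rather than the original data.

```r
# Illustrative string with letters, digits, spaces, and punctuation
x <- "Room 42, Floor 7"

# \\d matches any digit
digits_masked <- gsub("\\d", "#", x)
digits_masked
# "Room ##, Floor #"

# \\D matches any non-digit, so removing it leaves only the digits
digits_only <- gsub("\\D", "", x)
digits_only
# "427"

# \\s matches any whitespace character
spaces_replaced <- gsub("\\s", "_", x)
spaces_replaced
# "Room_42,_Floor_7"

# \\W matches any non-word character (anything except letters, digits, and _)
word_chars_only <- gsub("\\W", "", x)
word_chars_only
# "Room42Floor7"
```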

6.1.3 Character classes

To match one of several characters in a specified set we can enclose the characters of concern with square brackets [ ]. In addition, to match any characters not in a specified character set we can include the caret ^ at the beginning of the set within the brackets. The following displays the general syntax for common character classes but these can be altered easily as shown in the examples that follow:

Figure 6.3: Anchors for common character classes.

The following provides examples to show how to use the anchor syntax to match character classes. For information on the grep function used in this example visit the main regex functions section.
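A short sketch of bracketed character classes with grep() follows. The vector x is a made-up example; value = TRUE asks grep() to return the matching elements rather than their indices.

```r
# Illustrative character vector
x <- c("RStudio", "v.0.99.484", "2015", "09-22-2015", "grep vs. grepl")

# Elements containing at least one digit
has_digit <- grep("[0-9]", x, value = TRUE)
has_digit
# "v.0.99.484" "2015" "09-22-2015"

# Elements containing a digit between 6 and 9
has_high_digit <- grep("[6-9]", x, value = TRUE)
has_high_digit
# "v.0.99.484" "09-22-2015"

# Elements containing an upper- or lower-case "r"
has_r <- grep("[Rr]", x, value = TRUE)
has_r
# "RStudio" "grep vs. grepl"

# A caret inside the brackets negates the set:
# elements containing at least one character that is NOT a digit
has_non_digit <- grep("[^0-9]", x, value = TRUE)
has_non_digit
# "RStudio" "v.0.99.484" "09-22-2015" "grep vs. grepl"
```

Note that the negated class does not mean "contains no digits"; it matches any element that has at least one non-digit character, which is why only "2015" is excluded.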

6.1.4 POSIX character classes

Closely related to regex character classes are POSIX character classes, which are expressed in double brackets [[ ]], for example [[:digit:]].

Figure 6.4: POSIX Character Classes.

The following provides examples to show how to use the anchor syntax to match POSIX character classes. For information on the grep function used in this example visit the main regex functions section.
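The following sketch applies a few POSIX classes ([[:punct:]], [[:digit:]], [[:space:]], and [[:upper:]]) to made-up inputs. The strings are illustrative assumptions, not the original example data.

```r
# Illustrative string
x <- "Call me at 555-1234, ok?"

# Remove all punctuation characters
no_punct <- gsub("[[:punct:]]", "", x)
no_punct
# "Call me at 5551234 ok"

# Remove all digits
no_digits <- gsub("[[:digit:]]", "", x)
no_digits
# "Call me at -, ok?"

# Remove all whitespace
no_space <- gsub("[[:space:]]", "", x)
no_space
# "Callmeat555-1234,ok?"

# Test which elements contain an upper-case letter
has_upper <- grepl("[[:upper:]]", c("abc", "aBc"))
has_upper
# FALSE TRUE
```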

6.1.5 Quantifiers

When we want to match a certain number of characters that meet certain criteria we can apply quantifiers to our pattern searches. The quantifiers we can use are:

Figure 6.5: Quantifiers in R.

The following provides examples to show how to use the quantifier syntax to match a certain number of character patterns. For information on the grep function used in this example visit the main regex functions section. Note that state.name is a built-in dataset within R that contains all the U.S. state names.
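A minimal sketch of quantifiers using the built-in state.name vector follows. The final example uses a small made-up vector to illustrate the optional quantifier ?.

```r
# "+" means one or more: states containing at least one "z"
one_or_more_z <- grep("z+", state.name, value = TRUE)
one_or_more_z
# "Arizona"

# "{2}" means exactly two in a row: states containing "ss"
double_s <- grep("s{2}", state.name, value = TRUE)
double_s
# "Massachusetts" "Mississippi" "Missouri" "Tennessee"

# "{1,2}" means one or two in a row: here, any state with at least one "s"
one_or_two_s <- grep("s{1,2}", state.name, value = TRUE)
length(one_or_two_s) >= length(double_s)

# "?" means the preceding item is optional (zero or one occurrence)
optional_u <- grepl("colou?r", c("color", "colour", "colr"))
optional_u
# TRUE TRUE FALSE
```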

6.2 Regex Functions in Base R

R contains a set of functions in the base package that we can use to find pattern matches. Alternatively, the R package stringr also provides several functions for regex operations. This section covers the base R functions that provide pattern finding, pattern replacement, and string splitting capabilities.

6.2.1 Pattern Finding Functions

There are five functions that provide pattern matching capabilities. I provide examples for the three most common: grep(), grepl(), and regexpr(). The two other functions, which I do not illustrate, are gregexpr() and regexec(); they provide capabilities similar to regexpr() but return their output in list form.

  • Pattern matching with values or indices as outputs
  • Pattern matching with logical (TRUE/FALSE) outputs
  • Identifying the location in the string where the pattern exists
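The first two capabilities in the list correspond to grep() and grepl(). As a minimal sketch on a made-up vector (x is illustrative only):

```r
# Illustrative character vector
x <- c("apple", "banana", "cherry", "date")

# grep() returns the indices of the elements that match by default
idx <- grep("a", x)
idx
# 1 2 4

# With value = TRUE, grep() returns the matching elements themselves
vals <- grep("a", x, value = TRUE)
vals
# "apple" "banana" "date"

# grepl() returns a logical vector, one entry per input element
has_e <- grepl("e", x)
has_e
# TRUE FALSE TRUE TRUE
```

Because grepl() always returns a vector the same length as its input, it is convenient for subsetting, as in x[grepl("e", x)].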

6.2.1.3 regexpr( )

To find exactly where the pattern exists in a string use regexpr():

The output of regexpr() can be interpreted as follows. The main values provide the starting position of the first match in each element; a value of -1 means there is no match. The attribute “match.length” provides the length of each match (again -1 for no match). The attribute “useBytes” indicates whether matching was done byte-by-byte (TRUE) or character-by-character (FALSE).
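As a minimal sketch, the snippet below runs regexpr() on a made-up vector and pulls apart the pieces of its output. The vector x is illustrative; regmatches() is a base R helper for extracting the matched substrings.

```r
# Illustrative character vector
x <- c("apple", "banana", "cherry")

# Locate the first occurrence of "an" in each element
m <- regexpr("an", x)

# Starting positions (-1 where there is no match)
starts <- as.integer(m)
starts
# -1 2 -1

# Length of each match (-1 where there is no match)
lens <- attr(m, "match.length")
lens
# -1 2 -1

# Extract the matched text; elements without a match are dropped
matched <- regmatches(x, m)
matched
# "an"
```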

6.3 Regex Functions with stringr

Similar to basic string manipulation, the stringr package also offers regex functionality. In some cases stringr performs the same functions as certain base R functions but with more consistent syntax. In other cases stringr offers additional functionality that is not available in the base R functions. The stringr functions we’ll cover focus on detecting, locating, extracting, and replacing patterns along with string splitting.

6.3.2 Locating Patterns

To locate the occurrences of patterns stringr offers two options: i) locate the first matching occurrence or ii) locate all occurrences. To locate the position of the first occurrence of a pattern in a string vector use str_locate(). The output provides the starting and ending position of the first match found within each element.

To locate the positions of all pattern match occurrences in a character vector use str_locate_all(). The output provides a list the same length as the number of elements in the vector. Each list item will provide the starting and ending positions for each pattern match occurrence in its respective element.
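A minimal sketch of both locating functions follows, assuming the stringr package is installed. The vector x is a made-up example.

```r
library(stringr)

# Illustrative character vector
x <- c("banana", "apple", "kiwi")

# str_locate() returns a matrix with one row per element and
# "start" and "end" columns for the first match (NA if no match)
first_loc <- str_locate(x, "a")
first_loc

# str_locate_all() returns a list with one matrix per element,
# with one row per match found in that element
all_loc <- str_locate_all(x, "a")
all_loc[[1]]   # three rows: "a" occurs at positions 2, 4, and 6 of "banana"
all_loc[[3]]   # zero rows: "kiwi" contains no "a"
```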

6.3.3 Extracting Patterns

For extracting the part of a string that matches a pattern, stringr offers two primary options: i) extract the first matching occurrence or ii) extract all occurrences. To extract the first occurrence of a pattern in a character vector use str_extract(). The output will be the same length as the input vector, and if no match is found the output will be NA for that element.

To extract all occurrences of a pattern in a character vector use str_extract_all(). The output provides a list the same length as the number of elements in the vector. Each list item will provide the matching pattern occurrences within its respective vector element.
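The sketch below contrasts the two extraction functions, assuming the stringr package is installed. The vector x is a made-up example.

```r
library(stringr)

# Illustrative character vector
x <- c("order 66", "call 555 or 911", "none")

# First run of digits in each element; NA where nothing matches
first_num <- str_extract(x, "[0-9]+")
first_num
# "66" "555" NA

# Every run of digits in each element, returned as a list
all_nums <- str_extract_all(x, "[0-9]+")
all_nums[[2]]
# "555" "911"
all_nums[[3]]
# character(0)
```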

6.3.4 Replacing Patterns

For replacing a pattern within a string, stringr offers two options: i) replace the first matching occurrence or ii) replace all occurrences. To replace the first occurrence of a pattern in a character vector use str_replace(). This function is the stringr counterpart to base R's sub().

To replace all occurrences of a pattern in a character vector use str_replace_all(). This function is the stringr counterpart to base R's gsub().
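A minimal sketch of the two replacement functions, assuming the stringr package is installed, is shown next to their base R counterparts. The vector x is a made-up example.

```r
library(stringr)

# Illustrative character vector
x <- c("banana", "apple")

# Replace only the first "a" in each element
first_replaced <- str_replace(x, "a", "A")
first_replaced
# "bAnana" "Apple"

# Replace every "a" in each element
all_replaced <- str_replace_all(x, "a", "A")
all_replaced
# "bAnAnA" "Apple"

# For this simple pattern, the base R functions give the same results;
# note that the argument order differs (pattern first, string last)
base_first <- sub("a", "A", x)
base_all <- gsub("a", "A", x)
```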

6.4 Additional Resources

Character string data are often considered semi-structured data. Text can be structured in a specified field; however, the quality and consistency of the text input can be far from structured. Consequently, managing and manipulating character strings can be extremely tedious and unique to each data wrangling process. As a result, taking the time to learn the nuances of dealing with character strings and regex functions can provide a great return on investment; however, the functions and techniques required will likely be greater than what I could offer here. So here are additional resources that are worth reading and learning from:

6.5 Exercises

TBD
