5 Dealing with Character Strings

Dealing with character strings is often under-emphasized in data analysis training. The focus typically remains on numeric values; however, as data collection grows, more and more information is embedded in character strings. Consequently, handling, cleaning, and processing character strings is becoming a prerequisite for day-to-day data analysis. This chapter gives you a foundation for working with characters: it covers some basics and then shows how to manipulate strings with base R functions and with the stringr package, which simplifies many of these tasks.

5.1 Character string basics

Character string basics include how to create, convert, and print character strings, along with how to count the number of elements and characters in a string.

5.1.2 Converting to Strings

Test whether an object is a character vector with is.character(), and convert an object to character with as.character() or toString().
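As a minimal sketch of these functions (the object names x and y are illustrative, not from the original text), note that toString() collapses a vector into a single comma-separated string, whereas as.character() converts element by element:

```r
# Illustrative objects; the names are hypothetical
x <- "text"
y <- 42

is.character(x)   # TRUE
is.character(y)   # FALSE

y_chr <- as.character(y)        # "42"
v_chr <- toString(c(1, 2, 3))   # "1, 2, 3" (a single string)
```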

5.1.3 Printing Strings

The common printing methods include:

  • print(): generic printing
  • noquote(): print with no quotes
  • cat(): concatenate and print with no quotes
  • sprintf(): a wrapper for the C function sprintf, that returns a character vector containing a formatted combination of text and variable values

The primary printing function in R is print().
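For instance, with an illustrative string x (a name assumed here for demonstration):

```r
x <- "learning to print strings"

# print() shows the element index and surrounds the string with quotes
print(x)
# [1] "learning to print strings"
```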

To print a string without quotes, an alternative is noquote().
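A short sketch, again using an illustrative string x:

```r
x <- "learning to print strings"

# noquote() drops the quotes but keeps the leading [1] index
noquote(x)
# [1] learning to print strings
```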

Another very useful function is cat(), which concatenates objects and prints them either on screen or to a file. The output is very similar to that of noquote(); however, cat() does not print the leading element index (such as [1]). As a result, cat() can be useful for printing nicely formatted responses to users.
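The following sketch (with an illustrative string x) shows the difference. Because cat() does not end the line on its own, a newline is passed explicitly:

```r
x <- "learning to print strings"

# No quotes and no [1] index; "\n" ends the line
cat(x, "\n")
# learning to print strings

# Several objects are concatenated, separated by a space by default
cat("The alphabet starts with", letters[1:3], "\n")
# The alphabet starts with a b c
```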

You can also control the line width when printing long strings with cat()'s fill argument:
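A hedged sketch with three illustrative sentences follows. As far as I understand the behavior, cat() breaks lines only between the objects it is given, not inside a single string, so with fill = 30 each of these long sentences ends up on its own line:

```r
x <- "Today I am learning how to print strings."
y <- "Tomorrow I plan to learn about textual analysis."
z <- "The day after I will take a break."

# Default: everything is printed on one line
cat(x, y, z, "\n")

# fill = 30: start a new line whenever the next object would exceed the width;
# a trailing newline is added automatically
cat(x, y, z, fill = 30)
```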

sprintf() is a useful printing function for precise control of the output. It is a wrapper for the C function sprintf and returns a character vector containing a formatted combination of text and variable values.

To substitute in a string or string variable, use %s:
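For example, with an illustrative variable x:

```r
x <- "print strings"

# %s is replaced by the value of x
msg <- sprintf("Learning to %s in R", x)
msg
# [1] "Learning to print strings in R"
```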

For integers, use %d or a variant:
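A brief sketch with illustrative values; the "%05d" variant pads the number with zeros to a width of five:

```r
version <- 3L   # the L suffix makes this an integer

a <- sprintf("This is R version: %d", version)
b <- sprintf("%05d", 42L)

a   # "This is R version: 3"
b   # "00042"
```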

For floating-point numbers, use %f for standard notation, and %e or %E for exponential notation:
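The sketch below formats the built-in constant pi in each notation; a precision such as ".3" limits the number of decimal places:

```r
f_default <- sprintf("%f", pi)     # "3.141593" (six decimals by default)
f_three   <- sprintf("%.3f", pi)   # "3.142"
e_lower   <- sprintf("%e", pi)     # "3.141593e+00"
e_upper   <- sprintf("%E", pi)     # "3.141593E+00"
```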

5.2 String manipulation with base R

Basic string manipulation typically includes case conversion, simple character replacement, abbreviating, substring replacement, adding/removing whitespace, and performing set operations to compare similarities and differences between two character vectors. These operations can all be performed with base R functions; however, some operations (or at least their syntax) are greatly simplified with the stringr package. This section illustrates base R string manipulation for case conversion, simple character replacement, abbreviating, and substring replacement. Many of the other fundamental string manipulation tasks are covered in Sections 5.3 and 5.4.

5.2.1 Case conversion

To convert all upper case characters to lower case use tolower():
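For example, with an illustrative mixed-case string:

```r
x <- "Learning To MANIPULATE strinGS in R"

lower <- tolower(x)
lower
# [1] "learning to manipulate strings in r"
```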

To convert all lower case characters to upper case use toupper():
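Using the same illustrative string:

```r
x <- "Learning To MANIPULATE strinGS in R"

upper <- toupper(x)
upper
# [1] "LEARNING TO MANIPULATE STRINGS IN R"
```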

5.2.2 Simple Character Replacement

To replace a character (or multiple characters) in a string you can use chartr():
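A minimal sketch with illustrative strings: each character in old is mapped to the character at the same position in new, so old and new should have matching lengths:

```r
x <- "This is A string."
one <- chartr(old = "A", new = "a", x)
one
# [1] "This is a string."

# Multiple replacements: r -> R, m -> M, a -> A
y <- "Tomorrow I plan to learn about textual analysis."
many <- chartr(old = "rma", new = "RMA", y)
many
# [1] "ToMoRRow I plAn to leARn About textuAl AnAlysis."
```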

Note that chartr() replaces every occurrence of the specified characters, so I use it only when I am certain that I want to change every possible occurrence of a letter.

5.2.3 String Abbreviations

To abbreviate strings you can use abbreviate():
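The sketch below uses a made-up vector of street names. The exact abbreviations depend on abbreviate()'s internal algorithm, so no expected output is shown; run it to inspect the results:

```r
# Hypothetical street names for illustration
streets <- c("Main", "Elm", "Riverbend", "Mario", "Frederick")

# Default minimum length is 4 characters
abbr_default <- abbreviate(streets)

# Request shorter abbreviations; results may be longer than minlength
# when that is needed to keep them unique
abbr_short <- abbreviate(streets, minlength = 2)

abbr_default
abbr_short
```

By default the result is a named vector whose names are the original strings.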

Note that if you are working with U.S. states, R already has a pre-built vector with state names (state.name). Also, there is a pre-built vector of abbreviated state names (state.abb).
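Both vectors ship with R's datasets package and are in the same order, so one can be used to look up the other. A small sketch (the lookup of "Ohio" is just an illustration):

```r
head(state.name, 3)
# [1] "Alabama" "Alaska"  "Arizona"

head(state.abb, 3)
# [1] "AL" "AK" "AZ"

# Look up the abbreviation for a given state name
ohio_abb <- state.abb[state.name == "Ohio"]
ohio_abb
# [1] "OH"
```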

5.2.4 Extract/Replace Substrings

To extract or replace substrings in a character vector there are three primary base R functions to use: substr(), substring(), and strsplit(). The purpose of substr() is to extract and replace substrings with specified starting and stopping positions:
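A short sketch using the built-in LETTERS constant collapsed into one string (the name alphabet is illustrative):

```r
alphabet <- paste(LETTERS, collapse = "")

# Extract the 18th through 24th characters
piece <- substr(alphabet, start = 18, stop = 24)
piece
# [1] "RSTUVWX"

# Replace the 19th through 24th characters in place
substr(alphabet, start = 19, stop = 24) <- "ssssss"
alphabet
# [1] "ABCDEFGHIJKLMNOPQRssssssYZ"
```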

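The purpose of substring() is to extract and replace substrings with only a specified starting point; by default the substring runs to the end of the string. substring() also accepts a vector of starting points, which lets you extract or replace one substring per starting position: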
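For example, again with an illustrative alphabet string:

```r
alphabet <- paste(LETTERS, collapse = "")

# From the 18th character to the end
tail_part <- substring(alphabet, first = 18)
tail_part
# [1] "RSTUVWXYZ"

# One substring per starting position, each ending at position 24
pieces <- substring(alphabet, first = 18:24, last = 24)
pieces
# [1] "RSTUVWX" "STUVWX"  "TUVWX"   "UVWX"    "VWX"     "WX"      "X"
```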

To split the elements of a character string use strsplit():
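A minimal sketch with an illustrative string of hyphen-separated state names:

```r
a <- "Alabama-Alaska-Arizona-Arkansas-California"

# Split on the hyphen; the result is a list with one element per input string
parts <- strsplit(a, split = "-")
parts
# [[1]]
# [1] "Alabama"    "Alaska"     "Arizona"    "Arkansas"   "California"
```

Keep in mind that split is interpreted as a regular expression unless fixed = TRUE is supplied.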

Note that the output of strsplit() is a list. To convert the output to a simple atomic vector simply wrap in unlist():
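Continuing with the same illustrative string:

```r
a <- "Alabama-Alaska-Arizona-Arkansas-California"

# unlist() flattens the list into a plain character vector
states <- unlist(strsplit(a, split = "-"))
states
# [1] "Alabama"    "Alaska"     "Arizona"    "Arkansas"   "California"
```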

5.3 String manipulation with stringr

The stringr package was developed by Hadley Wickham to provide simple wrappers that make R’s string functions more consistent, simpler, and easier to use. To replicate the examples in this section you will need to install and load the stringr package:
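A typical setup looks like the following; the installation line is commented out because it only needs to be run once:

```r
# Install from CRAN (run once)
# install.packages("stringr")

# Load the package for the current session
library(stringr)
```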

5.3.1 Basic Operations

There are three string functions that are closely related to their base R equivalents, but with a few enhancements:

  • Concatenate with str_c()
  • Number of characters with str_length()
  • Substring with str_sub()

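str_c() is similar to the paste() functions. Like paste0(), it uses no separator by default; unlike paste(), it returns NA when any input is missing rather than converting NA to the string "NA":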
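The sketch below (assuming stringr is installed) compares str_c() with its base R counterparts on a few illustrative inputs:

```r
library(stringr)

# With an explicit separator, str_c() and paste() agree
s1 <- str_c("Learning", "to", "use", "the", "stringr", "package", sep = " ")
s1
# [1] "Learning to use the stringr package"

# Without a separator, str_c() behaves like paste0()
s2 <- str_c("a", "b")    # "ab"
p2 <- paste0("a", "b")   # "ab"

# Missing values: paste() turns NA into "NA", str_c() propagates NA
p3 <- paste("a", NA)     # "a NA"
s3 <- str_c("a", NA)     # NA
```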

str_length() is similar to the nchar() function; however, str_length() behaves more consistently with missing (‘NA’) values. For example, nchar(NA) returns 2, whereas str_length(NA) returns NA:
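A short sketch, assuming stringr is installed; the vector text is illustrative:

```r
library(stringr)

# A bare NA is logical, so nchar() counts the two characters of "NA"
n_base <- nchar(NA)         # 2
n_str  <- str_length(NA)    # NA

# In a character vector, str_length() keeps missing values as NA
text <- c("Learning", "to", NA, "use")
lens <- str_length(text)
lens
# [1]  8  2 NA  3
```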

str_sub() is similar to substr(); however, it returns a zero-length vector if any of its inputs are zero length, and otherwise expands each argument to match the longest. It also accepts negative positions, which are counted backward from the last character (so -1 is the last character).
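The following sketch (assuming stringr is installed, with an illustrative string x) shows extraction with positive and negative positions, followed by in-place replacement:

```r
library(stringr)

x <- "Learning to use the stringr package"

# Positive positions count from the start
first_part <- str_sub(x, start = 1, end = 15)
first_part
# [1] "Learning to use"

# Negative positions count backward from the end
last_word <- str_sub(x, start = -7, end = -1)
last_word
# [1] "package"

# Replace the first 15 characters
str_sub(x, end = 15) <- "I know how to use"
x
# [1] "I know how to use the stringr package"
```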

5.3.4 Pad a String with Whitespace

To add whitespace, or to pad a string, use str_pad(). You can also use str_pad() to pad a string with specified characters.
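A minimal sketch, assuming stringr is installed; width is the total length of the padded string and the input "beer" is illustrative:

```r
library(stringr)

left  <- str_pad("beer", width = 10, side = "left")
both  <- str_pad("beer", width = 10, side = "both")
bangs <- str_pad("beer", width = 10, side = "right", pad = "!")

left    # "      beer"
both    # "   beer   "
bangs   # "beer!!!!!!"
```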

5.4 Set operations for character strings

There are also base R functions that allow you to assess the set union, intersection, difference, equality, and membership of two vectors. I also cover sorting character strings.

5.4.2 Set Intersection

To obtain the common elements of two character vectors use intersect():
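For example, with two illustrative vectors (set_1 and set_2 are hypothetical names):

```r
set_1 <- c("lagunitas", "bells", "dogfish", "summit", "odell")
set_2 <- c("sierra", "bells", "harpoon", "lagunitas", "founders")

common <- intersect(set_1, set_2)
common
# [1] "lagunitas" "bells"
```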

5.4.3 Identifying Different Elements

To obtain the non-common elements, or the difference, of two character vectors use setdiff():
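Note that setdiff() is not symmetric: it returns the elements of its first argument that are absent from the second. A sketch with illustrative vectors:

```r
set_1 <- c("lagunitas", "bells", "dogfish", "summit", "odell")
set_2 <- c("sierra", "bells", "harpoon", "lagunitas", "founders")

# Elements in set_1 that are not in set_2
only_1 <- setdiff(set_1, set_2)
only_1
# [1] "dogfish" "summit"  "odell"

# Elements in set_2 that are not in set_1
only_2 <- setdiff(set_2, set_1)
only_2
# [1] "sierra"   "harpoon"  "founders"
```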

5.4.4 Testing for Element Equality

To test if two vectors contain the same elements regardless of order use setequal():
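A brief sketch with illustrative vectors:

```r
set_3 <- c("woody", "buzz", "rex")
set_4 <- c("woody", "andy", "buzz")
set_5 <- c("andy", "buzz", "woody")

eq_34 <- setequal(set_3, set_4)   # FALSE: different elements
eq_45 <- setequal(set_4, set_5)   # TRUE: same elements, different order
```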

5.4.5 Testing for Exact Equality

To test if two character vectors are equal in content and order use identical():
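In contrast to setequal(), order matters here. A sketch with illustrative vectors:

```r
set_6 <- c("woody", "andy", "buzz")
set_7 <- c("andy", "buzz", "woody")
set_8 <- c("woody", "andy", "buzz")

id_67 <- identical(set_6, set_7)   # FALSE: same elements, different order
id_68 <- identical(set_6, set_8)   # TRUE: same elements, same order
```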

5.4.6 Identifying if Elements are Contained in a String

To test if an element is contained within a character vector use is.element() or %in%:
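Both forms return the same result. A minimal sketch with illustrative values:

```r
set_8 <- c("woody", "andy", "buzz")
good <- "andy"
bad  <- "sid"

in_1 <- is.element(good, set_8)   # TRUE
in_2 <- good %in% set_8           # TRUE
in_3 <- bad %in% set_8            # FALSE
```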

5.4.7 Sorting a String

To sort a character vector use sort():
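For example, with an illustrative vector (the ordering of mixed-case or non-ASCII strings can depend on your locale, so this sketch sticks to lowercase letters):

```r
set_8 <- c("woody", "andy", "buzz")

asc <- sort(set_8)
asc
# [1] "andy"  "buzz"  "woody"

desc <- sort(set_8, decreasing = TRUE)
desc
# [1] "woody" "buzz"  "andy"
```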

5.5 Exercises

TBD