Preface

This is the online version of the work-in-progress 2nd edition of Data Wrangling with R.

Welcome to the second edition of Data Wrangling with R! In this book, I will help you learn the essentials of preprocessing data with the R programming language so that you can quickly and easily turn noisy data into usable information. Data wrangling, also commonly referred to as data munging, transformation, manipulation, or janitor work, can be a painstakingly laborious process. In fact, it has been stated that up to 80% of data analysis is spent cleaning and preparing data (Wickham 2014; Dasu and Johnson 2003). However, because data wrangling is a prerequisite to the rest of the data analysis workflow (visualization, modeling, reporting), it’s essential that you become fluent and efficient in data wrangling techniques.

This book will guide you through the data wrangling process and give you a solid foundation in the basics of working with data in R. My goal is to teach you how to easily wrangle your data so you can spend more time understanding its content through visualization, modeling, and reporting. By the time you finish reading this book, you will have learned how to work with the different data types and structures, acquire and parse data from locations you may not have been able to access before, manage control structures, implement efficient workflows, reshape and transform your data, and even perform tasks beyond the borders of your laptop. In essence, you will have the data wrangling toolbox required for modern-day data analysis.

Who should read this

This book is meant to establish the baseline R vocabulary and knowledge for the primary data wrangling processes. That covers a wide range of programming activities, from understanding basic data objects in R to writing your own functions, applying loops, and web scraping. As a result, this book can be beneficial to R programmers at all levels. Beginner R programmers will gain a basic understanding of the functionality of R and learn how to work with data using R. Intermediate and advanced R programmers will likely find that the early chapters reiterate established knowledge; however, they will benefit from the middle and later chapters by learning newer or more efficient data wrangling techniques.

What you need for this book

To gain and retain knowledge from this book, I highly recommend that you follow along and practice the code examples yourself. This book assumes that you will be performing data wrangling in R, so you should have, or plan to have, R installed on your computer. You will find the latest version of R for Linux, macOS, and Windows at cran.r-project.org/. I also recommend using an integrated development environment (IDE), as it will greatly simplify and organize your coding environment. There are several to choose from; however, I highly recommend RStudio.
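As a quick check that your setup is ready, the following minimal sketch prints the R version you are running and installs any missing packages. The package names are taken from the "Software information" section of this preface; the helper `missing_packages()` is a hypothetical name introduced here for illustration and is not part of any package.

```r
# Packages listed in this book's "Software information" section
pkgs <- c("completejourney", "tidyverse")

# Hypothetical helper (not from any package): which of `wanted`
# are absent from the character vector `installed`?
missing_packages <- function(wanted,
                             installed = rownames(utils::installed.packages())) {
  setdiff(wanted, installed)
}

# Print the version of R you are running
print(R.version.string)

# Install only what is missing. This needs an internet connection,
# so it is guarded to run in interactive sessions only.
to_install <- missing_packages(pkgs)
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```

Checking for missing packages first avoids reinstalling packages you already have each time the script runs.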

Conventions used in this book

The following typographical conventions are used in this book:

  • strong italic: indicates new terms
  • bold: indicates package and file names
  • inline code: monospaced, highlighted text that indicates functions or other commands that could be typed literally by the user
  • code chunk: indicates commands or other text that could be typed literally by the user

In addition to the general text used throughout, you will notice the following code chunks with images:

Signifies a tip or suggestion

Signifies a general note

Signifies a warning or caution

Feedback

Reader comments are greatly appreciated. To report errors or bugs, please post an issue at https://github.com/bradleyboehmke/dw-r/issues.

Acknowledgments

I’d like to thank everyone who contributed feedback, typo corrections, and discussions while the book was being written. TBD.

Software information

This book was built with the following packages and R version. All code was executed on a 2017 MacBook Pro with a 2.9 GHz Intel Core i7 processor and 16 GB of 2133 MHz double data rate synchronous dynamic random access memory (DDR3).

# packages used
pkgs <- c(
  "completejourney",
  "tidyverse"
)

# package & session info
sessioninfo::session_info(pkgs)
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 4.0.0 (2020-04-24)
#>  os       Ubuntu 16.04.6 LTS          
#>  system   x86_64, linux-gnu           
#>  ui       X11                         
#>  language en_US.UTF-8                 
#>  collate  en_US.UTF-8                 
#>  ctype    en_US.UTF-8                 
#>  tz       UTC                         
#>  date     2020-05-29                  
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  ! package         * version  date       lib source        
#>    assertthat        0.2.1    2019-03-21 [1] CRAN (R 4.0.0)
#>    BH                1.72.0-3 2020-01-08 [1] CRAN (R 4.0.0)
#>    cli               2.0.2    2020-02-28 [1] CRAN (R 4.0.0)
#>    completejourney   1.1.0    2019-09-28 [1] CRAN (R 4.0.0)
#>    crayon            1.3.4    2017-09-16 [1] CRAN (R 4.0.0)
#>    curl              4.3      2019-12-02 [1] CRAN (R 4.0.0)
#>    digest            0.6.25   2020-02-23 [1] CRAN (R 4.0.0)
#>    dplyr             0.8.5    2020-03-07 [1] CRAN (R 4.0.0)
#>    ellipsis          0.3.1    2020-05-15 [1] CRAN (R 4.0.0)
#>    fansi             0.4.1    2020-01-08 [1] CRAN (R 4.0.0)
#>    glue              1.4.1    2020-05-13 [1] CRAN (R 4.0.0)
#>    hms               0.5.3    2020-01-08 [1] CRAN (R 4.0.0)
#>    lifecycle         0.2.0    2020-03-06 [1] CRAN (R 4.0.0)
#>    magrittr          1.5      2014-11-22 [1] CRAN (R 4.0.0)
#>    pillar            1.4.4    2020-05-05 [1] CRAN (R 4.0.0)
#>    pkgconfig         2.0.3    2019-09-22 [1] CRAN (R 4.0.0)
#>    plogr             0.2.0    2018-03-25 [1] CRAN (R 4.0.0)
#>    prettyunits       1.1.1    2020-01-24 [1] CRAN (R 4.0.0)
#>    progress          1.2.2    2019-05-16 [1] CRAN (R 4.0.0)
#>    purrr             0.3.4    2020-04-17 [1] CRAN (R 4.0.0)
#>    R6                2.4.1    2019-11-12 [1] CRAN (R 4.0.0)
#>    Rcpp              1.0.4.6  2020-04-09 [1] CRAN (R 4.0.0)
#>    rlang             0.4.6    2020-05-02 [1] CRAN (R 4.0.0)
#>    stringi           1.4.6    2020-02-17 [1] CRAN (R 4.0.0)
#>    stringr           1.4.0    2019-02-10 [1] CRAN (R 4.0.0)
#>    tibble            3.0.1    2020-04-20 [1] CRAN (R 4.0.0)
#>    tidyselect        1.1.0    2020-05-11 [1] CRAN (R 4.0.0)
#>  R tidyverse         <NA>     <NA>       [?] <NA>          
#>    utf8              1.1.4    2018-05-24 [1] CRAN (R 4.0.0)
#>    vctrs             0.3.0    2020-05-11 [1] CRAN (R 4.0.0)
#>    zeallot           0.1.0    2018-01-28 [1] CRAN (R 4.0.0)
#> 
#> [1] /home/travis/R/Library
#> [2] /usr/local/lib/R/site-library
#> [3] /home/travis/R-bin/lib/R/library
#> 
#>  R ── Package was removed from disk.
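If you want to compare your own installation against the versions listed above, something along the following lines may help. It is only a sketch: the three version numbers are copied from the listing above, `at_least()` is a hypothetical helper introduced here, and a newer version does not guarantee identical behavior.

```r
# Versions copied from the session info listing above
built_with <- c(dplyr = "0.8.5", purrr = "0.3.4", tibble = "3.0.1")

# Hypothetical helper: TRUE when `installed` is at least `required`.
# utils::compareVersion() returns -1, 0, or 1.
at_least <- function(installed, required) {
  utils::compareVersion(installed, required) >= 0
}

# Compare each installed package with the version the book was built with
for (pkg in names(built_with)) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    installed <- as.character(utils::packageVersion(pkg))
    status <- if (at_least(installed, built_with[[pkg]])) "ok" else "older"
    message(pkg, ": installed ", installed,
            ", book used ", built_with[[pkg]], " (", status, ")")
  } else {
    message(pkg, ": not installed")
  }
}
```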

References

Dasu, Tamraparni, and Theodore Johnson. 2003. Exploratory Data Mining and Data Cleaning. Vol. 479. John Wiley & Sons.

Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59 (10). http://www.jstatsoft.org/v59/i10/.