2 Introduction to the R Language

A language for data analysis and graphics. This definition of R was used by Ross Ihaka and Robert Gentleman in the title of their 1996 paper (Ihaka and Gentleman 1996) outlining their experience of designing and implementing the R software. It’s safe to say this remains the essence of what R is; however, it’s tough to encapsulate such a diverse programming language in a single phrase. During the last decade, the R programming language has become one of the most widely used tools for statistics and data science. Its applications run the gamut from data preprocessing, cleaning, web scraping and visualization to a wide range of analytic tasks such as computational statistics, econometrics, optimization, natural language processing and more (Boehmke and Greenwell 2019). In 2012 R had over 2 million users, and the language has continued to grow consistently year over year¹. R has become essential analytic software throughout industry, used by organizations such as Google, Facebook, the New York Times, Twitter, Etsy, and the Department of Defense, and even in presidential political campaigns. So what makes R such a popular tool?

2.1 Open Source

R is open source software created over 20 years ago by Ihaka and Gentleman at the University of Auckland, New Zealand. However, its history is even longer, as its lineage goes back to the S programming language created by John Chambers at Bell Labs in the 1970s². R is actually a combination of S with lexical scoping semantics inspired by Scheme (Morandat et al. 2012). Whereas the resulting language is very similar in appearance to S, the underlying implementation and semantics are derived from Scheme. Unbeknownst to many, the S language has been a popular vehicle for research in statistical methodology, and R provides an open source route to participate in that activity.
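To make the phrase "lexical scoping" concrete, here is a minimal sketch in R. The function and variable names (show_x, call_with_local_x, make_counter) are hypothetical and chosen only for illustration. Under lexical scoping, a function looks up free variables in the environment where it was defined, not in the environment of whoever calls it.

```r
# A free variable is resolved where the function was DEFINED.
x <- "global"

show_x <- function() x          # 'x' is free here; defined at top level

call_with_local_x <- function() {
  x <- "local"                  # this local 'x' is invisible to show_x()
  show_x()
}

call_with_local_x()             # returns "global", not "local"

# The same rule lets a function remember the environment it was created in
# (a closure). Each call to make_counter() creates a separate 'count'.
make_counter <- function() {
  count <- 0
  function() {
    count <<- count + 1         # '<<-' updates 'count' in the enclosing environment
    count
  }
}

counter_a <- make_counter()
counter_b <- make_counter()

counter_a()                     # 1
counter_a()                     # 2
counter_b()                     # 1 (counter_b has its own 'count')
```

The counter example shows why this design matters in practice: state can be kept private to a function without resorting to global variables.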

Although the history of S and R is interesting³, the key point is that R is open source software. Open-source software⁴ such as R blurs the distinction between developer and user, which provides the ability to extend and modify the analytic functionality to your, or your organization’s, needs. The data analysis process is rarely restricted to just a handful of tasks with predictable inputs and outputs that can be pre-defined by a fixed user interface, as is common in proprietary software. Rather, as previously mentioned in the introduction, data analyses include unique, different, and often multiple requirements regarding the specific tasks involved. Open source software allows more flexibility for you, the data scientist, to manage how data are being transformed, manipulated, and modeled “under the hood” of software rather than relying on “stiff” point and click software interfaces. Open source also allows you to operate on every major platform rather than be restricted to what your personal budget allows or the idiosyncratic purchases of organizations.
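As a small illustration of looking "under the hood", the sketch below prints the source of the built-in sd() function and then defines a modified variant. The name sd_population and the sample vector are hypothetical, introduced only for this example; the variant divides by n rather than n - 1.

```r
# Printing a function object (no parentheses after the name) shows its source.
print(sd)

# Hypothetical variant: population standard deviation, dividing by n
# instead of the n - 1 used by the sample standard deviation.
sd_population <- function(x, na.rm = FALSE) {
  if (na.rm) x <- x[!is.na(x)]
  sqrt(sum((x - mean(x))^2) / length(x))
}

x <- c(2, 4, 4, 4, 5, 5, 7, 9)

sd(x)              # sample standard deviation (divides by n - 1)
sd_population(x)   # 2
```

Nothing here requires special access: any user can read the existing definition and adapt it to their own needs.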

2.2 Flexibility

Another benefit of open source is that anybody can access the source code, modify and improve it. As a result, many excellent programmers contribute to improving existing R code and developing new capabilities. Researchers from all walks of life (academic institutions, industry, and organizations such as RStudio and rOpenSci) are contributing to advancements of R’s capabilities and best practices. This has resulted in some powerful tools that advance both statistical and non-statistical modeling capabilities that are taking data analysis to new levels.

Many researchers in academic institutions are using and developing R code to develop the latest techniques in statistics and machine learning. As part of their research, they often publish an R package to accompany their research articles⁵. This provides immediate access to the latest analytic techniques and implementations. And this research is not solely focused on generalized algorithms, as many new capabilities take the form of advancing analytic algorithms for tasks in specific domains⁶. A quick assessment of the different task domains for which code is being developed illustrates the wide spectrum - econometrics, finance, chemometrics & computational physics, pharmacokinetics, social sciences, etc.

Powerful tools are also being developed to perform many tasks that greatly aid the data analysis process. This is not limited to just new ways to wrangle your data but also new ways to visualize and communicate data. R packages are now making it easier than ever to create interactive graphics and websites and produce sophisticated HTML and PDF reports. R packages are also integrating communication with high-performance programming languages such as C, Fortran, and C++, making data analysis more powerful, more efficient, and faster than ever.

So although the analytic mantra “use the right tool for the problem” should always be in our prefrontal cortex, the advancements and flexibility of R are making it the right tool for many problems.

2.3 Community

R is incredible software for statistics and data science. But while the bits and bytes of software are an essential component of its usefulness, software needs a community to be successful. And that’s an area where R really shines, as outlined by Shannon Ellis. For software, a thriving community needs to offer something for everyone - data scientists, developers, experts, educators, newbies, collaborators, writers, testers, and so much more. A few of the incredible places where you are sure to find help, good conversation, diversity, and a place to feel at home include:

Twitter: The default social hangout for R users where the #rstats hashtag is alive and thriving.
RStudio Community: A community for all things R and RStudio where you can get direct answers to your problems and also give back by helping to solve and answer others’ questions.
RStudio Education: A vast resource of high quality content for both learners and educators of the R programming language.
R-bloggers: An unbelievable resource where hundreds of blog posts from many different bloggers using R are aggregated.
R-Ladies: A world-wide organization focused on promoting gender diversity within the R community, with more than 30 local chapters.
Local R meetup groups: A Google search may show that there’s one in your area! If not, maybe consider starting one! Face-to-face meet-ups for users of all levels are incredibly valuable.
DataCarpentry and Software Carpentry: A resource of openly available lessons that promote and model reproducible research.
Stack Overflow: Chances are your R question has already been answered here (with additional resources for people looking for jobs).
And the list could go on and on but you get the point!

So now that you know how awesome R is, it’s time to learn how to use it.

References

Boehmke, Brad, and Brandon M Greenwell. 2019. Hands-on Machine Learning with R. CRC Press.

Ihaka, Ross, and Robert Gentleman. 1996. “R: A Language for Data Analysis and Graphics.” Journal of Computational and Graphical Statistics 5 (3). Taylor & Francis Group: 299–314.

Morandat, Floréal, Brandon Hill, Leo Osvald, and Jan Vitek. 2012. “Evaluating the Design of the R Language.” In European Conference on Object-Oriented Programming, 104–31. Springer.

¹ https://stackoverflow.blog/2017/10/10/impressive-growth-r/
² Consequently, R is named partly after its authors (Ross and Robert) and partly as a play on the name of S.
³ See Roger Peng’s R Programming for Data Science (Peng 2016) for further, yet concise, details on S and R’s history.
⁴ Open-source is far from new as it has been around for decades (e.g. A-2 in the 1950s, IBM’s ACP in the ’60s, Tiny BASIC in the ’70s) but has gained prominence since the late 1990s.
⁵ Examples include The Journal of Statistical Software and The R Journal.
⁶ See https://cran.r-project.org/web/views/ for domain categorized packages.