4 Dealing with Numbers
In this chapter you will learn the basics of working with numbers in R. This includes understanding how to manage the numeric type (integer vs. double), the different ways of generating non-random and random numbers, how to set seed values for reproducible random number generation, and the different ways to compare and round numeric values.
4.1 Numeric Types (integer vs. double)
The two most common numeric classes used in R are integer and double (for double precision floating point numbers). R automatically converts between these two classes when needed for mathematical purposes. As a result, it’s feasible to use R and perform analyses for years without specifying these differences.
4.1.1 Creating Integer and Double Vectors
By default, when you create a numeric vector using the c() function, it will produce a vector of double precision numeric values. To create a vector of integers using c(), you must specify this explicitly by placing an L directly after each number.
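A minimal sketch of both forms follows. The names dbl_var and int_var match those used in the conversion examples later in this section; the values themselves are an assumption, chosen only to be consistent with the outputs printed there.

```r
# c() without the L suffix produces a double vector
dbl_var <- c(1, 2.5, 4.5)   # assumed example values

# placing L after each number produces an integer vector
int_var <- c(1L, 6L, 10L)   # assumed example values
```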
4.1.2 Checking for Numeric Type
To check whether a vector is made up of integer or double values, use typeof(), or the predicates is.integer() and is.double().
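As an illustration, the check might look like the following. The two vectors are hypothetical examples, not values taken from the original text.

```r
# illustrative vectors (the values are assumptions)
dbl_var <- c(1, 2.5, 4.5)
int_var <- c(1L, 6L, 10L)

# typeof() reports the underlying storage type
type_dbl <- typeof(dbl_var)   # "double"
type_int <- typeof(int_var)   # "integer"

# the is.*() predicates return a single logical value
is.double(dbl_var)    # TRUE
is.integer(dbl_var)   # FALSE
```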
4.1.3 Converting Between Integer and Double Values
By default, if you read in data that has no decimal points or you create numeric values using the x <- 1:10
method the numeric values will be coded as integer. If you want to change a double to an integer or vice versa you can specify one of the following:
# converts integers to double-precision values
as.double(int_var)
## [1] 1 6 10
# identical to as.double()
as.numeric(int_var)
## [1] 1 6 10
# converts doubles to integers
as.integer(dbl_var)
## [1] 1 2 4
Although none of the results above print a decimal point, the first two are stored as doubles: typeof(as.double(int_var)) returns "double". Only as.integer() produces an integer vector, and it does so by truncating any fractional part rather than rounding.
4.2 Generating Non-random Numbers
There are a few R operators and functions that are especially useful for creating vectors of non-random numbers. These functions provide multiple ways for generating sequences of numbers.
4.2.1 Specifying Numbers within a Sequence
To explicitly specify numbers in a sequence you can use the colon : operator to specify all integers between two specified numbers, or the combine c() function to explicitly specify all numbers in the sequence.
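Both approaches can be sketched as follows. The values are arbitrary examples.

```r
# all integers from 1 to 10
int_seq <- 1:10
int_seq
## [1]  1  2  3  4  5  6  7  8  9 10

# explicitly list every value in the sequence
custom_seq <- c(2, 4.4, 6, 9.6)
custom_seq
## [1] 2.0 4.4 6.0 9.6
```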
4.2.2 Generating Regular Sequences
A generalization of : is the seq() function, which generates a sequence of numbers with a specified arithmetic progression.
# generate a sequence of numbers from 1 to 21 by increments of 2
seq(from = 1, to = 21, by = 2)
## [1] 1 3 5 7 9 11 13 15 17 19 21
# generate a sequence of 15 equally spaced numbers from 0 to 21
seq(0, 21, length.out = 15)
## [1] 0.0 1.5 3.0 4.5 6.0 7.5 9.0 10.5 12.0 13.5 15.0 16.5 18.0 19.5 21.0
4.2.3 Generating Repeated Sequences
The rep()
function allows us to conveniently repeat specified constants into long vectors. This function allows for collated and non-collated repetitions.
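A short sketch of the two repetition styles, using an arbitrary sequence: the times argument repeats the whole sequence (collated), while each repeats every element in place (non-collated).

```r
# repeat the whole sequence twice (collated)
collated <- rep(1:4, times = 2)
collated
## [1] 1 2 3 4 1 2 3 4

# repeat each element twice before moving on (non-collated)
non_collated <- rep(1:4, each = 2)
non_collated
## [1] 1 1 2 2 3 3 4 4
```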
4.3 Generating Random Numbers
Simulation is a common practice in data analysis. Sometimes your analysis requires a statistical procedure that relies on random number generation or sampling (e.g. Monte Carlo simulation, bootstrap sampling). R comes with a set of pseudo-random number generators that allow you to simulate the most common probability distributions, such as the uniform, normal, binomial, Poisson, exponential, and gamma distributions covered in the sections below.
4.3.1 Uniform numbers
To generate random numbers from a uniform distribution you can use the runif() function. Alternatively, you can use sample() to draw a random sample from a set of values, such as a range of integers, with or without replacement.
# generate n random numbers between the default values of 0 and 1
runif(n)
# generate n random numbers between 0 and 25
runif(n, min = 0, max = 25)
# generate n random integers between 0 and 25 (with replacement)
sample(0:25, n, replace = TRUE)
# generate n random integers between 0 and 25 (without replacement; requires n <= 26)
sample(0:25, n, replace = FALSE)
For example, to generate 25 random numbers between the values 0 and 10:
runif(25, min = 0, max = 10)
## [1] 4.8607590 8.5877346 1.9188738 7.0117873 7.0197845 3.0863653 4.5780395
## [8] 6.8954020 0.3202288 2.4103050 8.4993870 9.8574833 4.4712443 4.0346192
## [15] 8.9609042 6.8254551 9.3966741 9.3009633 4.1688575 1.8247144 3.0413893
## [22] 4.9572826 3.8652020 7.1419686 9.4727101
For each probability distribution, including the uniform, there are four primary functions available: random number generation, density (or probability mass function for discrete distributions), cumulative distribution, and quantiles. The prefixes for these functions are:
- r: random number generation
- d: density or probability mass function
- p: cumulative distribution
- q: quantiles
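To make the four prefixes concrete, here is a small sketch using the uniform distribution on the interval from 0 to 10. The interval and the evaluation points are arbitrary choices for illustration.

```r
# r: five random draws between 0 and 10
u <- runif(5, min = 0, max = 10)

# d: density at x = 2, which is 1 / (max - min) = 0.1
d_val <- dunif(2, min = 0, max = 10)

# p: cumulative probability P(X <= 2), which is 0.2
p_val <- punif(2, min = 0, max = 10)

# q: the value below which 20% of the distribution lies, which is 2
q_val <- qunif(0.2, min = 0, max = 10)
```

Note that the p and q functions are inverses of each other: qunif() applied to the result of punif() recovers the original value.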
4.3.2 Normal Distribution Numbers
The normal (or Gaussian) distribution is the most common and well-known distribution. Within R, the normal distribution functions are named by adding one of the prefixes r, p, q, or d to norm, for example rnorm().
# generate n random numbers from a normal distribution with given mean & st. dev.
rnorm(n, mean = 0, sd = 1)
# generate CDF probabilities for value(s) in vector q
pnorm(q, mean = 0, sd = 1)
# generate quantile for probabilities in vector p
qnorm(p, mean = 0, sd = 1)
# generate density values for value(s) in vector x
dnorm(x, mean = 0, sd = 1)
For example, to generate 25 random numbers from a normal distribution with mean = 100 and standard deviation = 15:
x <- rnorm(25, mean = 100, sd = 15)
x
## [1] 102.98320 95.34052 85.55814 100.22128 95.21110 92.66095 108.35895
## [8] 102.88705 104.83506 129.09899 90.85024 82.48825 93.32509 83.03384
## [15] 89.44639 117.77383 122.25275 113.60422 87.15246 108.18519 117.35427
## [22] 100.40000 77.82333 123.83163 111.69154
summary(x)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 77.82 90.85 100.40 101.45 111.69 129.10
You can also pass a vector of values. For instance, say you want to know the CDF probabilities for each value in the vector x created above:
pnorm(x, mean = 100, sd = 15)
## [1] 0.57882164 0.37804024 0.16782635 0.50588499 0.37476478 0.31232528
## [7] 0.71132630 0.57631292 0.62640131 0.97380606 0.27093615 0.12151433
## [13] 0.32816207 0.12901134 0.24084899 0.88197587 0.93103138 0.81778287
## [19] 0.19585994 0.70735686 0.87635382 0.51063729 0.06964442 0.94394445
## [25] 0.78213853
4.3.3 Binomial Distribution Numbers
The binomial distribution is conventionally interpreted as the number of successes in size = x trials, each with prob = p probability of success:
# generate a vector of length n giving the number of successes in
# size = 100 trials with a probability of success = 0.5
rbinom(n, size = 100, prob = 0.5)
# generate CDF probabilities for value(s) in vector q
pbinom(q, size = 100, prob = 0.5)
# generate quantile for probabilities in vector p
qbinom(p, size = 100, prob = 0.5)
# generate probability mass function values for value(s) in vector x
dbinom(x, size = 100, prob = 0.5)
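As a concrete sketch, the calls below fill in the placeholders n, q, p, and x with arbitrary example values for 100 fair coin flips.

```r
# ten simulated experiments, each counting successes in 100 fair coin flips
flips <- rbinom(10, size = 100, prob = 0.5)

# probability of exactly 50 successes
p_exact <- dbinom(50, size = 100, prob = 0.5)

# probability of 50 or fewer successes
p_at_most <- pbinom(50, size = 100, prob = 0.5)

# median number of successes
med <- qbinom(0.5, size = 100, prob = 0.5)
med
## [1] 50
```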
4.3.4 Poisson Distribution Numbers
The Poisson distribution is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time and/or space if these events occur with a known average rate and independently of the time since the last event.
# generate a vector of length n displaying the random number of events occurring
# when lambda (mean rate) equals 4.
rpois(n, lambda = 4)
# generate CDF probabilities for value(s) in vector q when lambda (mean rate)
# equals 4.
ppois(q, lambda = 4)
# generate quantile for probabilities in vector p when lambda (mean rate)
# equals 4.
qpois(p, lambda = 4)
# generate probability mass function values for value(s) in vector x when lambda
# (mean rate) equals 4.
dpois(x, lambda = 4)
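A brief sketch with the placeholders replaced by arbitrary example values. The comparison against exp(-4) relies on the standard Poisson probability of zero events, exp(-lambda).

```r
# ten simulated event counts with a mean rate of 4
counts <- rpois(10, lambda = 4)

# probability of observing zero events, which equals exp(-4)
p_zero <- dpois(0, lambda = 4)

# probability of observing at most 4 events
p_le_4 <- ppois(4, lambda = 4)
```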
4.3.5 Exponential Distribution Numbers
The Exponential probability distribution describes the time between events in a Poisson process.
# generate a vector of length n with rate = 1
rexp(n, rate = 1)
# generate CDF probabilities for value(s) in vector q when rate = 1.
pexp(q, rate = 1)
# generate quantile for probabilities in vector p when rate = 1.
qexp(p, rate = 1)
# generate density values for value(s) in vector x when rate = 1.
dexp(x, rate = 1)
4.3.6 Gamma Distribution Numbers
The Gamma probability distribution is related to the Beta distribution and arises naturally in processes for which the waiting times between Poisson distributed events are relevant.
# generate a vector of length n with shape parameter = 1
rgamma(n, shape = 1)
# generate CDF probabilities for value(s) in vector q when shape parameter = 1.
pgamma(q, shape = 1)
# generate quantile for probabilities in vector p when shape parameter = 1.
qgamma(p, shape = 1)
# generate density values for value(s) in vector x when shape
# parameter = 1.
dgamma(x, shape = 1)
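One way to sanity-check these calls: with shape = 1 and the default rate = 1, the gamma distribution reduces to the exponential distribution with rate = 1, so the two families should agree. The sketch below uses arbitrary example quantiles; q_vals is a hypothetical name.

```r
q_vals <- c(0.5, 1, 2)   # arbitrary example quantiles

exp_cdf   <- pexp(q_vals, rate = 1)      # exponential CDF, equal to 1 - exp(-q)
gamma_cdf <- pgamma(q_vals, shape = 1)   # gamma CDF with shape 1 (default rate = 1)

all.equal(exp_cdf, gamma_cdf)
## [1] TRUE
```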
4.4 Setting Seed Values
If you want to generate a sequence of random numbers and then be able to reproduce that same sequence later, you can set the random number generator's seed with set.seed(). This is a critical aspect of reproducible research.
For example, we can reproduce a random generation of 10 values from a normal distribution:
set.seed(197)
rnorm(n = 10, mean = 0, sd = 1)
## [1] 0.6091700 -1.4391423 2.0703326 0.7089004 0.6455311 0.7290563
## [7] -0.4658103 0.5971364 -0.5135480 -0.1866703
set.seed(197)
rnorm(n = 10, mean = 0, sd = 1)
## [1] 0.6091700 -1.4391423 2.0703326 0.7089004 0.6455311 0.7290563
## [7] -0.4658103 0.5971364 -0.5135480 -0.1866703
4.5 Comparing Numeric Values
There are multiple ways to compare numeric values and vectors. This includes logical operators along with testing for exact equality and also near equality.
4.5.1 Comparison Operators
The normal binary operators allow you to compare numeric values and provide the answer in logical form:
x < y # is x less than y
x > y # is x greater than y
x <= y # is x less than or equal to y
x >= y # is x greater than or equal to y
x == y # is x equal to y
x != y # is x not equal to y
These operators can be used to compare single numbers and also to compare numbers element-wise within vectors.
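Both cases can be sketched as follows, using arbitrary example values. The single-number comparison returns one logical value; the vector comparison returns one logical value per pair of elements.

```r
# single number comparison
a <- 9
b <- 10
a == b
## [1] FALSE

# element-wise comparison of two vectors
x <- c(1, 4, 9, 12)
y <- c(4, 4, 9, 13)
x == y
## [1] FALSE  TRUE  TRUE FALSE
```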
Note that logical values TRUE and FALSE equate to 1 and 0 respectively. So if you want to identify the number of equal values in two vectors you can wrap the operation in the sum() function.
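For instance, with two arbitrary example vectors:

```r
x <- c(1, 4, 9, 12)
y <- c(4, 4, 9, 13)

# each TRUE counts as 1, so the sum is the number of matching positions
n_equal <- sum(x == y)
n_equal
## [1] 2
```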
If you need to identify the location of pairwise equalities in two vectors you can wrap the operation in the which() function.
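A short sketch using two arbitrary example vectors:

```r
x <- c(1, 4, 9, 12)
y <- c(4, 4, 9, 13)

# which() returns the positions where the comparison is TRUE
equal_at <- which(x == y)
equal_at
## [1] 2 3
```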
4.5.2 Exact Equality
To test if two objects are exactly equal, use identical().
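The following sketch uses arbitrary example vectors. Unlike ==, identical() always returns a single TRUE or FALSE, and it also takes the storage type into account, so an integer vector and a double vector holding the same values are not identical.

```r
x <- c(4, 4, 9, 12)
y <- c(4, 4, 9, 13)
same_1 <- identical(x, y)
same_1
## [1] FALSE

z <- c(4, 4, 9, 12)
same_2 <- identical(x, z)
same_2
## [1] TRUE

# same values, different types (integer vs. double)
same_3 <- identical(c(1L, 2L), c(1, 2))
same_3
## [1] FALSE
```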
4.5.3 Floating Point Comparison
Sometimes you wish to test for ‘near equality’. The all.equal() function allows you to test for equality with a default tolerance of 1.5e-8. If the difference is greater than the tolerance, the function does not return FALSE; it returns a character string reporting the mean relative difference.
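A sketch of both outcomes, with arbitrary example values. Because a failed comparison yields a string rather than FALSE, wrapping the call in isTRUE() is a common way to get a plain logical for use in conditions.

```r
# differences far below the 1.5e-8 tolerance
x <- c(4.00000005, 4.00000008)
y <- c(4.00000002, 4.00000006)
near <- all.equal(x, y)
near
## [1] TRUE

# differences well above the tolerance: the result is a character string
# that begins with "Mean relative difference"
u <- c(4.005, 4.0008)
v <- c(4.002, 4.0006)
far <- all.equal(u, v)

# isTRUE() converts either outcome into a single logical value
isTRUE(far)
## [1] FALSE
```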
4.6 Rounding Numeric Values
There are many ways of rounding: to the nearest integer, up, down, or to a specified decimal place. The examples in this section round a numeric vector x.
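The original values of x are not shown in the text. The vector below is a hypothetical stand-in, chosen only because it reproduces the round(), ceiling(), and floor() outputs printed in this section; results for rounding to one decimal place can depend on floating point representation and on the R version.

```r
# hypothetical example vector (an assumption, not the original data)
x <- c(1, 1.35, 1.7, 2.05, 2.4, 2.75, 3.1, 3.45, 3.8, 4.15, 4.5, 4.85,
       5.2, 5.55, 5.9)

x_nearest <- round(x)     # nearest integer (exact halves go to the even number)
x_up      <- ceiling(x)   # always round up
x_down    <- floor(x)     # always round down
```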
The following illustrates the common ways to round x:
# Round to the nearest integer
round(x)
## [1] 1 1 2 2 2 3 3 3 4 4 4 5 5 6 6
# Round up
ceiling(x)
## [1] 1 2 2 3 3 3 4 4 4 5 5 5 6 6 6
# Round down
floor(x)
## [1] 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5
# Round to a specified decimal
round(x, digits = 1)
## [1] 1.0 1.4 1.7 2.0 2.4 2.8 3.1 3.5 3.8 4.2 4.5 4.8 5.2 5.6 5.9
4.7 Exercises
- Generate a sequence of non-random numbers from 1 to 100 by increments of 2. Save the output to an object x.
- Generate 50 random numbers between 0 and 100 with a uniform distribution. Set the seed to 123 so you can reproduce the same numbers. Save the output to an object y.
- Round y to the nearest integer.
- Compare x to y element-wise to find out how many of the x values are less than the corresponding y elements.