Your working directory is the folder on your computer in which you are currently working. When you ask R to open a certain file, it will look in the working directory for this file, and when you tell R to save a data file or figure, it will save it in the working directory.
Before you start working, set your working directory to the folder where all your data and script files are or should be stored.
Type in the command window:
# On a mac, it'd look like this
setwd("~/projects/learn-r-journalism")
# On a pc, it might look like this
setwd("C:/Documents/learn-r-journalism")
Make sure that slashes are forward slashes and that you don’t forget the quotation marks. R is also case sensitive, so make sure you write capitals where necessary.
Within RStudio, you can also set the working directory via the menu Session > Set Working Directory (Tools > Set Working Directory in older versions).
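To check that the change took effect, you can ask R where it is currently working. This is a minimal sketch; tempdir() is only a stand-in for your own project folder.

```r
# Remember the current working directory so we can return to it
old_wd <- getwd()

# tempdir() is a placeholder here; use your own project folder instead
setwd(tempdir())

# getwd() reports the folder R is now working in
new_wd <- getwd()
print(new_wd)

# list.files() shows what R can see in that folder
print(list.files())

# Switch back to where we started
setwd(old_wd)
```

If getwd() prints a folder you did not expect, that is usually why R cannot find a file you asked it to open.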
R can do many statistical and data analyses.
They are organized in so-called packages or libraries. With the standard installation (known as base), most common packages are installed.
To get a list of all installed packages, go to the packages window or type library()
in the console window. If the box in front of the package name is ticked, the package is loaded (activated) and can be used.
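If you prefer the console to the packages window, the same two questions (is it installed, is it loaded) can be asked with code. A small sketch, using the base package only because it ships with every copy of R:

```r
# Names of every package installed on this computer
installed <- rownames(installed.packages())

# "base" comes with R itself, so this should be TRUE
print("base" %in% installed)

# search() lists what is currently loaded and attached
attached <- search()
print(attached)

# A loaded package shows up with the prefix "package:"
print("package:base" %in% attached)
```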
There are many more packages available on the R website. If you want to install and use a package (for example, the package called “geometry”) you should:
Install the package: click install packages in the packages window and type geometry, or type install.packages("geometry") in the command window.
Load the package: tick the box in front of geometry in the packages window, or type library("geometry") in the command window.
R can be used as a calculator.
Just type an equation in the command window after the > prompt.
10^2 + 26
## [1] 126
You can give numbers a name.
By doing so, they become so-called variables which can be used later.
a <- 4
a
## [1] 4
You can do calculations with a
now.
a*5
## [1] 20
If you type a
again, it still holds its original value, because the result of a*5 was not assigned to anything.
a
## [1] 4
You can also assign a value to a
using the old one
a <- a + 10
a
## [1] 14
To remove all variables from R’s memory, type
rm(list=ls())
or click the “clear all” broom button in the workspace window.
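You don't have to wipe everything at once. As a small sketch, ls() lists what is in memory and rm() can remove a single variable by name:

```r
a <- 4
b <- 10

# ls() lists the variables R currently remembers
print(ls())

# Remove only a; b is left alone
rm(a)

# exists() checks whether a variable with that name is still defined here
a_still_there <- exists("a", inherits = FALSE)
print(a_still_there)
print(b)
```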
Like many other programs, R organizes numbers in scalars (a single number, 0-dimensional), vectors (a row of numbers, also called arrays, 1-dimensional) and matrices (like a table, 2-dimensional).
The a
you defined was a scalar.
To define a vector with the numbers 3, 4, and 5, you need the function c()
which is short for concatenate (or paste together).
b <- c(3, 4, 5)
b
## [1] 3 4 5
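Here is a short sketch of what you can do with a vector once you have one, plus a matrix for comparison. The matrix m is made up for illustration.

```r
b <- c(3, 4, 5)

# Square brackets pick out elements; R starts counting at 1
second <- b[2]
print(second)

# Math is applied to every element of the vector
doubled <- b * 2
print(doubled)

# A matrix is 2-dimensional: 2 rows and 3 columns here,
# filled column by column with the numbers 1 to 6
m <- matrix(1:6, nrow = 2)
print(m)
print(dim(m))

# Pick a single cell with [row, column]
print(m[2, 3])
```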
If you would like to compute the mean of all the elements in the vector b
from the example above, you could type
(3+4+5)/3
## [1] 4
But when the vector is very long, this is very boring and time-consuming work.
This is why things you do often are automated in so-called functions. Some functions are standard in R or in one of the packages. You can also program your own functions.
When you use a function to compute a mean, you’ll type
mean(x=b)
## [1] 4
Within the brackets you specify the arguments.
Arguments give extra information to the function. In this case, the argument x says of which set of numbers (vector) the mean should be computed (namely of b).
Sometimes the name of the argument is not necessary:
mean(b)
## [1] 4
Also works.
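The same idea applies to functions with more than one argument. As a sketch using the base function round(), which takes x and digits: without names, R matches arguments by position; with names, the order no longer matters.

```r
# Matched by position: first the number, then the digits
by_position <- round(3.14159, 2)

# Matched by name
by_name <- round(x = 3.14159, digits = 2)

# Named arguments can come in any order
swapped <- round(digits = 2, x = 3.14159)

print(by_position)
print(by_name)
print(swapped)
```

All three calls give the same result.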
R can make simple graphics right away.
# rnorm() is a base function that draws random samples from a normal distribution
x <- rnorm(100)
# plot() is a base function that charts the values
plot(x)
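Because rnorm() is random, your chart will look different every time. If you want the same numbers on every run, one option is to set a seed first. A sketch, where the seed value 42 and the labels are arbitrary choices:

```r
# Setting a seed makes the "random" numbers repeatable
set.seed(42)
x <- rnorm(100)

# Same seed, same numbers
set.seed(42)
y <- rnorm(100)
print(identical(x, y))

# Add a title and axis labels to the chart
plot(x, main = "100 random draws", xlab = "Draw number", ylab = "Value")

# hist() is another base function; it shows how the values are distributed
hist(x)
```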
R is an interpreter that uses a command line based environment.
This means that you have to type commands, rather than use the mouse and menus.
This has the advantage that you do not always have to retype commands.
You can store your commands in files, the so-called scripts. These scripts typically have file names with the extension .R as in script.R.
You can open an editor window to edit these files by clicking File > New or File > Open file…
You can run (send to the console window) part of the code by selecting lines and pressing CTRL+ENTER or CMD+ENTER or clicking the Run button at the top of the script editor window. If you do not select anything, R will run the line your cursor is on.
You can always run the whole script with the function source().
For example, to run the entire saved script.R if it’s in the working directory, type
source("script.R")
You can also click Run all in the editor window or press CTRL+SHIFT+S or CMD+SHIFT+S.
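To see source() in action without touching your own files, here is a sketch that writes a tiny two-line script into a temporary folder and then runs it. The file contents and the variable name total are made up for the example.

```r
# Build a path for a throwaway script in R's temporary folder
script_path <- file.path(tempdir(), "script.R")

# Write two lines of R code into that file
writeLines(c("total <- 10^2 + 26", "print(total)"), script_path)

# source() runs every line in the file, top to bottom
source(script_path)

# Variables created by the script are available afterwards
print(total)
```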
When you work with real data, you will encounter missing values because instrumentation failed or because you didn’t want to measure on the weekend.
When a data point is not available, you write NA
instead of a number.
j <- c(1,2,NA)
Computing statistics of incomplete data sets is, strictly speaking, not possible.
Maybe the largest value occurred during the weekend when you didn’t measure. Therefore, R will say that it doesn’t know what the largest value of j
is:
max(j)
## [1] NA
If you don’t mind about the missing data and want to compute the statistics anyway, you can add the argument na.rm=TRUE
(Should I remove the NA
s? Yes)
max(j, na.rm=TRUE)
## [1] 2
NA
s will also affect any sort of math if you’re not careful
sum(j)
## [1] NA
# compared to
sum(j, na.rm=TRUE)
## [1] 3
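Before deciding to drop missing values, it helps to know how many there are. A short sketch using the base function is.na() on the same vector j:

```r
j <- c(1, 2, NA)

# is.na() answers TRUE or FALSE for every element
missing <- is.na(j)
print(missing)

# TRUE counts as 1, so sum() gives the number of missing values
n_missing <- sum(missing)
print(n_missing)

# The mean of the values that are present
mean_present <- mean(j, na.rm = TRUE)
print(mean_present)
```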
We’ve been working so far with numbers.
But sometimes data we work with can be specified as something else, like characters and strings or boolean values like TRUE or FALSE or dates.
m <- "apples"
m
## [1] "apples"
To tell R that something is a character string, you should type the text between quotation marks, otherwise R will start looking for a defined variable with the same name.
n <- pears
## Error in eval(expr, envir, enclos): object 'pears' not found
You can’t do math with characters.
m + 2
## Error in m + 2: non-numeric argument to binary operator
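If you are not sure what type something is, you can ask R with class(). The sketch below also shows that a number stored as text needs converting before you can do math with it; the variable p is made up for the example.

```r
m <- "apples"

# class() reports how R has stored a value
print(class(m))      # character
print(class(4))      # numeric
print(class(TRUE))   # logical

# A number wrapped in quotation marks is still text
p <- "2"
print(class(p))

# Convert it to a number first, then the math works
converted <- as.numeric(p) + 2
print(converted)
```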
Dates and times are complicated.
R has to know that 3 o’clock comes after 2:59 and that February has 29 days in some years.
The base way to tell R that something is a date-time combination is with the function strptime()
date1 <- strptime(c("20100225230000", "20100226000000", "20100226010000"), format="%Y%m%d%H%M%S")
date1
## [1] "2010-02-25 23:00:00 EST" "2010-02-26 00:00:00 EST"
## [3] "2010-02-26 01:00:00 EST"
A vector is created with c()
and the numbers between the quotation marks are strings, because that’s what the strptime()
function requires.
That’s followed by the argument format that defines how the character string should be read. In this instance, the year is denoted first (%Y), then the month (%m), day (%d), hour (%H), minute (%M) and second (%S). You don’t have to specify all of them, as long as the format corresponds to the character string.
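As a sketch of how the format has to follow the string, here is a date written day-first with slashes and no seconds. The date string is made up, and tz = "UTC" is set only so the result does not depend on your computer's time zone.

```r
# The format mirrors the string: day/month/year hour:minute
d <- strptime("25/02/2010 23:00", format = "%d/%m/%Y %H:%M", tz = "UTC")

# format() turns the date-time back into text in any layout you like
as_text <- format(d, "%Y-%m-%d")
print(as_text)

# Adding a number adds seconds, so 3600 is one hour later
later <- d + 3600
later_text <- format(later, "%Y-%m-%d %H:%M")
print(later_text)
```

Adding an hour to 23:00 rolls over to the next day, which is the kind of bookkeeping R handles once it knows the value is a date-time.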
In this course, we’ll be using a less messy way to deal with dates using the package lubridate.
# If you don't currently have the lubridate package installed, uncomment the line below and run it
# install.packages("lubridate")
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
date1 <- ymd_hms(c("20100225230000", "20100226000000", "20100226010000"))
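Once the strings are parsed, lubridate has helper functions for pulling out the pieces. A sketch, wrapped in a check so that it only runs if lubridate is installed:

```r
# Only run the example if lubridate is available on this computer
if (requireNamespace("lubridate", quietly = TRUE)) {
  date1 <- lubridate::ymd_hms(c("20100225230000", "20100226000000", "20100226010000"))

  # Pull individual parts out of each date-time
  years <- lubridate::year(date1)
  hours <- lubridate::hour(date1)
  print(years)
  print(hours)

  # Subtracting date-times gives the time between them
  print(date1[3] - date1[1])
}
```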
Factors are complicated.
They’re categorical variables that are useful for statisticians with plots and regression analysis.
For example, Race might be input as “White”, “Black”, and “Hispanic”.
When importing that data from a spreadsheet, older versions of R (before 4.0.0) will most often interpret it as a factor by default; newer versions read text as character strings unless you ask for factors.
Let’s take a look at the structure behind a dataframe I’ve included, called sample_df
str(sample_df)
## 'data.frame': 4 obs. of 4 variables:
## $ id : Factor w/ 4 levels "1001","1002",..: 1 2 3 4
## $ name: chr "Steve" "Pam" "Jim" "Dwight"
## $ age : num 26 65 15 7
## $ race: Factor w/ 3 levels "Black","Hispanic",..: 3 1 3 2
R sees that the Race column is a factor variable with three levels.
levels(sample_df$race)
## [1] "Black" "Hispanic" "White"
This means that R groups statistics by these levels.
summary(sample_df$race)
## Black Hispanic White
## 1 1 2
Internally, R stores the integer values 1, 2, and 3, and maps the character strings in alphabetical order to these values. 1=Black, 2=Hispanic, and 3=White.
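You can see this mapping for yourself by building a factor by hand. A sketch using the same four values as the race column; the custom level order at the end is an arbitrary choice for illustration.

```r
race <- factor(c("White", "Black", "White", "Hispanic"))

# Levels are sorted alphabetically by default
print(levels(race))

# The integer codes R stores behind the scenes
codes <- as.integer(race)
print(codes)

# table() counts how often each level appears
print(table(race))

# You can set the order of the levels yourself
race_ordered <- factor(race, levels = c("White", "Hispanic", "Black"))
codes_ordered <- as.integer(race_ordered)
print(codes_ordered)
```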
Why is this important to know?
Journalists are less concerned with factors and will often find themselves converting factors to character strings. But when you reach the point where you want to create models and linear regressions, you’ll be happy that it’s an option.
Most odd quirks when it comes to R can be traced back to the fact that R was created by and for statisticians. R has grown a lot since then and the community has helped make it evolve to handle data the way we are more used to. But some habits die hard and are ingrained.
Integers and numbers are self-explanatory: R saves them as whole numbers or as numbers with decimals.
Here’s a warning.
sample_df$race
## [1] White Black White Hispanic
## Levels: Black Hispanic White
as.character(sample_df$race)
## [1] "White" "Black" "White" "Hispanic"
sample_df$name
## [1] "Steve" "Pam" "Jim" "Dwight"
factor(sample_df$name)
## [1] Steve Pam Jim Dwight
## Levels: Dwight Jim Pam Steve
sample_df$id
## [1] 1001 1002 1003 1004
## Levels: 1001 1002 1003 1004
as.numeric(sample_df$id)
## [1] 1 2 3 4
as.numeric() returned 1 2 3 4 rather than the original IDs because R stores factors as integer values.
You must convert a factor to characters first before converting it to numbers.
You can do both in one step by nesting the two functions.
sample_df$id
## [1] 1001 1002 1003 1004
## Levels: 1001 1002 1003 1004
as.numeric(as.character(sample_df$id))
## [1] 1001 1002 1003 1004
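Here is the same trap in a form you can run without sample_df, using a stand-in factor called id. The last line shows an alternative way to get the same result by converting the levels and then indexing them with the factor.

```r
# A stand-in for the id column
id <- factor(c("1001", "1002", "1003", "1004"))

# The trap: this returns the internal codes, not the IDs
wrong <- as.numeric(id)
print(wrong)

# Convert to character first, then to numbers
right <- as.numeric(as.character(id))
print(right)

# Alternative: convert the levels, then look them up by code
also_right <- as.numeric(levels(id))[id]
print(also_right)
```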
You can also save code you’ve written that simplifies your process into a function.
percent_change <- function(first_number, second_number) {
  pc <- (second_number-first_number)/first_number*100
  return(pc)
}
percent_change(100,150)
## [1] 50
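This is what’s happening in the code above:
percent_change is the name of the function, and function() is what is assigned to it.
Two variables need to be passed to this function: first_number and second_number.
pc is created using some math calculating percent change from the two variables passed to it.
return() hands the value of pc back as the result of calling percent_change() from the first line.
Build enough functions and you can save them as your own package.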
The important thing about functions is that they’re programmed by humans.
I constructed this function knowing that I’ll only pass it two inputs, that each one will be numeric, and that they’ll be in order: first, then second.
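To see why the order matters, here is a sketch that calls the same function a few different ways. Naming the arguments is one way to stop relying on order.

```r
percent_change <- function(first_number, second_number) {
  pc <- (second_number - first_number) / first_number * 100
  return(pc)
}

# Going from 100 to 150
up <- percent_change(100, 150)
print(up)

# Swap the order and you get a different answer
down <- percent_change(150, 100)
print(down)

# With names, the order you type them in no longer matters
named <- percent_change(second_number = 150, first_number = 100)
print(named)

# It also works on whole vectors at once
several <- percent_change(c(100, 200), c(150, 100))
print(several)
```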
If you’re working in R and a function you’re using is giving an error, it most likely means there’s something wrong with one or more of the variables you’re passing to the function.
Here’s what happens when I pass a string to my percent_change()
function.
percent_change("what", "happens")
## Error in second_number - first_number: non-numeric argument to binary operator
Sometimes really great R programmers will anticipate errors and catch bad inputs and try to output helpful suggestions instead of a generic error.
This particular error isn’t very explicit. It needs to be interpreted but you know that the function needs numbers to work correctly.
New users might not know that intuitively.
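As a sketch of what catching bad inputs could look like, here is a hypothetical variant of the function, percent_change_checked(), that tests its inputs and stops with a friendlier message. The name and the wording of the message are made up for this example.

```r
percent_change_checked <- function(first_number, second_number) {
  # Stop early with a readable message if either input is not a number
  if (!is.numeric(first_number) || !is.numeric(second_number)) {
    stop("percent_change_checked() needs two numbers, for example percent_change_checked(100, 150)")
  }
  (second_number - first_number) / first_number * 100
}

# Works as before with numbers
print(percent_change_checked(100, 150))

# tryCatch() lets us capture the error message instead of halting the script
result <- tryCatch(
  percent_change_checked("what", "happens"),
  error = function(e) conditionMessage(e)
)
print(result)
```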
And that’s how you’re going to feel when functions you run don’t work.
You’ll have to Google the error or peek into code to see if you can see what it expects and how it might break down thanks to what you’ve passed it.
And it won’t be entirely your fault.
When we write code and share it with others, we can’t anticipate all the ways in which they might want to use it in the future.
Shoot the function writer a message or if you wrote the package, welcome feedback from others.
This is what makes participating in the R community so great. We just want to do better.