Why a clear data analysis workflow?
These are all things I picked up from browsing other presentations and repos.
Many thanks to Jenny Bryan and Joris Muller, from whom I cobbled many of these ideas and practices.
Thanks also to BuzzFeed, FiveThirtyEight, ProPublica, Chicago Tribune, Los Angeles Times, and TrendCT.org.
DO NOT USE setwd()
Keep every path relative to your project directory and the code will work for everyone who downloads your project repo folder.
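library(here)
# Build a path from the project root instead of hard-coding an absolute one
here("Test", "Folder", "text.txt")
##> [1] "/Users/IRE/Projects/NICAR/2018/workflow/Test/Folder/text.txt"
# Read the file through that same path
cat(readLines(here("Test", "Folder", "text.txt")))
##> You found the text file nested in these subdirectories!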
name_of_project
|--data
    |--2017report.csv
    |--2016report.pdf
    |--summary2016_2017.csv
|--docs
    |--01-analysis.Rmd
    |--01-analysis.html
|--scripts
    |--exploratory_analysis.R
|--name_of_project.Rproj
|--run_all.R
name_of_project
|--raw_data
    |--WhateverData.xlsx
    |--2017report.csv
    |--2016report.pdf
|--output_data
    |--summary2016_2017.csv
|--rmd
    |--01-analysis.Rmd
|--docs
    |--01-analysis.html
    |--01-analysis.pdf
    |--02-deeper.html
    |--02-deeper.pdf
|--scripts
    |--exploratory_analysis.R
    |--pdf_scraper.R
|--name_of_project.Rproj
|--run_all.R
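Both layouts list a run_all.R at the project root without showing what goes in it. One plausible reading, and it is only my assumption, is a script that runs everything in scripts/ in name order. A minimal sketch of that idea in base R; `run_all_scripts` and its `script_dir` argument are hypothetical names, not something from the original projects:

```r
# Hypothetical contents for run_all.R: run every .R file in scripts/ in name order.
run_all_scripts <- function(script_dir = "scripts") {
  scripts <- sort(list.files(script_dir, pattern = "\\.R$", full.names = TRUE))
  for (script in scripts) {
    message("Running ", script)
    # A fresh environment per script keeps its objects out of the global workspace
    source(script, local = new.env())
  }
  # Return the paths that were run, without printing them
  invisible(scripts)
}
```

Calling `run_all_scripts()` from the project root would then rerun the whole analysis. Numbering the file names (01-, 02-, ...) is one way to control the order.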
Everything below is for more advanced users but I’m putting it here for future reference.
folder_names <- c("raw_data", "output_data", "rmd", "docs", "scripts")
sapply(folder_names, dir.create)
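One catch with the two lines above: `dir.create()` warns when a folder already exists, so rerunning them is noisy. Here is a minimal sketch of a rerunnable variant, assuming the same folder names. `create_project_folders` and its `root` argument are hypothetical names I made up for illustration, and the demo writes to a temporary directory rather than a real project:

```r
# Hypothetical helper: create the project folders, skipping any that already exist.
create_project_folders <- function(folder_names, root = ".") {
  paths <- file.path(root, folder_names)
  # TRUE where a folder was newly created, FALSE where it was already there
  created <- vapply(paths, function(p) {
    if (dir.exists(p)) {
      FALSE
    } else {
      dir.create(p, recursive = TRUE)
    }
  }, logical(1))
  names(created) <- folder_names
  created
}

folder_names <- c("raw_data", "output_data", "rmd", "docs", "scripts")

# Demo in a throwaway location; in a real project, call it from the project root
demo_root <- file.path(tempdir(), "name_of_project")
create_project_folders(folder_names, root = demo_root)
```

Running the last line a second time creates nothing and raises no warnings.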
library(readr)

# Download the CSV only if it is not already saved locally
if (!file.exists("data/bostonpayroll2013.csv")) {
  dir.create("data", showWarnings = FALSE)
  download.file(
    "https://website.com/data/bostonpayroll2013.csv",
    "data/bostonpayroll2013.csv")
}
payroll <- read_csv("data/bostonpayroll2013.csv")

# Same idea for a zipped download: fetch and unzip only if the CSV is missing
if (!file.exists("data/bostonpayroll2013.csv")) {
  dir.create("data", showWarnings = FALSE)
  temp <- tempfile()
  download.file(
    "https://website.com/data/bostonpayroll2013.zip",
    temp)
  unzip(temp, exdir = "data", overwrite = TRUE)
  unlink(temp)
}
payroll <- read_csv("data/bostonpayroll2013.csv")
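If you repeat the download-if-missing pattern above for several files, it can be wrapped in a small function. This is a sketch in base R, not part of the original code. `download_if_missing` is a hypothetical name, and the function assumes the destination is a single file rather than a zip that needs extracting:

```r
# Hypothetical helper: download `url` to `dest` only when `dest` is missing.
# Returns TRUE if a download happened, FALSE if the local copy was reused.
download_if_missing <- function(url, dest) {
  if (file.exists(dest)) {
    return(FALSE)
  }
  # Create the destination folder (and any parents) without warning if it exists
  dir.create(dirname(dest), showWarnings = FALSE, recursive = TRUE)
  # mode = "wb" keeps binary files intact on Windows
  download.file(url, dest, mode = "wb", quiet = TRUE)
  TRUE
}
```

With the placeholder URL from the slides, the call would look like `download_if_missing("https://website.com/data/bostonpayroll2013.csv", "data/bostonpayroll2013.csv")`.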
Comment your code
Anything that appears on a line after # is treated as a comment, which means it is ignored when the script runs. Use comments to explain what the code does.
Get into this habit early. Future readers of the code, including yourself months from now, will be grateful for the clear documentation you leave behind.
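A short illustration of what that can look like. The numbers are made up purely for the example and do not come from any dataset in this talk:

```r
# Hypothetical payroll figures, invented only to illustrate commenting
pay <- c(52000, 61000, 48500)

# Use the median rather than the mean so one very large salary cannot skew the summary
typical_pay <- median(pay)

typical_pay  # a comment can also follow code on the same line
```

The most useful comments say why a step is there, not just what it does.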