Delimited Text

Sometimes you’ll encounter data with values that are delimited (separated) by characters other than commas. For example, I once received a spreadsheet delimited with emojis.

Why does this happen? This means the data was exported from a database and the user chose this option. Not all data can be the way we ideally want it but as far as this data structure goes, at least we’re dealing with PDFs (which is possible but we won’t be going over that process in this course)

If opened in a spreadsheet app, the delimited file would be interpreted like any other spreadsheet.

But this is how a pipe-delimited file looks like internally.

And this is how a tab-delimited file looks.

In base R, the way to import these files is to use the read.table() function.

You pass it the location of the file (in this case, it’s in the sub directory “data”) and whether it has a header row or not and what separator symbol to look for

# read.table(file,  header=logical_value, sep="delimiter")
df1 <- read.table("data/Employee_Payroll_Pipe.txt", header=TRUE, sep="|")
View(df1)
# a \t indicates a tab (and a \n indicates a line break, like pressing enter in a document)
df2 <- read.table("data/Employee_Payroll_Tab.txt",  header=TRUE, sep="\t")
View(df2)

read_delim()

The downsides of using the base read.table() function are the same as using base read.csv()

  • Naming schemes aren’t consistent
  • Slow loading
  • Turns strings into Factors automatically

To read in delimited pipe files use read_delim() from readr

## If you don't have readr installed yet, uncomment and run the line below
# install.packages("readr")
library(readr)
df1 <- read_delim("data/Employee_Payroll_Pipe.txt", delim="|")
## Parsed with column specification:
## cols(
##   `Office Name` = col_character(),
##   `Job Code` = col_integer(),
##   `Job Title` = col_character(),
##   `Base Pay` = col_character(),
##   `Position ID` = col_integer(),
##   `Employee Identifier` = col_character(),
##   `Original Hire Date` = col_character()
## )
df1
## # A tibble: 23 x 7
##    `Office Name`  `Job Code` `Job Title`         `Base Pay`  `Position ID`
##    <chr>               <int> <chr>               <chr>               <int>
##  1 STATES ATTORN…       1172 Assistant State's … " $20,088.…       9510200
##  2 STATES ATTORN…       1172 Assistant State's … " $23,436.…       9510200
##  3 STATES ATTORN…       1172 Assistant State's … " $20,422.…       9510200
##  4 STATES ATTORN…       1172 Assistant State's … " $23,904.…       9510200
##  5 STATES ATTORN…       1172 Assistant State's … " $20,745.…       9510200
##  6 STATES ATTORN…       1172 Assistant State's … " $24,473.…       9510200
##  7 STATES ATTORN…       1172 Assistant State's … " $21,217.…       9510200
##  8 COUNTY ASSESS…       5049 Residential Model … " $17,770.…       9500731
##  9 COUNTY ASSESS…       5049 Residential Model … " $20,800.…       9500731
## 10 COUNTY ASSESS…       5049 Residential Model … " $17,873.…       9500731
## # ... with 13 more rows, and 2 more variables: `Employee
## #   Identifier` <chr>, `Original Hire Date` <chr>

read_tsv()

To read in tab delimited pipe files use read_tsv() from readr

df2 <- read_tsv("data/Employee_Payroll_Tab.txt")
## Parsed with column specification:
## cols(
##   `Office Name` = col_character(),
##   `Job Code` = col_integer(),
##   `Job Title` = col_character(),
##   `Base Pay` = col_character(),
##   `Position ID` = col_integer(),
##   `Employee Identifier` = col_character(),
##   `Original Hire Date` = col_character()
## )
df2
## # A tibble: 23 x 7
##    `Office Name`  `Job Code` `Job Title`          `Base Pay` `Position ID`
##    <chr>               <int> <chr>                <chr>              <int>
##  1 STATES ATTORN…       1172 Assistant State's A… $20,088.00       9510200
##  2 STATES ATTORN…       1172 Assistant State's A… $23,436.00       9510200
##  3 STATES ATTORN…       1172 Assistant State's A… $20,422.82       9510200
##  4 STATES ATTORN…       1172 Assistant State's A… $23,904.80       9510200
##  5 STATES ATTORN…       1172 Assistant State's A… $20,745.80       9510200
##  6 STATES ATTORN…       1172 Assistant State's A… $24,473.38       9510200
##  7 STATES ATTORN…       1172 Assistant State's A… $21,217.35       9510200
##  8 COUNTY ASSESS…       5049 Residential Model S… $17,770.86       9500731
##  9 COUNTY ASSESS…       5049 Residential Model S… $20,800.67       9500731
## 10 COUNTY ASSESS…       5049 Residential Model S… $17,873.76       9500731
## # ... with 13 more rows, and 2 more variables: `Employee
## #   Identifier` <chr>, `Original Hire Date` <chr>

Fixed width files

Sometimes you’ll get data with fixed width columns.

It’ll look like this.

Just use the read_fwf() function from the readr package.

This is what it needs– pulled from typing ?read_fwf in the console:

read_fwf(file, col_positions, col_types = NULL, locale = default_locale(),
  na = c("", "NA"), comment = "", trim_ws = TRUE, skip = 0,
  n_max = Inf, guess_max = min(n_max, 1000), progress = show_progress())

A couple of important things you need for this to work:

  • Pass the widths of each column to the variable
  • The names of those columns

There are many methods for this, so be sure to check out the documentation.

# After looking at the raw data, the header starts on line 7. So be sure to pass that information to the function with "skip"

data_location <- "data/fixed_width_example.txt"

fixed_example <- read_fwf(data_location, skip=9, fwf_widths(c(8, 2, 12, 12, 29, 3,6, 9, 5, 18, 20, 8), c("entry", "per", "post_date", "gl_account", "description", "source", "cflow", "ref", "post", "debit", "credit", "alloc")))

head(fixed_example)
## # A tibble: 6 x 12
##    entry per   post_date  gl_account description  source cflow ref   post 
##    <int> <chr> <chr>           <dbl> <chr>        <chr>  <chr> <chr> <chr>
## 1  16524 01    10/17/2012 3930621977 TXNPUES      S1     Yes   RHMX… Yes  
## 2 191675 01    01/14/2013 2368183100 OUNHQEX XUF… S1     No    <NA>  Yes  
## 3 191667 01    01/14/2013 3714468136 GHAKASC QHJ… S1     Yes   <NA>  Yes  
## 4 191673 01    01/14/2013 2632703881 PAHFSAP LUV… S1     No    <NA>  Yes  
## 5  80495 01    11/21/2012 2766389794 XDZANTV      S1     Yes   TGZG… Yes  
## 6  80507 01    11/21/2012 4609266335 BWWYEZL      S1     Yes   USUK… Yes  
## # ... with 3 more variables: debit <dbl>, credit <dbl>, alloc <chr>

So the example above took a lot of work– I had to manually count the spaces of each column and then pass on the column names for each one. Sometimes there’s a data dictionary that includes all this, which makes it way easier.


© Copyright 2018, Andrew Ba Tran

© Copyright 2018, Andrew Tran