Delimited Text

Sometimes you’ll encounter data with values that are delimited (separated) by characters other than commas. For example, I once received a spreadsheet delimited with emojis.

Why does this happen? This means the data was exported from a database and the user chose this option. Not all data can be the way we ideally want it but as far as this data structure goes, at least we’re dealing with PDFs (which is possible but we won’t be going over that process in this course)

If opened in a spreadsheet app, the delimited file would be interpreted like any other spreadsheet.

But this is how a pipe-delimited file looks like internally.

And this is how a tab-delimited file looks.

In base R, the way to import these files is to use the read.table() function.

You pass it the location of the file (in this case, it’s in the sub directory “data”) and whether it has a header row or not and what separator symbol to look for

# read.table(file,  header=logical_value, sep="delimiter")
df1 <- read.table("data/Employee_Payroll_Pipe.txt", header=TRUE, sep="|")
## Warning in read.table("data/Employee_Payroll_Pipe.txt", header = TRUE,
## sep = "|"): incomplete final line found by readTableHeader on 'data/
## Employee_Payroll_Pipe.txt'
View(df1)
# a \t indicates a tab (and a \n indicates a line break, like pressing enter in a document)
df2 <- read.table("data/Employee_Payroll_Tab.txt",  header=TRUE, sep="\t")
## Warning in read.table("data/Employee_Payroll_Tab.txt", header = TRUE,
## sep = "\t"): incomplete final line found by readTableHeader on 'data/
## Employee_Payroll_Tab.txt'
View(df2)

read_delim()

The downsides of using the base read.table() function are the same as using base read.csv()

  • Naming schemes aren’t consistent
  • Slow loading
  • Turns strings into Factors automatically

To read in delimited pipe files use read_delim() from readr

## If you don't have readr installed yet, uncomment and run the line below
# install.packages("readr")
library(readr)
df1 <- read_delim("data/Employee_Payroll_Pipe.txt", delim="|")
## Parsed with column specification:
## cols(
##   `Office Name` = col_character(),
##   `Job Code` = col_double(),
##   `Job Title` = col_character(),
##   `Base Pay` = col_character(),
##   `Position ID` = col_double(),
##   `Employee Identifier` = col_character(),
##   `Original Hire Date` = col_character()
## )
df1
## # A tibble: 23 x 7
##    `Office Name` `Job Code` `Job Title` `Base Pay` `Position ID`
##    <chr>              <dbl> <chr>       <chr>              <dbl>
##  1 STATES ATTOR…       1172 Assistant … " $20,088…       9510200
##  2 STATES ATTOR…       1172 Assistant … " $23,436…       9510200
##  3 STATES ATTOR…       1172 Assistant … " $20,422…       9510200
##  4 STATES ATTOR…       1172 Assistant … " $23,904…       9510200
##  5 STATES ATTOR…       1172 Assistant … " $20,745…       9510200
##  6 STATES ATTOR…       1172 Assistant … " $24,473…       9510200
##  7 STATES ATTOR…       1172 Assistant … " $21,217…       9510200
##  8 COUNTY ASSES…       5049 Residentia… " $17,770…       9500731
##  9 COUNTY ASSES…       5049 Residentia… " $20,800…       9500731
## 10 COUNTY ASSES…       5049 Residentia… " $17,873…       9500731
## # … with 13 more rows, and 2 more variables: `Employee Identifier` <chr>,
## #   `Original Hire Date` <chr>

read_tsv()

To read in tab delimited pipe files use read_tsv() from readr

df2 <- read_tsv("data/Employee_Payroll_Tab.txt")
## Parsed with column specification:
## cols(
##   `Office Name` = col_character(),
##   `Job Code` = col_double(),
##   `Job Title` = col_character(),
##   `Base Pay` = col_character(),
##   `Position ID` = col_double(),
##   `Employee Identifier` = col_character(),
##   `Original Hire Date` = col_character()
## )
df2
## # A tibble: 23 x 7
##    `Office Name` `Job Code` `Job Title` `Base Pay` `Position ID`
##    <chr>              <dbl> <chr>       <chr>              <dbl>
##  1 STATES ATTOR…       1172 Assistant … $20,088.00       9510200
##  2 STATES ATTOR…       1172 Assistant … $23,436.00       9510200
##  3 STATES ATTOR…       1172 Assistant … $20,422.82       9510200
##  4 STATES ATTOR…       1172 Assistant … $23,904.80       9510200
##  5 STATES ATTOR…       1172 Assistant … $20,745.80       9510200
##  6 STATES ATTOR…       1172 Assistant … $24,473.38       9510200
##  7 STATES ATTOR…       1172 Assistant … $21,217.35       9510200
##  8 COUNTY ASSES…       5049 Residentia… $17,770.86       9500731
##  9 COUNTY ASSES…       5049 Residentia… $20,800.67       9500731
## 10 COUNTY ASSES…       5049 Residentia… $17,873.76       9500731
## # … with 13 more rows, and 2 more variables: `Employee Identifier` <chr>,
## #   `Original Hire Date` <chr>

Fixed width files

Sometimes you’ll get data with fixed width columns.

It’ll look like this.

Just use the read_fwf() function from the readr package.

This is what it needs– pulled from typing ?read_fwf in the console:

read_fwf(file, col_positions, col_types = NULL, locale = default_locale(),
  na = c("", "NA"), comment = "", trim_ws = TRUE, skip = 0,
  n_max = Inf, guess_max = min(n_max, 1000), progress = show_progress())

A couple of important things you need for this to work:

  • Pass the widths of each column to the variable
  • The names of those columns

There are many methods for this, so be sure to check out the documentation.

# After looking at the raw data, the header starts on line 7. So be sure to pass that information to the function with "skip"

data_location <- "data/fixed_width_example.txt"

fixed_example <- read_fwf(data_location, skip=9, fwf_widths(c(8, 2, 12, 12, 29, 3,6, 9, 5, 18, 20, 8), c("entry", "per", "post_date", "gl_account", "description", "source", "cflow", "ref", "post", "debit", "credit", "alloc")))

head(fixed_example)
## # A tibble: 6 x 12
##    entry per   post_date gl_account description source cflow ref   post 
##    <dbl> <chr> <chr>          <dbl> <chr>       <chr>  <chr> <chr> <chr>
## 1  16524 01    10/17/20… 3930621977 TXNPUES     S1     Yes   RHMX… Yes  
## 2 191675 01    01/14/20… 2368183100 OUNHQEX XU… S1     No    <NA>  Yes  
## 3 191667 01    01/14/20… 3714468136 GHAKASC QH… S1     Yes   <NA>  Yes  
## 4 191673 01    01/14/20… 2632703881 PAHFSAP LU… S1     No    <NA>  Yes  
## 5  80495 01    11/21/20… 2766389794 XDZANTV     S1     Yes   TGZG… Yes  
## 6  80507 01    11/21/20… 4609266335 BWWYEZL     S1     Yes   USUK… Yes  
## # … with 3 more variables: debit <dbl>, credit <dbl>, alloc <chr>

So the example above took a lot of work– I had to manually count the spaces of each column and then pass on the column names for each one. Sometimes there’s a data dictionary that includes all this, which makes it way easier.


© Copyright 2018, Andrew Ba Tran

© Copyright 2018, Andrew Tran