Improving our pipeline means looking for opportunities, big and small, to make our code better.
We get an opportunity right off the bat! Rather than copy/pasting the entire URL we can just take the sheet ID of our Google Sheet. Getting the same information with fewer keystrokes and less opportunity for error is always a win.
# Since this Google Sheet is public to anyone with the link we don't need
# to authenticate
gs4_deauth()

# Get the ID of our Google Sheet
sheet_id <- "1v0lG-4arxF_zCCpfoUzCydwzaea7QqWTTQzTr8Dompw"

# Read in the penguins data using the googlesheets4 package
penguins_remote_improved <- read_sheet(sheet_id)
2.3 Still Read in Raw Data
We’ll continue our pipeline as in the last chapter by writing our just-ingested data to the “raw/” folder in our R Project.
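A minimal sketch of that write step, assuming the {readr} package and a “data/raw/” path carried over from the last chapter (your folder and file names may differ):

library(readr)

# Write the just-ingested data to the raw data folder
write_csv(penguins_remote_improved, "data/raw/penguins_raw.csv")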
Let’s start by using the {janitor} package to clean up our column names. This will make it easier to work with the data moving forward.
A big part of data engineering is naming things consistently and choosing how to format your variables. Naming things sounds simple but is notoriously difficult.
Keeping names simple, avoiding ambiguity where possible, and using a consistent structure for names can all help. The {janitor} package helps us enforce a consistent structure for our variables in one line of code.
In the raw Palmer Penguins data the variable names’ structure is all over the place: “studyName” uses camel case, capitalization is inconsistent across names, there are special characters like “(” that can cause issues on some systems, and there are often spaces.
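Here’s a sketch of that one-line fix, assuming the data from our improved read above (the before/after pairs in the comments follow the Palmer Penguins documentation):

library(janitor)

# Standardize every column name to lower-case snake_case
penguins_clean <- penguins_remote_improved %>%
  clean_names()

# "studyName"          becomes "study_name"
# "Culmen Length (mm)" becomes "culmen_length_mm"
# "Delta 15 N (o/oo)"  becomes "delta_15_n_o_oo"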
After running clean_names() we get consistent “snake case” separation between words, consistent lower case for all characters, no special characters like “(”, and no spaces. This standardization makes the variable names easier to remember, type, and reuse. Enforce a standard like this for your data and your end users will thank you.
Now that we’ve made our variable names easier to work with, let’s dive into the actual data. The scan_data() function from the {pointblank} package makes it easy to get a high-level overview of our data. I want the ten-thousand-foot view first, so let’s look at just the overview of our data frame.
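A sketch of that first look, assuming our cleaned data lives in penguins_clean; in {pointblank}, the sections argument takes a string of letters, and "O" requests only the overview section:

library(pointblank)

# Generate a high-level overview of the data frame
scan_data(penguins_clean, sections = "O")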
Over 40% of our data is missing, which seems like a bad sign. The scan_data() function can help us see where this data is missing quickly with a handy visualization.
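Under the same assumptions, the "M" letter requests the missing-values section of the scan:

# Generate just the missingness summary and plot
scan_data(penguins_clean, sections = "M")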
Okay, so we’ve already narrowed our problem down to a certain set of columns where the data seems to be almost entirely missing. Maybe it’s on purpose and we’re fine?
Not quite. Most of the “missing” data are hiding in list-columns, a column type in R that can contain a whole list within each row. So the data isn’t so much “missing” as “inaccessible in its current format.”
This list-column property can be useful in certain cases, but it wasn’t what we were going for here. What do we do now?
2.7 Data Engineering is About Tradeoffs + Procedures
One rule we’ve likely all heard is to never touch “raw” data. However, in a data engineering role the “raw” data can be completely unusable, nonsensical, or worse.
Having a reproducible pipeline with automated checks for data quality helps, but there’s still a human making these decisions.
Therefore, while it’s good practice not to edit “raw” data files once they’re established, we still want the “raw” data to contain the information we need. Let’s figure out how to rescue the information from those list-columns.
Since I happen to know these values are “nested” in lists, maybe we can access them using the unnest() function from {tidyr}. Let’s try that on the offending columns.
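A sketch of that attempt, assuming the offending columns are the measurement and isotope columns (the same ones we select later in this chapter):

library(tidyr)

# Try to flatten the list-columns back into regular vectors
penguins_clean %>%
  unnest(cols = c(
    culmen_length_mm:body_mass_g,
    delta_15_n_o_oo,
    delta_13_c_o_oo
  ))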
Error in `list_unchop()`:
! Can't combine `x[[1]]` <double> and `x[[4]]` <character>.
2.8 Are We Doomed?
Nope! Errors are an inevitable part of improving a pipeline. In fact, if you’re not finding anything weird in your pipeline, I’d be more nervous than if you hit errors like this one.
In this case, we see there are values of multiple types in a column that can’t be combined: a <double> and a <character>. I have a hunch based on experience, which I can investigate further by looking at the “sex” column.
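A quick tally makes the hunch concrete; this is a sketch assuming {dplyr} is loaded and that the sex column came through as a regular character column:

# Count the distinct values in the sex column
penguins_clean %>%
  count(sex)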
# A tibble: 3 × 2
sex n
<chr> <int>
1 FEMALE 165
2 MALE 168
3 NA 11
Aha! It looks like the value “NA” is being read as a character string rather than a missing value. This could be what’s going on with our initial read using read_sheet(). Let’s investigate.
penguins_na_fix <- read_sheet(sheet_id, na = "NA")

penguins_na_fix %>%
  clean_names() %>%
  select(
    c(
      culmen_length_mm:body_mass_g,
      delta_15_n_o_oo,
      delta_13_c_o_oo
    )
  )
Fantastic, the information from these columns seems to be available to us now. Let’s check the overall data frame using scan_data() again, this time generating the overview and the missingness plot simultaneously.
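A sketch of that re-scan, under the same assumptions as before ("O" and "M" request the overview and missing-values sections together):

penguins_na_fix %>%
  clean_names() %>%
  scan_data(sections = "OM")

We can also spot-check the comments column, where true missing values should now show up as <NA> rather than the string “NA”:

penguins_na_fix %>%
  clean_names() %>%
  select(comments)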
# A tibble: 344 × 1
comments
<chr>
1 Not enough blood for isotopes.
2 <NA>
3 <NA>
4 Adult not sampled.
5 <NA>
6 <NA>
7 Nest never observed with full clutch.
8 Nest never observed with full clutch.
9 No blood sample obtained.
10 No blood sample obtained for sexing.
# ℹ 334 more rows
Based on our investigation and the Palmer Penguins documentation this data is looking much more like the data we’d expect.
2.9 An Already Improved Pipeline
We have a much better pipeline than we did just a chapter ago!
How did I know about the na argument inside of read_sheet()? I looked at the documentation! Also, as you develop your instincts, you can press Ctrl + Space inside a function call like read_sheet() to see which parameters, like na, are available.
Let’s write this version with our NA fix to the “raw/” folder so we can continue to upgrade our pipeline. We’re going to overwrite the previous, non-useful version of the file. However, when building pipelines we need to be careful to only overwrite files when we mean to!
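A sketch of that final write, reusing the assumed {readr} function and “data/raw/” path from earlier; note that write_csv() silently replaces an existing file, which is exactly why we want this overwrite to be deliberate:

# Overwrite the earlier raw file with the NA-fixed version
write_csv(penguins_na_fix, "data/raw/penguins_raw.csv")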