Improving our pipeline means looking for opportunities, big and small, to make our code better.
We get an opportunity right off the bat! Rather than copy/pasting the entire URL we can just take the sheet ID of our Google Sheet. Getting the same information with fewer keystrokes and less opportunity for error is always a win.
# Since this Google Sheet is public to anyone with the link we don't need
# to authenticate
gs4_deauth()

# Get the ID of our Google Sheet
sheet_id <- "1v0lG-4arxF_zCCpfoUzCydwzaea7QqWTTQzTr8Dompw"

# Read in the penguins data using the googlesheets4 package
penguins_remote_improved <- read_sheet(sheet_id)
2.3 Still Read in Raw Data
We’ll continue our pipeline as in the last chapter by writing our just-ingested data to the “raw/” folder in our R Project.
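A minimal sketch of that write step, assuming the {readr} package and a “data/raw/” path carried over from the last chapter (your folder and file names may differ):

library(readr)

# Write the just-ingested data to the raw data folder
write_csv(penguins_remote_improved, "data/raw/penguins_raw.csv")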
Let’s start by using the {janitor} package to clean up our column names. This will make it easier to work with the data moving forward.
A big part of data engineering is naming things consistently and choosing how to format your variables. Naming things sounds simple but is notoriously difficult.
Keeping names simple, avoiding ambiguity where possible, and using a consistent structure for names can all help. The {janitor} package helps us enforce a consistent structure for our variables in one line of code.
In the raw Palmer Penguins data the variable names’ structure is all over the place: “studyName” uses camel case, capitalization is inconsistent across names, there are special characters like “(” that can cause issues on some systems, and there are often spaces.
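Here’s a sketch of that one-line fix, assuming the data from our improved read above (the before/after pairs in the comments follow the Palmer Penguins documentation):

library(janitor)

# Standardize every column name to lower-case snake_case
penguins_clean <- penguins_remote_improved %>%
  clean_names()

# "studyName"          becomes "study_name"
# "Culmen Length (mm)" becomes "culmen_length_mm"
# "Delta 15 N (o/oo)"  becomes "delta_15_n_o_oo"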
After running clean_names() we get consistent “snake case” separation between words, consistent lower case for all characters, no special characters like “(”, and no spaces. This standardization makes the variable names easier to remember, type, and reuse. Enforce a standard like this for your data and your end users will thank you.
Now that we’ve made our variable names easier to work with, let’s dive into the actual data. The scan_data() function from the {pointblank} package makes it easy to get a high-level overview of our data. I want the ten-thousand-foot view first, so let’s look at just the overview of our data frame.
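A sketch of that first look, assuming our cleaned data lives in penguins_clean; in {pointblank}, the sections argument takes a string of letters, and "O" requests only the overview section:

library(pointblank)

# Generate a high-level overview of the data frame
scan_data(penguins_clean, sections = "O")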
Over 40% of our data is missing, which seems like a bad sign. The scan_data() function can help us see where this data is missing quickly with a handy visualization.
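Under the same assumptions, the "M" letter requests the missing-values section of the scan:

# Generate just the missingness summary and plot
scan_data(penguins_clean, sections = "M")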
Okay, so we’ve already narrowed our problem down to a certain set of columns where the data seems to be almost entirely missing. Maybe it’s on purpose and we’re fine?
Not quite. Most of the “missing” data are hiding in list-columns, a column type in R that can contain a whole list within each row. So the data isn’t so much “missing” as “inaccessible in its current format.”
This list-column property can be useful in certain cases, but it wasn’t what we were going for here. What do we do now?
2.7 Data Engineering is About Tradeoffs + Procedures
One rule we’ve likely all heard is to never touch “raw” data. However, in a data engineering role the “raw” data can be completely unusable, nonsensical, or worse.
Having a reproducible pipeline with automated checks for data quality helps, but there’s still a human making these decisions.
Therefore, while it’s good practice not to edit “raw” data files once they’re established, we still want the “raw” data to contain the information we need. Let’s figure out how to rescue the information from those list-columns.
Since I happen to know these values are “nested” in lists, maybe we can access them using the unnest() function from {tidyr}. Let’s try that on the offending columns.
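A sketch of that attempt, assuming the offending columns are the measurement and isotope columns (the same ones we select later in this chapter):

library(tidyr)

# Try to flatten the list-columns back into regular vectors
penguins_clean %>%
  unnest(cols = c(
    culmen_length_mm:body_mass_g,
    delta_15_n_o_oo,
    delta_13_c_o_oo
  ))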
Error in `list_unchop()`:
! Can't combine `x[[1]]` <double> and `x[[4]]` <character>.
2.8 Are We Doomed?
Nope! Errors are an inevitable part of improving a pipeline. In fact, if you’re not finding anything weird in your pipeline, I’d be more nervous than if you hit errors like this one.
In this case, we see there are values of multiple types in a column that can’t be combined: a <double> and a <character>. I have a hunch based on experience, which I can investigate further by looking at the “sex” column.
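A quick tally makes the hunch concrete; this is a sketch assuming {dplyr} is loaded and that the sex column came through as a regular character column:

# Count the distinct values in the sex column
penguins_clean %>%
  count(sex)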
# A tibble: 3 × 2
sex n
<chr> <int>
1 FEMALE 165
2 MALE 168
3 NA 11
Aha! It looks like the value “NA” is being read as a character string rather than a missing value. This could be what’s going on with our initial read using read_sheet(). Let’s investigate.
penguins_na_fix <- read_sheet(sheet_id, na = "NA")

penguins_na_fix %>%
  clean_names() %>%
  select(
    c(
      culmen_length_mm:body_mass_g,
      delta_15_n_o_oo,
      delta_13_c_o_oo
    )
  )
Fantastic, the information from these columns seems to be available to us now. Let’s check the overall data frame using scan_data() again, this time generating the overview and the missingness plot simultaneously.
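A sketch of that re-scan, under the same assumptions as before ("O" and "M" request the overview and missing-values sections together):

penguins_na_fix %>%
  clean_names() %>%
  scan_data(sections = "OM")

We can also spot-check the comments column, where true missing values should now show up as <NA> rather than the string “NA”:

penguins_na_fix %>%
  clean_names() %>%
  select(comments)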
# A tibble: 344 × 1
comments
<chr>
1 Not enough blood for isotopes.
2 <NA>
3 <NA>
4 Adult not sampled.
5 <NA>
6 <NA>
7 Nest never observed with full clutch.
8 Nest never observed with full clutch.
9 No blood sample obtained.
10 No blood sample obtained for sexing.
# ℹ 334 more rows
Based on our investigation and the Palmer Penguins documentation this data is looking much more like the data we’d expect.
2.9 An Already Improved Pipeline
We have a much better pipeline than we did just a chapter ago!
How did I know about the na argument inside of read_sheet()? I looked at the documentation! Also, as you develop your instincts, you can press Ctrl + Space inside a function call like read_sheet() to see which parameters, like na, are available.
Let’s write this version with our NA fix to the “raw/” folder so we can continue to upgrade our pipeline. We’re going to overwrite the previous, non-useful version of the file. However, when building pipelines we need to be careful to only overwrite files when we mean to!
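A sketch of that final write, reusing the assumed {readr} function and “data/raw/” path from earlier; note that write_csv() silently replaces an existing file, which is exactly why we want this overwrite to be deliberate:

# Overwrite the earlier raw file with the NA-fixed version
write_csv(penguins_na_fix, "data/raw/penguins_raw.csv")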