18 Data wrangling using R
🚧 This section is being actively worked on. 🚧
18.1 Learning objectives
The learning objectives for this session are:
- Demonstrate an understanding of the concepts of uncertainty and probability, as well as basic concepts within biostatistics
- Demonstrate knowledge of the basic overall study types and the ability to distinguish between explanatory, exploratory, and predictive studies
- Explain different types of random and non-random variation
- Demonstrate an understanding of the limitations and possibilities of statistical tools
- Understand statistical problems that are central to medicine with industrial specialisation, and understand how biostatistical tools can be applied to these problems
18.2 TODO: Section
TODO: Reading text
18.3 🧑💻 Exercise: Title
Time: ~NUM minutes.
Text
18.4 💬 Discussion activity: Title
Time: ~Num minutes.
Text
18.5 📖 Reading task: Title
Time: ~Num minutes.
Text
18.6 Summary
- TODO: List of summary items
18.7 Survey
18.8 📖 Reading task: Working with messy and tidy data
For the reading section, emphasize the characteristics of a “tidy” dataset.
Time: ~12 minutes
The concept of “tidy” data was popularized in an article (Wickham2014?) by Hadley Wickham and is described in more detail in the Tidy Data chapter of the R for Data Science online book. Before we continue with tidy data, we need to cover something related to the concept of “tidy” that has already come up in this workshop: the tidyverse. The tidyverse is an ecosystem of R packages designed to work well together, all following a strong “design philosophy” and a common style guide. This shared design makes combining packages from the tidyverse much easier, and it is why we teach it.
Part of the “tidy” in tidyverse revolves around tidy data. A dataset is tidy when:
- Each variable has its own column (e.g. “Body Weight”).
- Each observation has its own row (e.g. “Person”).
- Each value has its own cell (e.g. “Body weight for a person at a specific date”).
Take a look at the example “tidy” and “messy” data frames (also called “tibbles” in the tidyverse) below. Think about why each may be considered “tidy” or “messy”. What do you notice between the tidy versions and the messier versions?
# Datasets come from tidyr
# Tidy:
table1
# A tibble: 6 × 4
country year cases population
<chr> <dbl> <dbl> <dbl>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
# Partly tidy:
table2
# A tibble: 12 × 4
country year type count
<chr> <dbl> <chr> <dbl>
1 Afghanistan 1999 cases 745
2 Afghanistan 1999 population 19987071
3 Afghanistan 2000 cases 2666
4 Afghanistan 2000 population 20595360
5 Brazil 1999 cases 37737
6 Brazil 1999 population 172006362
7 Brazil 2000 cases 80488
8 Brazil 2000 population 174504898
9 China 1999 cases 212258
10 China 1999 population 1272915272
11 China 2000 cases 213766
12 China 2000 population 1280428583
# Messier:
table3
# A tibble: 6 × 3
country year rate
<chr> <dbl> <chr>
1 Afghanistan 1999 745/19987071
2 Afghanistan 2000 2666/20595360
3 Brazil 1999 37737/172006362
4 Brazil 2000 80488/174504898
5 China 1999 212258/1272915272
6 China 2000 213766/1280428583
# Messy:
table4a
# A tibble: 3 × 3
country `1999` `2000`
<chr> <dbl> <dbl>
1 Afghanistan 745 2666
2 Brazil 37737 80488
3 China 212258 213766
# Messy:
table4b
# A tibble: 3 × 3
country `1999` `2000`
<chr> <dbl> <dbl>
1 Afghanistan 19987071 20595360
2 Brazil 172006362 174504898
3 China 1272915272 1280428583
The “most” tidy version is table1: each column describes a single variable (e.g. population is the population size), each row is a unique observation (e.g. the first row contains the values for Afghanistan in 1999), and each cell is a single explicit value representative of its column and row.
table2 is a “long” version of table1, so it is partly “tidy”, but it doesn’t satisfy the rule that each variable has its own column, since the count column holds values for two different variables (cases and population size).
On the other hand, table3 is messy because each value in the rate column is a composite of two other variables (cases and population) stored as text, when it should be a single computed number. Both table4a and table4b have columns with ambiguous values: for example, you can’t tell from the data alone whether the values in the 1999 column are case counts or population sizes.
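Each of these messier tables can be reshaped into the tidy form of table1 with a few tidyr verbs. This is a minimal sketch using the example datasets bundled with tidyr (the variable names on the left are our own, for illustration):

```r
library(tidyr)

# table2: the `count` column holds two variables, so spread the
# `type` column out into separate `cases` and `population` columns.
tidy_from_table2 <- pivot_wider(
  table2,
  names_from = type,
  values_from = count
)

# table3: the `rate` column fuses two values, so split it on "/"
# and let `convert = TRUE` turn the pieces back into numbers.
tidy_from_table3 <- separate(
  table3,
  rate,
  into = c("cases", "population"),
  sep = "/",
  convert = TRUE
)

# table4a/table4b: years are spread across columns, so gather them
# back into a `year` column, then join the two tables together.
# (`year` stays as character text in this simple sketch.)
tidy_from_table4 <- dplyr::full_join(
  pivot_longer(table4a, -country, names_to = "year", values_to = "cases"),
  pivot_longer(table4b, -country, names_to = "year", values_to = "population"),
  by = c("country", "year")
)
```

All three results end up with the same four variables as table1: country, year, cases, and population.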
Tidy data has a few notable benefits:
- Time spent preparing your data to be tidy from the beginning can save days of added work and frustration in the long run.
- “Tidy data” is a conceptual framework that allows you to easily build off and wrangle (i.e. “manipulate”, “clean up”, “manage”) data in simpler and easy-to-interpret ways, especially when used within the framework of the tidyverse.
- Tidier data is easier to visualize and “map” onto different axes of the plot, like the x and y axes.
The concept of tidy data also gives rise to “tidy code” for visualizing and wrangling. By using “verbs” (R functions) and chaining them together in “sentences” (in a sequential pipeline), you can construct meaningful and readable code that describes in plainer English what you are doing to the data. This is one simple way that you can enhance the reproducibility of your code. There are other ways to make your code tidier and more readable, all of which we will cover in this workshop.
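As a small sketch of such a “sentence” of verbs (using tidyr’s example table1 dataset rather than our workshop data, and a made-up cases-per-10,000-people calculation):

```r
library(dplyr)
library(tidyr) # provides the example `table1` dataset

# Take table1, then keep only the rows from the year 2000, then
# compute cases per 10,000 people, then sort by that new column.
rates_2000 <- table1 |>
  filter(year == 2000) |>
  mutate(rate = cases / population * 10000) |>
  arrange(desc(rate))

rates_2000
```

Each verb does one thing, and the pipe (`|>`) chains them together so the code reads almost like plain English.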
- Document what you did to your data and why you did it, to help you remember later on. Use comments with `#` hashes, as well as Markdown text, to describe things that can’t be easily explained by the code, for example, the “why” behind the code.
- Keep the code simple: Don’t be clever or overly concise; always be clear, even if it means more code. Clear code is easier to understand than clever and sometimes overly complex or overly concise code.
Whether you are working with messy or tidy data, there are a few principles to follow:
- Always save your original raw dataset in the `data-raw/` folder.
  - Note: Whether or not you save data to `data-raw/` depends on how you collected the data and how many collaborators are on your team. You may end up storing and processing the data in another folder as a project of its own.
- Never edit your raw data directly. If you have to edit it to fix mistakes, use some form of version control so you can track it and see what has been changed.
- Only work with your raw data using R code. As much as possible or is reasonable, don’t manually edit it. Manual editing doesn’t leave a history of what you’ve done, so you can’t go back and see what you’ve changed. Always keep a history of any changes you’ve made to the data, preferably by using R code.
- Save the edited data as another dataset and store it in the `data/` folder.
We are saving to the `data/` folder because the dataset is already collected and published, so the original raw data can always be downloaded again.
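As a sketch of this workflow in code (the file names below are hypothetical placeholders, not actual files from this workshop):

```r
library(readr)

# Read the raw data with code; never overwrite this file.
# "dataset.csv" is a hypothetical file name for illustration.
raw_data <- read_csv(here::here("data-raw/dataset.csv"))

# ... wrangle the data with R code here ...

# Save the processed result as a *separate* file in `data/`,
# leaving the raw data in `data-raw/` untouched.
write_csv(raw_data, here::here("data/dataset.csv"))
```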
18.9 💬 Discussion activity: How tidy is our data?
Time: ~10 minutes.
Now that we’ve learned about the concept of “tidy” data, let’s discuss how tidy our data is. At a glance, there are approximately 7 things in this data that are mildly untidy or could be improved. Can you identify them?
- On your own, take 2 minutes to look over the data and try to find what could be improved.
- With your neighbour, take 5 minutes and discuss what you found and see if you have overlap in agreement or if you found different things.
- As a group, we will take 3 minutes and we will share what you all discussed and then go over the 7 things.
18.10 Summary
- Use `read_csv()` or `read_csv2()` to import data from a `.csv` (comma- or semicolon-separated) file.
- Use `here::here()` to make file paths easier to manage in your R project.
- Use `glimpse()` to get an overview of the data.
- Use the Quarto code chunk options like `echo`, `message`, or `warning` to control what is shown in the output of the code chunk.
- Use tab auto-completion when writing code.
- With tidy data, each variable has its own column, each observation has its own row, and each value has its own cell.
- Never edit raw data. Instead, use R code to make changes and clean up the raw data, rather than manually editing the dataset.
18.11 Survey
Please complete the survey for this session: