Tidy Tuesday: Mario Kart World Record

I’m finally venturing into the world of Tidy Tuesday. This week is all about Mario Kart.

The Data

The data this week comes from Mario Kart World Records and contains world records for the classic (if you’re a ’90s kid) racing game on the Nintendo 64.

This video talks about the history of Mario Kart 64 World Records in greater detail. Despite its release back in 1996 (1997 in Europe and North America), it is still actively played by many, and new world records are achieved every month.

The game consists of 16 individual tracks and world records can be achieved for the fastest single lap or the fastest completed race (three laps). Also, through the years, players discovered shortcuts in many of the tracks. Fortunately, shortcut and non-shortcut world records are listed separately.

Furthermore, the Nintendo 64 was released for NTSC- and PAL-systems. On PAL-systems, the game runs a little slower. All times in this dataset are PAL-times, but they can be converted back to NTSC-times.

Import data

Read the data in with the tidytuesdayR package. This loads the readme and all the datasets for the week of interest.

library(tidyverse)
# install.packages("tidytuesdayR")
tuesdata <- tidytuesdayR::tt_load('2021-05-25')
## 
##  Downloading file 1 of 2: `drivers.csv`
##  Downloading file 2 of 2: `records.csv`
records <- tuesdata$records
drivers <- tuesdata$drivers

Look at the data

str(records)
## spec_tbl_df [2,334 x 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ track          : chr [1:2334] "Luigi Raceway" "Luigi Raceway" "Luigi Raceway" "Luigi Raceway" ...
##  $ type           : chr [1:2334] "Three Lap" "Three Lap" "Three Lap" "Three Lap" ...
##  $ shortcut       : chr [1:2334] "No" "No" "No" "No" ...
##  $ player         : chr [1:2334] "Salam" "Booth" "Salam" "Salam" ...
##  $ system_played  : chr [1:2334] "NTSC" "NTSC" "NTSC" "NTSC" ...
##  $ date           : Date[1:2334], format: "1997-02-15" "1997-02-16" ...
##  $ time_period    : chr [1:2334] "2M 12.99S" "2M 9.99S" "2M 8.99S" "2M 6.99S" ...
##  $ time           : num [1:2334] 133 130 129 127 125 ...
##  $ record_duration: num [1:2334] 1 0 12 7 54 0 0 27 0 64 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   track = col_character(),
##   ..   type = col_character(),
##   ..   shortcut = col_character(),
##   ..   player = col_character(),
##   ..   system_played = col_character(),
##   ..   date = col_date(format = ""),
##   ..   time_period = col_character(),
##   ..   time = col_double(),
##   ..   record_duration = col_double()
##   .. )

Variables of interest for me are time, which is in seconds, and probably type, for the type of record (single lap or three lap), and shortcut, because times will be different if the player used a shortcut.

Question to explore: Which track is the fastest?

Start by looking at the overall distribution of record times.

ggplot(records, aes(x=time)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The distribution is skewed right, with a large peak around 45 seconds. That makes me wonder if there is one track that is played more often. I bet the different time “groups” are due to different tracks.

How many tracks are there?

table(records$track)
## 
##     Banshee Boardwalk       Bowser's Castle        Choco Mountain 
##                    83                    69                   148 
## D.K.'s Jungle Parkway       Frappe Snowland       Kalimari Desert 
##                   180                   180                   169 
##    Koopa Troopa Beach         Luigi Raceway         Mario Raceway 
##                    89                   147                   160 
##          Moo Moo Farm          Rainbow Road         Royal Raceway 
##                    81                   179                   149 
##          Sherbet Land       Toad's Turnpike         Wario Stadium 
##                   143                   196                   201 
##          Yoshi Valley 
##                   160

There are 16, which is not too many. Banshee Boardwalk and Bowser’s Castle don’t seem to be played much because they don’t have a lot of records. Or perhaps one person dominated the record board and no one can beat them, so the world record doesn’t turn over often.

ggplot(records, aes(x=time)) + 
  geom_histogram() + 
  facet_wrap(~track)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Rainbow Road has the highest variability in record times. Three peaks, probably something to do with track type or shortcuts.

Going back to the original question, which track is the fastest, let’s just grab the minimum record time. Find the row with the minimum time.

records[which(records$time == min(records$time)),]
## # A tibble: 1 x 9
##   track       type    shortcut player system_played date       time_period  time
##   <chr>       <chr>   <chr>    <chr>  <chr>         <date>     <chr>       <dbl>
## 1 Wario Stad~ Three ~ Yes      VAJ    NTSC          2020-07-30 14.59S       14.6
## # ... with 1 more variable: record_duration <dbl>

That was a base R solution. Here is a ‘tidyverse’ solution.

records %>%
  arrange(time) %>%
  slice(1)
## # A tibble: 1 x 9
##   track       type    shortcut player system_played date       time_period  time
##   <chr>       <chr>   <chr>    <chr>  <chr>         <date>     <chr>       <dbl>
## 1 Wario Stad~ Three ~ Yes      VAJ    NTSC          2020-07-30 14.59S       14.6
## # ... with 1 more variable: record_duration <dbl>

The fastest record time (14.59 seconds) was on Wario Stadium, using a shortcut.
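The overall minimum mixes record types and shortcut categories, so it may not be a fair comparison across tracks. A minimal sketch of a like-for-like comparison follows. It runs on a made-up `toy_records` tibble (hypothetical values, not the real data) and assumes dplyr 1.0.0 or later for `slice_min()`.

```r
library(dplyr)

# Toy stand-in for `records`; track names and times are made up for illustration.
toy_records <- tibble(
  track    = c("Track A", "Track A", "Track A", "Track B", "Track B", "Track B"),
  type     = c("Three Lap", "Three Lap", "Single Lap", "Three Lap", "Three Lap", "Single Lap"),
  shortcut = c("No", "No", "No", "No", "Yes", "No"),
  time     = c(120, 118, 38, 95, 30, 31)
)

# Keep one comparable category, then take the best time on each track.
fastest_by_track <- toy_records %>%
  filter(type == "Three Lap", shortcut == "No") %>%
  group_by(track) %>%
  slice_min(time, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  arrange(time)

fastest_by_track
```

Swapping `toy_records` for `records` should give one row per track, sorted from fastest to slowest, for three-lap records set without a shortcut.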

Side tangent

What is the time distribution separated by shortcut?

ggplot(records, aes(y=time, x=shortcut, fill = shortcut)) + 
  geom_boxplot() + 
  facet_wrap(~track)

For Kalimari Desert, the shortcut doesn’t seem to change the spread much, but the median is very different. Let’s look one more time at the distribution, but switch to density so we can see overlap.

ggplot(records, aes(x=time, color = shortcut)) + 
  geom_density() + 
  facet_wrap(~track, scales = "free")

This also shows us that shortcuts were discovered on all tracks except:

  • Banshee Boardwalk
  • Bowser’s Castle
  • Koopa Troopa Beach
  • Moo Moo Farm

How do tracks relate to players? Do players have favorite tracks?

How many players?

length(unique(records$player))
## [1] 65

Let’s look at the distribution of the number of world records per player.

records %>% 
  group_by(player) %>%
  summarise(n=n()) %>% 
  arrange(desc(n)) %>% 
  ggplot(aes(x=n)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Let’s only look at the players with more than 100 world records. The ELITE!

top.players <- records %>% 
  group_by(player) %>%
  summarise(n=n()) %>% 
  filter(n>100)

top.players
## # A tibble: 6 x 2
##   player       n
##   <chr>    <int>
## 1 abney317   118
## 2 Booth      141
## 3 Dan        201
## 4 MJ         197
## 5 MR         351
## 6 Penev      371
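The `group_by()` plus `summarise(n = n())` pattern used above can also be written with `count()`. Here is a small sketch on a made-up `toy_records` tibble (hypothetical player names, with a lower cutoff than 100 so the toy data has something to keep).

```r
library(dplyr)

# Toy stand-in for `records`; player names are made up for illustration.
toy_records <- tibble(player = c("P1", "P1", "P1", "P2", "P2", "P3"))

# count() is shorthand for group_by(player) %>% summarise(n = n());
# sort = TRUE orders the result by descending n.
toy_top_players <- toy_records %>%
  count(player, sort = TRUE) %>%
  filter(n >= 2)

toy_top_players
```

On the real data, `records %>% count(player) %>% filter(n > 100)` should return the same six players.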

We have 6 players. Let’s get their track data.

top.player.tracks <- top.players %>%
  left_join(records)
## Joining, by = "player"
head(top.player.tracks)
## # A tibble: 6 x 10
##   player      n track  type  shortcut system_played date       time_period  time
##   <chr>   <int> <chr>  <chr> <chr>    <chr>         <date>     <chr>       <dbl>
## 1 abney3~   118 Luigi~ Thre~ Yes      NTSC          2016-03-22 1M 29.94S    89.9
## 2 abney3~   118 Luigi~ Thre~ Yes      NTSC          2016-03-24 1M 27.45S    87.4
## 3 abney3~   118 Luigi~ Thre~ Yes      NTSC          2021-02-09 44.97S       45.0
## 4 abney3~   118 Luigi~ Thre~ Yes      NTSC          2021-02-09 44.45S       44.4
## 5 abney3~   118 Luigi~ Thre~ Yes      NTSC          2021-02-09 42.47S       42.5
## 6 abney3~   118 Luigi~ Thre~ Yes      NTSC          2021-02-09 39.05S       39.0
## # ... with 1 more variable: record_duration <dbl>
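The join above relies on dplyr guessing the key, which is why it prints a "Joining, by" message. As a sketch on made-up toy tibbles (hypothetical names and values), the key can be given explicitly. If the `n` column isn't needed, a filtering join is an alternative.

```r
library(dplyr)

# Toy stand-ins for `records` and `top.players`; values are made up for illustration.
toy_records <- tibble(
  player = c("P1", "P1", "P2", "P3"),
  track  = c("Track A", "Track B", "Track A", "Track A"),
  time   = c(100, 40, 101, 99)
)
toy_top_players <- tibble(player = c("P1", "P3"), n = c(2L, 1L))

# Explicit key: no message, and no surprise if the tables ever share another column.
joined <- toy_top_players %>%
  left_join(toy_records, by = "player")

# Filtering join: keeps the top players' records without adding the n column.
filtered <- toy_records %>%
  semi_join(toy_top_players, by = "player")
```

Both versions keep the same rows. The choice mostly depends on whether the record count is wanted alongside each record.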

Did they all use shortcuts?

table(top.player.tracks$shortcut)
## 
##  No Yes 
## 912 467

No! Did they all use the same system?

table(top.player.tracks$system_played)
## 
## NTSC  PAL 
##  198 1181

Look at the distribution of track times by track and player.

ggplot(top.player.tracks, aes(x=time, color=shortcut)) + 
  geom_density() + 
  facet_grid(track ~ player, scales="free")
## Warning: Groups with fewer than two data points have been dropped.

Hard to read, let’s change to one line per player.

ggplot(top.player.tracks, aes(x=time, color=player)) + 
  geom_density() + 
  facet_grid(track ~ shortcut, scales="free")
## Warning: Groups with fewer than two data points have been dropped.

Look at shortcut and non-shortcut records separately so that the panels can wrap and be more visible, especially for Penev. I can’t see what they’re doing at all.

top.player.tracks %>%
  filter(shortcut == "No") %>% 
  ggplot(aes(x=time, color=player)) + 
    geom_density() + 
    facet_wrap(~track, scales="free")
## Warning: Groups with fewer than two data points have been dropped.

Things I noticed:

  • There are still two peaks for most maps, even though we’re looking at records that were set without using a shortcut. So there is something else going on that affects track time.
  • Penev dominated Yoshi Valley.
  • Among these top players, only MR and Penev have records on Toad’s Turnpike.
  • There are 7 maps or so that don’t have high record turnover.

Look closer at Banshee Boardwalk.

top.player.tracks %>%
  filter(shortcut == "No", 
         track == "Banshee Boardwalk") %>% 
  ggplot(aes(x=type, y=time, color=player)) + 
  geom_point()

The record type (single lap vs. three lap) is exactly what was driving the two distinct peaks for most if not all maps. That’s something we should have taken into consideration earlier on.
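One way to take record type into account from the start is to summarise times within each track and type. This is a minimal sketch on a made-up `toy_records` tibble (hypothetical values), assuming dplyr 1.0.0 or later for the `.groups` argument.

```r
library(dplyr)

# Toy stand-in for `records`; times are made up for illustration.
toy_records <- tibble(
  track = rep("Track A", 6),
  type  = rep(c("Single Lap", "Three Lap"), each = 3),
  time  = c(40, 39, 38, 121, 119, 118)
)

# One row per track and record type, so the two "peaks" show up as separate groups.
time_by_type <- toy_records %>%
  group_by(track, type) %>%
  summarise(
    n_records   = n(),
    median_time = median(time),
    .groups     = "drop"
  )

time_by_type
```

The same idea carries over to the plots: adding `type` to the faceting, for example with `facet_grid(type ~ track, scales = "free")`, would likely separate the two peaks into their own panels.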

Robin Donatello
Associate Professor of Statistics and Data Science

My research interests are often in the field of Public Health, Education and Student Success. I enjoy using data to help others make the world a better place.
