Adding Skimr Spark Histograms in Dataframe Columns

A couple of weeks ago, I was looking for a package, I previously passed by, that prints summary statistics with inline histograms. I checked all my bookmarks and liked tweets, but I couldn’t find it! So I asked on twitter. fortunately Maëlle Salmon read the tweet and guided me to skimr by ropenscilabs, who actually release many useful packages.

In this post, I will focus on spark histograms in summary statistics and beyond. I will give an example of summarizing data in a dataframe, and adding a new column with spark histograms, since I like to play and fit anything in list columns!

skimr Summary Statistics

Basically skimr can show summary statistics, that one can go through quickly, handling different data types (numerics, factors, etc). For example, if we skim storms, and look at one variable, we can see the summaries as follows:

## load libraries
library(tidyverse)
library(skimr)
library(knitr)
library(lubridate)
library(DT)

## Set locale (for windows)
slang <- Sys.setlocale("LC_CTYPE", "Chinese")
## skim storms
storm_skim <- skim(storms)

storm_skim %>% 
  filter(var == "wind") %>% 
  kable()
var type stat level value
wind integer missing .all 0.00000
wind integer complete .all 10010.00000
wind integer n .all 10010.00000
wind integer mean .all 53.49500
wind integer sd .all 26.21387
wind integer min .all 10.00000
wind integer median .all 45.00000
wind integer quantile 25% 30.00000
wind integer quantile 75% 65.00000
wind integer max .all 160.00000
wind integer hist ▂▇▅▃▂▂▁▁▁▁ 0.00000

So we can see the inline histogram, which might be useful while skimming a summary of many variables. And we can extract all the histograms for all the numerical variables as follows:

## print kable
storm_skim %>% 
  filter(stat == "hist") %>% 
  kable()
var type stat level value
year numeric hist ▃▃▃▆▇▆▇▇▇▆ 0
month numeric hist ▁▁▁▁▁▂▅▇▃▁ 0
day integer hist ▇▆▆▆▆▆▆▆▆▆ 0
hour numeric hist ▇▁▇▁▁▇▁▇▁▁ 0
lat numeric hist ▂▇▇▆▇▇▅▂▁▁ 0
long numeric hist ▁▅▇▇▇▆▆▃▁▁ 0
wind integer hist ▂▇▅▃▂▂▁▁▁▁ 0
pressure integer hist ▁▁▁▁▁▁▂▃▇▂ 0
ts_diameter numeric hist ▇▇▅▂▁▁▁▁▁▁ 0
hu_diameter numeric hist ▇▁▁▁▁▁▁▁▁▁ 0

Spark Histograms Beyond Summary Statistics

After exploring skimr, I thought what if I wanted to have spark histograms in a dataframe as a summary for a group?, i.e. something like df %>% group_by(x) %>% summarize(hist = plot_spark_hist()). This is part of my interest in fitting anything in list columns!

So for the next example, I will use Wuzzuf_Job_Posts_Sample.csv dataset released on Kaggle Here. The dataset provides a a sample of jobs posted in 2014-2016 with some details. Here we will focus on 2015 posts.

First: we will parse the post date and extract the job posted in 2015.

## parse date
jobs <- jobs %>% 
  mutate(post_date = as.Date(post_date)) 

## filter jobs in 2015
jobs_2015 <- jobs %>% 
  filter(year(post_date) == 2015) 

Second : We will group by the job_category1 and nest the rest of the data to keep it in the same dataframe. Then we can count the number of jobs in each category and select the top ten.

## select the top ten categories with their details nested in data column
jobs_2015_topx <- jobs_2015 %>% 
  group_by(job_category1) %>% 
  nest() %>% 
  mutate(job_posts = map_int(data, ~nrow(.x))) %>% 
  arrange(desc(job_posts)) %>% 
  filter(between(row_number(), 1, 10))

THEN we will pick any numerical variable (let’s say views) and mutate the dataframe with a new column including the spark histograms. It might seem confusing at first glance, but we can do it as follows:

  • pass the nested data to a formula that skims a dataframe that only includes the target variable views

  • extract the histogram from the result stats

  • extract the level value, which is spark histograms (character after all)

## count views and mutate histograms
jobs_2015_top_hist <- jobs_2015_topx %>% 
  mutate(views_sum = map_int(data, ~sum(.x[["views"]]))) %>% 
  mutate(views_hist = map(data, ~skim(select(.x, views)) %>% 
                            filter(stat == "hist")) %>% 
           map_chr("level")) 

## print kable excluding nested data
jobs_2015_top_hist %>% 
  select(-data) %>% 
  kable()
job_category1 job_posts views_sum views_hist
IT/Software Development 2812 3658423 ▇▇▁▁▁▁▁▁▁▁
Engineering 1674 2817031 ▂▇▃▂▁▁▁▁▁▁
Customer Service/Support 1334 2320181 ▇▅▁▁▁▁▁▁▁▁
Marketing 1201 1446364 ▃▅▇▃▁▁▁▁▁▁
Sales/Retail/Business Development 951 1210833 ▂▇▇▃▁▁▁▁▁▁
Creative/Design 777 1038488 ▂▂▇▆▃▁▁▁▁▁
Quality Assurance/Quality Control 199 405129 ▃▇▆▃▁▁▁▁▁▁
Administration 199 294276 ▅▂▇▇▂▁▂▁▁▁
Editorial/Writing 195 317639 ▅▂▅▇▇▇▃▂▁▁
Installation/Maintenance/Repair 129 220305 ▃▂▇▇▃▂▁▁▁▁

Now, we have a quick way to see the distribution of views in different categories. Similarly, We can add more columns with other variables. This is definitly not enough for good exploration, and one will need to plot real histograms or other plots to gain better understanding of the data. But after all, it is just something to skim the data. And if we can try it, why not?!

comments powered by Disqus