{messy} package

Deliberately messing up datasets

{messy} package

The {messy} package (by Nicola Rennie) lets us, as teachers, deliberately "mess up" datasets so that students can practise with defective data. It can also be useful for software developers (for example, when beta testing).

# devtools::install_github("nrennie/messy")
library(messy)

Practice with starwars

For practice, we will use a few rows and columns from the starwars dataset ({dplyr}).

library(dplyr)
data <- starwars[1:7, c("name", "height", "mass", "eye_color")]
data
# A tibble: 7 × 4
  name               height  mass eye_color
  <chr>               <int> <dbl> <chr>    
1 Luke Skywalker        172    77 blue     
2 C-3PO                 167    75 yellow   
3 R2-D2                  96    32 red      
4 Darth Vader           202   136 yellow   
5 Leia Organa           150    49 brown    
6 Owen Lars             178   120 blue     
7 Beru Whitesun Lars    165    75 blue     

Random missing values

make_missing() introduces missing values (encoded as specified by missing = ...) into the columns listed in cols. The messiness parameter sets the probability that a value is replaced.

data |> 
  make_missing(cols = c("name", "mass"), messiness = 0.3, missing = NA)
# A tibble: 7 × 4
  name               height  mass eye_color
  <chr>               <int> <dbl> <chr>    
1 Luke Skywalker        172    77 blue     
2 <NA>                  167    75 yellow   
3 R2-D2                  96    NA red      
4 Darth Vader           202   136 yellow   
5 Leia Organa           150    49 brown    
6 <NA>                  178   120 blue     
7 Beru Whitesun Lars    165    NA blue     
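
To build intuition for messiness as a probability, here is a minimal base R sketch of the idea. It is not the package's implementation: the helper blank_out_sketch() is hypothetical, and it assumes that each value is replaced independently with probability messiness.

```r
# Hypothetical helper (not part of {messy}): replace each element of x
# with `missing`, independently, with probability `messiness`.
blank_out_sketch <- function(x, messiness = 0.3, missing = NA) {
  hit <- runif(length(x)) < messiness  # TRUE where a value gets replaced
  x[hit] <- missing
  x
}

set.seed(123)  # fix the seed so the random draw is repeatable
mass <- c(77, 75, 32, 136, 49, 120, 75)
blank_out_sketch(mass, messiness = 0.3)

# With many values, the share of replaced values approaches `messiness`
mean(is.na(blank_out_sketch(rep(1, 10000), messiness = 0.3)))
```

With only seven rows, the number of values actually replaced can differ noticeably from 30%, which is consistent with the small example above.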

Switch upper/lowercase

change_case() randomly switches text in character variables between uppercase and lowercase, either word by word or letter by letter depending on case_type.

data |> 
  change_case(cols = c("name", "eye_color"), messiness = 0.3,
              case_type = "letter") # or case_type = "word"
# A tibble: 7 × 4
  name               height  mass eye_color
  <chr>               <int> <dbl> <chr>    
1 luke sKyWaLKEr        172    77 blue     
2 c-3pO                 167    75 YelloW   
3 R2-d2                  96    32 rEd      
4 dARTH VadeR           202   136 YeLlow   
5 lEIa orgaNa           150    49 bRoWn    
6 oweN laRS             178   120 blUe     
7 beru WhiteSUn LaRs    165    75 bluE     

Add special characters

add_special_chars() inserts special characters (*, $, !, ?, ^, etc.) into character variables.

data |> 
  add_special_chars(cols = c("name", "eye_color"), messiness = 0.1)
# A tibble: 7 × 4
  name               height  mass eye_color
  <chr>               <int> <dbl> <chr>    
1 Luk#e Skywalke@r      172    77 blue     
2 C-3#PO                167    75 yellow   
3 R2_-D2                 96    32 red      
4 Darth V&ad+er         202   136 yellow   
5 Leia Organa           150    49 bro@wn   
6 Owen Lars             178   120 blu^e    
7 Beru Whitesun Lars    165    75 blue     

Add white spaces

add_whitespace() inserts extra whitespace into character variables.

data |> 
  add_whitespace(cols = c("name", "eye_color"), messiness = 0.5)
# A tibble: 7 × 4
  name                 height  mass eye_color
  <chr>                 <int> <dbl> <chr>    
1 "Luke Skywalker"        172    77 "blue "  
2 "C-3PO"                 167    75 "yellow" 
3 "R2-D2 "                 96    32 "red"    
4 "Darth Vader"           202   136 "yellow "
5 "Leia Organa"           150    49 "brown " 
6 "Owen Lars "            178   120 "blue"   
7 "Beru Whitesun Lars"    165    75 "blue "  
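
For students on the receiving end, this kind of mess can be undone with base R. A minimal sketch, using a small hand-typed data frame (messy_df is a hypothetical stand-in that copies three rows from the output above) rather than a fresh random draw:

```r
# Hypothetical stand-in for three rows of the messy output
messy_df <- data.frame(
  name = c("Luke Skywalker", "R2-D2 ", "Owen Lars "),
  eye_color = c("blue ", "red", "blue"),
  stringsAsFactors = FALSE
)

# Flag values that carry leading or trailing whitespace
has_padding <- messy_df$name != trimws(messy_df$name)
has_padding

# trimws() strips leading and trailing whitespace from every column
clean_df <- messy_df
clean_df[] <- lapply(clean_df, trimws)
clean_df
```

Note that trimws() only handles whitespace at the ends of a string; it assumes every column is text, as in this stand-in.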

Messy variable names

messy_colnames() randomly adds special characters and capital letters to column names.

data |> 
  messy_colnames(messiness = 0.2)
# A tibble: 7 × 4
  `na-me`            `heigh%t` `ma(ss` `eye_co%lo)r`
  <chr>                  <int>   <dbl> <chr>        
1 Luke Skywalker           172      77 blue         
2 C-3PO                    167      75 yellow       
3 R2-D2                     96      32 red          
4 Darth Vader              202     136 yellow       
5 Leia Organa              150      49 brown        
6 Owen Lars                178     120 blue         
7 Beru Whitesun Lars       165      75 blue         

In summary

We can chain all of the functions above using pipes. Full documentation is available at https://nrennie.rbind.io/messy/

data |> 
  make_missing(cols = c("name", "eye_color"), missing = " ",
               messiness = 0.2) |> 
  make_missing(cols = c("height", "mass"), missing = NA,
               messiness = 0.2) |> 
  add_whitespace(cols = "name", messiness = 0.1) |> 
  add_special_chars(cols = c("name", "eye_color"), messiness = 0.1) |> 
  messy_colnames(messiness = 0.2)
# A tibble: 7 × 4
  `#name`          `HEIGHT `  MASS `eye+_co$lor`
  <chr>                <int> <dbl> <chr>        
1 "Luke Skywalker"       172    77 " "          
2 "C-3PO"                 NA    75 "yellow"     
3 "R2-(D2"                96    NA "red"        
4 "Darth Vade*r"         202   136 "^yellow"    
5 "Leia $Organ.a"        150    49 "brown"      
6 "# "                   178   120 "blue"       
7 "  "                   165    75 "blue"
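
Going the other way makes a natural exercise. The sketch below uses base R only and hand-copies the column names and a few values from the output above. The regular expressions are assumptions chosen for this particular output, not a general-purpose cleaner, and treating a value such as "# " as missing is a guess about what the original value was.

```r
# Column names copied from the messy output above
messy_names <- c("#name", "HEIGHT ", "MASS", "eye+_co$lor")

# Drop everything except letters, digits and underscores, then lowercase
clean_names <- tolower(gsub("[^[:alnum:]_]", "", messy_names))
clean_names

# Values copied from the `#name` column: trim the ends, then treat
# strings with no letters or digits as missing (an assumption)
messy_values <- c("Luke Skywalker", "# ", "  ")
trimmed <- trimws(messy_values)
trimmed[!grepl("[[:alnum:]]", trimmed)] <- NA
trimmed
```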