{messy} package: deliberately messing up datasets
The {messy} package (by Nicola Rennie) lets us, as teachers, «mess up» datasets so that students can practice with flawed data. It can also be useful for software developers (e.g. for beta testing).
For practice, we will use some rows and columns from the starwars dataset ({dplyr}).
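The code that builds this subset is not shown. A minimal sketch, assuming the object is called data (the name used in the calls further down) and that it is simply the first seven rows of starwars restricted to four columns:

```r
library(dplyr)

# Assumption: the practice data are the first 7 rows and these
# 4 columns of dplyr::starwars, stored under the name `data`
data <- starwars |>
  slice(1:7) |>
  select(name, height, mass, eye_color)

data
```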
# A tibble: 7 × 4
name height mass eye_color
<chr> <int> <dbl> <chr>
1 Luke Skywalker 172 77 blue
2 C-3PO 167 75 yellow
3 R2-D2 96 32 red
4 Darth Vader 202 136 yellow
5 Leia Organa 150 49 brown
6 Owen Lars 178 120 blue
7 Beru Whitesun Lars 165 75 blue
make_missing() lets us introduce missing values (encoded as missing = ...) into the columns listed in cols. The messiness parameter sets the probability that a value is changed.
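The call behind the output that follows is not shown. A sketch of what it might look like, using the same argument names as the combined pipeline at the end of this document. The choice of columns, the value 0.3 and the name data_missing are illustrative assumptions, and the affected cells vary from run to run.

```r
library(messy)

# Same practice subset: first 7 rows, 4 columns of dplyr::starwars
data <- dplyr::starwars[1:7, c("name", "height", "mass", "eye_color")]

# Fix R's random seed so reruns give the same result
# (assumes messy draws from R's standard random number generator)
set.seed(123)

# Illustrative values: replace roughly 30% of name and mass with NA
data_missing <- data |>
  make_missing(cols = c("name", "mass"), missing = NA, messiness = 0.3)

data_missing
```

Columns not listed in cols (here height and eye_color) are left untouched.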
# A tibble: 7 × 4
name height mass eye_color
<chr> <int> <dbl> <chr>
1 Luke Skywalker 172 77 blue
2 <NA> 167 75 yellow
3 R2-D2 96 NA red
4 Darth Vader 202 136 yellow
5 Leia Organa 150 49 brown
6 <NA> 178 120 blue
7 Beru Whitesun Lars 165 NA blue
change_case() randomly switches character variables between uppercase and lowercase, word by word or letter by letter depending on case_type.
data |>
  change_case(cols = c("name", "eye_color"), messiness = 0.3,
              case_type = "letter") # or case_type = "word"
# A tibble: 7 × 4
name height mass eye_color
<chr> <int> <dbl> <chr>
1 luke sKyWaLKEr 172 77 blue
2 c-3pO 167 75 YelloW
3 R2-d2 96 32 rEd
4 dARTH VadeR 202 136 YeLlow
5 lEIa orgaNa 150 49 bRoWn
6 oweN laRS 178 120 blUe
7 beru WhiteSUn LaRs 165 75 bluE
add_special_chars() lets us insert special characters (*, $, !, ?, ^, etc.) into character variables.
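The call that produced the output that follows is not shown. A possible version, with the argument names taken from the combined pipeline at the end of this document; the messiness value and the name data_special are assumptions made for illustration:

```r
library(messy)

# Same practice subset: first 7 rows, 4 columns of dplyr::starwars
data <- dplyr::starwars[1:7, c("name", "height", "mass", "eye_color")]

# Illustrative messiness; special characters are inserted at random,
# so the result differs between runs
data_special <- data |>
  add_special_chars(cols = c("name", "eye_color"), messiness = 0.2)

data_special
```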
# A tibble: 7 × 4
name height mass eye_color
<chr> <int> <dbl> <chr>
1 Luk#e Skywalke@r 172 77 blue
2 C-3#PO 167 75 yellow
3 R2_-D2 96 32 red
4 Darth V&ad+er 202 136 yellow
5 Leia Organa 150 49 bro@wn
6 Owen Lars 178 120 blu^e
7 Beru Whitesun Lars 165 75 blue
add_whitespace() lets us introduce whitespace into character variables.
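The call for the output that follows is not shown either. A sketch under the same assumptions as before (argument names from the combined pipeline at the end of this document; the messiness value and the names data_ws and added are illustrative):

```r
library(messy)

# Same practice subset: first 7 rows, 4 columns of dplyr::starwars
data <- dplyr::starwars[1:7, c("name", "height", "mass", "eye_color")]

# Illustrative messiness; which values receive whitespace is random
data_ws <- data |>
  add_whitespace(cols = c("name", "eye_color"), messiness = 0.3)

# Extra whitespace is easy to miss on screen, so compare string lengths:
# a positive difference marks a value that was padded
added <- nchar(data_ws$name) - nchar(data$name)
added
```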
# A tibble: 7 × 4
name height mass eye_color
<chr> <int> <dbl> <chr>
1 "Luke Skywalker" 172 77 "blue "
2 "C-3PO" 167 75 "yellow"
3 "R2-D2 " 96 32 "red"
4 "Darth Vader" 202 136 "yellow "
5 "Leia Organa" 150 49 "brown "
6 "Owen Lars " 178 120 "blue"
7 "Beru Whitesun Lars" 165 75 "blue "
messy_colnames() randomly adds special characters and capital letters to column names.
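The call behind the output that follows is not shown. A sketch using the messiness argument as it appears in the combined pipeline at the end of this document; the value 0.5 and the name data_cols are assumptions for illustration:

```r
library(messy)

# Same practice subset: first 7 rows, 4 columns of dplyr::starwars
data <- dplyr::starwars[1:7, c("name", "height", "mass", "eye_color")]

# Only the column names are altered; the values stay as they were
data_cols <- data |>
  messy_colnames(messiness = 0.5)

names(data_cols)
```

Since the new names are random and may contain special characters, selecting columns by position (for example data_cols[[2]]) is more reliable than typing the names.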
# A tibble: 7 × 4
`na-me` `heigh%t` `ma(ss` `eye_co%lo)r`
<chr> <int> <dbl> <chr>
1 Luke Skywalker 172 77 blue
2 C-3PO 167 75 yellow
3 R2-D2 96 32 red
4 Darth Vader 202 136 yellow
5 Leia Organa 150 49 brown
6 Owen Lars 178 120 blue
7 Beru Whitesun Lars 165 75 blue
We can combine all of the above using pipes. The full documentation is at https://nrennie.rbind.io/messy/
data |>
  make_missing(cols = c("name", "eye_color"), missing = " ",
               messiness = 0.2) |>
  make_missing(cols = c("height", "mass"), missing = NA,
               messiness = 0.2) |>
  add_whitespace(cols = "name", messiness = 0.1) |>
  add_special_chars(cols = c("name", "eye_color"), messiness = 0.1) |>
  messy_colnames(messiness = 0.2)
# A tibble: 7 × 4
`#name` `HEIGHT ` MASS `eye+_co$lor`
<chr> <int> <dbl> <chr>
1 "Luke Skywalker" 172 77 " "
2 "C-3PO" NA 75 "yellow"
3 "R2-(D2" 96 NA "red"
4 "Darth Vade*r" 202 136 "^yellow"
5 "Leia $Organ.a" 150 49 "brown"
6 "# " 178 120 "blue"
7 " " 165 75 "blue"
Javier Álvarez Liébana • @dadosdelaplace • dadosdelaplace.bsky.social