DataFramesMeta.jl Tutorial
This is a port of the HarvardX series PH525x Genomics class tutorial on dplyr. View the original here and the source here.
What is DataFramesMeta.jl?
DataFramesMeta.jl is a Julia package to transform and summarize tabular data. It uses Julia macros to create a domain-specific language for convenient syntax to work with data frames from DataFrames.jl. DataFramesMeta.jl mirrors concepts in DataFrames.jl as closely as possible, without implementing new features on it's own. For a deeper explanation of DataFramesMeta.jl, see the documentation.
DataFramesMeta.jl is heavily inspired by R's dplyr
. If you are familiar with dplyr
this guide should get you up to speed with DataFramesMeta.jl.
However this tutorial assumes a cursory knowledge of DataFrames.jl. For instance, you should be familiar with the concept of a symbol in Julia (:x
), and how it is used to index a data frame in DataFrames.jl, such as with df[:, :x]
.
Why Is It Useful?
Like dplyr, the DataFramesMeta.jl package contains a set of macros (or "verbs") that perform common data manipulation operations such as filtering for rows, selecting specific columns, re-ordering rows, and adding new columns.
In addition, DataFramesMeta.jl makes it easier to summarize data with the @combine
verb, which implements the split-apply-combine pattern commonly seen in dplyr
and other data manipulation packages.
How Does It Compare To Using Base Functions in Julia and in DataFrames.jl?
If you are familiar with Julia, you are probably familiar with base Julia functions such map
, and broadcast
(akin to *apply
in R). These functions are convenient to use, but are designed to work with arrays, not tabular data.
DataFrames.jl provides the functions select
, transform
, and more to work with data frames. Unlike map
and broadcast
, these functions are designed to work with tabular data, but have a complicated syntax.
DataFramesMeta.jl provides a convenient syntax for working with the vectors in a data frame so that you get all the convenience of Base Julia and DataFrames combined.
How Do I Get DataFramesMeta.jl?
To install DataFramesMeta.jl, which also installs DataFrames.jl:
import Pkg
Pkg.activate(; temp=true) # activate a temporary environment for this tutorial
Pkg.add("DataFramesMeta");
To load DataFramesMeta.jl, which also loads DataFrames.jl:
julia> using DataFramesMeta
For this tutorial, we will install some additional packages as well.
Pkg.add(["CSV", "HTTP"])
Now we load them. We also load the Statistics standard library, which is shipped with Julia, so does not need to be installed.
julia> using CSV, HTTP, Statistics
We will use CSV.jl and HTTP.jl for downloading our dataset from the internet.
Data: Mammals Sleep
The msleep
(mammals sleep) data set contains the sleep times and weights for a set of mammals and is available in the dagdata repository on GitHub. This data set contains 83 rows and 11 variables.
We can load the data directly into a DataFrame from the url
.
julia> url = "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/msleep_ggplot2.csv"
"https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/msleep_ggplot2.csv"
julia> msleep = CSV.read(HTTP.get(url).body, DataFrame; missingstring="NA")
83×11 DataFrame Row │ name genus vore order cons ⋯ │ String31 String15 String7 String15 Stri ⋯ ─────┼────────────────────────────────────────────────────────────────────────── 1 │ Cheetah Acinonyx carni Carnivora lc ⋯ 2 │ Owl monkey Aotus omni Primates miss 3 │ Mountain beaver Aplodontia herbi Rodentia nt 4 │ Greater short-tailed shrew Blarina omni Soricomorpha lc 5 │ Cow Bos herbi Artiodactyla dome ⋯ 6 │ Three-toed sloth Bradypus herbi Pilosa miss 7 │ Northern fur seal Callorhinus carni Carnivora vu 8 │ Vesper mouse Calomys missing Rodentia miss ⋮ │ ⋮ ⋮ ⋮ ⋮ ⋱ 77 │ Brazilian tapir Tapirus herbi Perissodactyla vu ⋯ 78 │ Tenrec Tenrec omni Afrosoricida miss 79 │ Tree shrew Tupaia omni Scandentia miss 80 │ Bottle-nosed dolphin Tursiops carni Cetacea miss 81 │ Genet Genetta carni Carnivora miss ⋯ 82 │ Arctic fox Vulpes carni Carnivora miss 83 │ Red fox Vulpes carni Carnivora miss 7 columns and 68 rows omitted
The columns (in order) correspond to the following:
column name | Description |
---|---|
:name | common name |
:genus | taxonomic rank |
:vore | carnivore, omnivore or herbivore? |
:order | taxonomic rank |
:conservation | the conservation status of the mammal |
:sleep_total | total amount of sleep, in hours |
:sleep_rem | rem sleep, in hours |
:sleep_cycle | length of sleep cycle, in hours |
:awake | amount of time spent awake, in hours |
:brainwt | brain weight in kilograms |
:bodywt | body weight in kilograms |
Important DataFramesMeta.jl Verbs To Remember
Many DataFrames.jl macros come in two forms, a version which operates on columns as a whole and a version which operations row-wise, prefixed by r
.
DataFramesMeta.jl macro | By-row version | Description | dplyr equivalent |
---|---|---|---|
@select | @rselect | select columns | select |
@transform | @rtransform | create new columns | mutate |
@subset | @rsubset | filter rows | filter |
@orderby | @rorderby | re-order or arrange rows | arrange |
@combine | summarise values | summarize (but @combine is more flexible) | |
@groupby | allows for group operations in the "split-apply-combine" concept | group_by |
DataFramesMeta.jl Verbs In Action
Two of the most basic functions are @select
and @subset
, which selects columns and filters rows respectively. To reference columns, use the Symbol
of the column name, i.e. :name
refers to the column msleep.name
.
Selecting Columns Using @select
Select a set of columns: the :name
and the :sleep_total
columns.
julia> @select msleep :name :sleep_total
83×2 DataFrame Row │ name sleep_total │ String31 Float64 ─────┼───────────────────────────────────────── 1 │ Cheetah 12.1 2 │ Owl monkey 17.0 3 │ Mountain beaver 14.4 4 │ Greater short-tailed shrew 14.9 5 │ Cow 4.0 6 │ Three-toed sloth 14.4 7 │ Northern fur seal 8.7 8 │ Vesper mouse 7.0 ⋮ │ ⋮ ⋮ 77 │ Brazilian tapir 4.4 78 │ Tenrec 15.6 79 │ Tree shrew 8.9 80 │ Bottle-nosed dolphin 5.2 81 │ Genet 6.3 82 │ Arctic fox 12.5 83 │ Red fox 9.8 68 rows omitted
If you have a column name stored as a variable, you can select it as a column with the syntax $
.
julia> varname = :sleep_total
:sleep_total
julia> @select msleep :name $varname
83×2 DataFrame Row │ name sleep_total │ String31 Float64 ─────┼───────────────────────────────────────── 1 │ Cheetah 12.1 2 │ Owl monkey 17.0 3 │ Mountain beaver 14.4 4 │ Greater short-tailed shrew 14.9 5 │ Cow 4.0 6 │ Three-toed sloth 14.4 7 │ Northern fur seal 8.7 8 │ Vesper mouse 7.0 ⋮ │ ⋮ ⋮ 77 │ Brazilian tapir 4.4 78 │ Tenrec 15.6 79 │ Tree shrew 8.9 80 │ Bottle-nosed dolphin 5.2 81 │ Genet 6.3 82 │ Arctic fox 12.5 83 │ Red fox 9.8 68 rows omitted
The $
sign has special meaning in DataFramesMeta.jl. We use it for any column reference which is not a symbol. Without it, DataFramesMeta.jl can't tell whether varname
refers to the column :sleep_total
.
You can also use $
to refer to columns with strings
julia> varname = "sleep_total"
"sleep_total"
julia> @select msleep :name $varname
83×2 DataFrame Row │ name sleep_total │ String31 Float64 ─────┼───────────────────────────────────────── 1 │ Cheetah 12.1 2 │ Owl monkey 17.0 3 │ Mountain beaver 14.4 4 │ Greater short-tailed shrew 14.9 5 │ Cow 4.0 6 │ Three-toed sloth 14.4 7 │ Northern fur seal 8.7 8 │ Vesper mouse 7.0 ⋮ │ ⋮ ⋮ 77 │ Brazilian tapir 4.4 78 │ Tenrec 15.6 79 │ Tree shrew 8.9 80 │ Bottle-nosed dolphin 5.2 81 │ Genet 6.3 82 │ Arctic fox 12.5 83 │ Red fox 9.8 68 rows omitted
as well as vectors of variable names
julia> varnames = ["name", "sleep_total"]
2-element Vector{String}: "name" "sleep_total"
julia> @select msleep $varnames
83×2 DataFrame Row │ name sleep_total │ String31 Float64 ─────┼───────────────────────────────────────── 1 │ Cheetah 12.1 2 │ Owl monkey 17.0 3 │ Mountain beaver 14.4 4 │ Greater short-tailed shrew 14.9 5 │ Cow 4.0 6 │ Three-toed sloth 14.4 7 │ Northern fur seal 8.7 8 │ Vesper mouse 7.0 ⋮ │ ⋮ ⋮ 77 │ Brazilian tapir 4.4 78 │ Tenrec 15.6 79 │ Tree shrew 8.9 80 │ Bottle-nosed dolphin 5.2 81 │ Genet 6.3 82 │ Arctic fox 12.5 83 │ Red fox 9.8 68 rows omitted
Similarly, to select the first column, use the syntax $1
.
julia> @select msleep $1
83×1 DataFrame Row │ name │ String31 ─────┼──────────────────────────── 1 │ Cheetah 2 │ Owl monkey 3 │ Mountain beaver 4 │ Greater short-tailed shrew 5 │ Cow 6 │ Three-toed sloth 7 │ Northern fur seal 8 │ Vesper mouse ⋮ │ ⋮ 77 │ Brazilian tapir 78 │ Tenrec 79 │ Tree shrew 80 │ Bottle-nosed dolphin 81 │ Genet 82 │ Arctic fox 83 │ Red fox 68 rows omitted
To select all the columns except a specific column, use the Not
function for inverse selection.
julia> @select msleep Not(:name)
83×10 DataFrame Row │ genus vore order conservation sleep_total sleep ⋯ │ String15 String7 String15 String15 Float64 Float ⋯ ─────┼────────────────────────────────────────────────────────────────────────── 1 │ Acinonyx carni Carnivora lc 12.1 missi ⋯ 2 │ Aotus omni Primates missing 17.0 3 │ Aplodontia herbi Rodentia nt 14.4 4 │ Blarina omni Soricomorpha lc 14.9 5 │ Bos herbi Artiodactyla domesticated 4.0 ⋯ 6 │ Bradypus herbi Pilosa missing 14.4 7 │ Callorhinus carni Carnivora vu 8.7 8 │ Calomys missing Rodentia missing 7.0 missi ⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱ 77 │ Tapirus herbi Perissodactyla vu 4.4 ⋯ 78 │ Tenrec omni Afrosoricida missing 15.6 79 │ Tupaia omni Scandentia missing 8.9 80 │ Tursiops carni Cetacea missing 5.2 missi 81 │ Genetta carni Carnivora missing 6.3 ⋯ 82 │ Vulpes carni Carnivora missing 12.5 missi 83 │ Vulpes carni Carnivora missing 9.8 5 columns and 68 rows omitted
To select a range of columns by name, use the Between
operator:
julia> @select msleep Between(:name, :order)
83×4 DataFrame Row │ name genus vore order │ String31 String15 String7 String15 ─────┼─────────────────────────────────────────────────────────────────── 1 │ Cheetah Acinonyx carni Carnivora 2 │ Owl monkey Aotus omni Primates 3 │ Mountain beaver Aplodontia herbi Rodentia 4 │ Greater short-tailed shrew Blarina omni Soricomorpha 5 │ Cow Bos herbi Artiodactyla 6 │ Three-toed sloth Bradypus herbi Pilosa 7 │ Northern fur seal Callorhinus carni Carnivora 8 │ Vesper mouse Calomys missing Rodentia ⋮ │ ⋮ ⋮ ⋮ ⋮ 77 │ Brazilian tapir Tapirus herbi Perissodactyla 78 │ Tenrec Tenrec omni Afrosoricida 79 │ Tree shrew Tupaia omni Scandentia 80 │ Bottle-nosed dolphin Tursiops carni Cetacea 81 │ Genet Genetta carni Carnivora 82 │ Arctic fox Vulpes carni Carnivora 83 │ Red fox Vulpes carni Carnivora 68 rows omitted
To select all columns that start with the character string "sl"
use regular expressions in conjunction with Cols
.
julia> @select msleep Cols(r"^sl")
83×3 DataFrame Row │ sleep_total sleep_rem sleep_cycle │ Float64 Float64? Float64? ─────┼──────────────────────────────────────── 1 │ 12.1 missing missing 2 │ 17.0 1.8 missing 3 │ 14.4 2.4 missing 4 │ 14.9 2.3 0.133333 5 │ 4.0 0.7 0.666667 6 │ 14.4 2.2 0.766667 7 │ 8.7 1.4 0.383333 8 │ 7.0 missing missing ⋮ │ ⋮ ⋮ ⋮ 77 │ 4.4 1.0 0.9 78 │ 15.6 2.3 missing 79 │ 8.9 2.6 0.233333 80 │ 5.2 missing missing 81 │ 6.3 1.3 missing 82 │ 12.5 missing missing 83 │ 9.8 2.4 0.35 68 rows omitted
Regular expressions are powerful, but can be difficult for new users to understand. Here are some quick tips.
r"^abc"
= Starts with"abc"
r"abc$"
= Ends with"abc"
r"abc"
= Contains"abc"
anywhere.
Selecting Rows Using @subset
and @rsubset
Filter the rows for mammals that sleep a total of more than 16 hours.
julia> @subset msleep :sleep_total .>= 16
8×11 DataFrame Row │ name genus vore order conserv ⋯ │ String31 String15 String7 String15 String1 ⋯ ─────┼────────────────────────────────────────────────────────────────────────── 1 │ Owl monkey Aotus omni Primates missing ⋯ 2 │ Long-nosed armadillo Dasypus carni Cingulata lc 3 │ North American Opossum Didelphis omni Didelphimorphia lc 4 │ Big brown bat Eptesicus insecti Chiroptera lc 5 │ Thick-tailed opposum Lutreolina carni Didelphimorphia lc ⋯ 6 │ Little brown bat Myotis insecti Chiroptera missing 7 │ Giant armadillo Priodontes insecti Cingulata en 8 │ Arctic ground squirrel Spermophilus herbi Rodentia lc 7 columns omitted
In the above expression, the .>=
means we "broadcast" the >=
comparison across the whole column. We can use a simpler syntax, @rsubset
which automatically broadcasts all operations.
julia> @rsubset msleep :sleep_total > 16
8×11 DataFrame Row │ name genus vore order conserv ⋯ │ String31 String15 String7 String15 String1 ⋯ ─────┼────────────────────────────────────────────────────────────────────────── 1 │ Owl monkey Aotus omni Primates missing ⋯ 2 │ Long-nosed armadillo Dasypus carni Cingulata lc 3 │ North American Opossum Didelphis omni Didelphimorphia lc 4 │ Big brown bat Eptesicus insecti Chiroptera lc 5 │ Thick-tailed opposum Lutreolina carni Didelphimorphia lc ⋯ 6 │ Little brown bat Myotis insecti Chiroptera missing 7 │ Giant armadillo Priodontes insecti Cingulata en 8 │ Arctic ground squirrel Spermophilus herbi Rodentia lc 7 columns omitted
Subset the rows for mammals that sleep a total of more than 16 hours and have a body weight of greater than 1 kilogram. For this we put multiple operations on separate lines in a single block.
julia> @rsubset msleep begin :sleep_total >= 16 :bodywt >= 1 end
3×11 DataFrame Row │ name genus vore order conservat ⋯ │ String31 String15 String7 String15 String15 ⋯ ─────┼────────────────────────────────────────────────────────────────────────── 1 │ Long-nosed armadillo Dasypus carni Cingulata lc ⋯ 2 │ North American Opossum Didelphis omni Didelphimorphia lc 3 │ Giant armadillo Priodontes insecti Cingulata en 7 columns omitted
If you are coming from dplyr
, you can also write the above command in a way that looks more familiar.
julia> @rsubset(msleep, :sleep_total >= 16, :bodywt >= 1)
3×11 DataFrame Row │ name genus vore order conservat ⋯ │ String31 String15 String7 String15 String15 ⋯ ─────┼────────────────────────────────────────────────────────────────────────── 1 │ Long-nosed armadillo Dasypus carni Cingulata lc ⋯ 2 │ North American Opossum Didelphis omni Didelphimorphia lc 3 │ Giant armadillo Priodontes insecti Cingulata en 7 columns omitted
Filter the rows for mammals in the Perissodactyla and Primates taxonomic order. We wrap code in a let
block to ensure things are fast.
julia> let relevant_orders = Set(["Perissodactyla", "Primates"]) @rsubset msleep :order in relevant_orders end
15×11 DataFrame Row │ name genus vore order conservation ⋯ │ String31 String15 String7 String15 String15 ⋯ ─────┼────────────────────────────────────────────────────────────────────────── 1 │ Owl monkey Aotus omni Primates missing ⋯ 2 │ Grivet Cercopithecus omni Primates lc 3 │ Horse Equus herbi Perissodactyla domesticated 4 │ Donkey Equus herbi Perissodactyla domesticated 5 │ Patas monkey Erythrocebus omni Primates lc ⋯ 6 │ Galago Galago omni Primates missing 7 │ Human Homo omni Primates missing 8 │ Mongoose lemur Lemur herbi Primates vu 9 │ Macaque Macaca omni Primates missing ⋯ 10 │ Slow loris Nyctibeus carni Primates missing 11 │ Chimpanzee Pan omni Primates missing 12 │ Baboon Papio omni Primates missing 13 │ Potto Perodicticus omni Primates lc ⋯ 14 │ Squirrel monkey Saimiri omni Primates missing 15 │ Brazilian tapir Tapirus herbi Perissodactyla vu 6 columns omitted
You can use the boolean operators (e.g. >, <, >=, <=, !=, in) to create the logical tests.
Chain.jl
Before we go any further, let's introduce the @chain
macro from Chain.jl, which is re-exported by DataFramesMeta.jl. @chain
allows for you to pipe the output from one operation into the input of another operation. The idea of piping is to read the functions from left to right. The syntax and design of @chain
is very similar to %>%
which users of dplyr are familiar with.
To show it's usefulness, let's use @select
and @rsubset
one after the other.
msleep_1 = @select msleep :name :sleep_total
msleep_2 = @rsubset msleep_1 :sleep_total > 16
Now in this case, we will pipe the msleep data frame to the function that will select two columns (:name
and :sleep_total
) and then pipe the new data frame to the @rsubset
operation. This method involves awkwardly creating and naming temporary data frames. We can avoid this with @chain
.
julia> @chain msleep begin @select :name :sleep_total @rsubset :sleep_total > 16 end
8×2 DataFrame Row │ name sleep_total │ String31 Float64 ─────┼───────────────────────────────────── 1 │ Owl monkey 17.0 2 │ Long-nosed armadillo 17.4 3 │ North American Opossum 18.0 4 │ Big brown bat 19.7 5 │ Thick-tailed opposum 19.4 6 │ Little brown bat 19.9 7 │ Giant armadillo 18.1 8 │ Arctic ground squirrel 16.6
You will soon see how useful the @chain
macro is when we start to combine many functions.
Back To dplyr Verbs In Action
Now that you know about the @chain
macro, we will use it throughout the rest of this tutorial.
Arrange Or Re-order Rows Using @orderby
To arrange (or re-order) rows by a particular column, such as the taxonomic order, list the name of the column you want to arrange the rows by:
julia> @orderby msleep :order
83×11 DataFrame Row │ name genus vore order co ⋯ │ String31 String15 String7 String15 St ⋯ ─────┼────────────────────────────────────────────────────────────────────────── 1 │ Tenrec Tenrec omni Afrosoricida mi ⋯ 2 │ Cow Bos herbi Artiodactyla do 3 │ Roe deer Capreolus herbi Artiodactyla lc 4 │ Goat Capri herbi Artiodactyla lc 5 │ Giraffe Giraffa herbi Artiodactyla cd ⋯ 6 │ Sheep Ovis herbi Artiodactyla do 7 │ Pig Sus omni Artiodactyla do 8 │ Cheetah Acinonyx carni Carnivora lc ⋮ │ ⋮ ⋮ ⋮ ⋮ ⋱ 77 │ Eastern american chipmunk Tamias herbi Rodentia mi ⋯ 78 │ Tree shrew Tupaia omni Scandentia mi 79 │ Greater short-tailed shrew Blarina omni Soricomorpha lc 80 │ Star-nosed mole Condylura omni Soricomorpha lc 81 │ Lesser short-tailed shrew Cryptotis omni Soricomorpha lc ⋯ 82 │ Eastern american mole Scalopus insecti Soricomorpha lc 83 │ Musk shrew Suncus missing Soricomorpha mi 7 columns and 68 rows omitted
Now we will select three columns from msleep, arrange the rows by the taxonomic order and then arrange the rows by :sleep_total
. Finally, keep the first 10 rows of the data frame.
julia> @chain msleep begin @select :name :order :sleep_total @orderby :order :sleep_total first(10) end
10×3 DataFrame Row │ name order sleep_total │ String31 String15 Float64 ─────┼───────────────────────────────────────── 1 │ Tenrec Afrosoricida 15.6 2 │ Giraffe Artiodactyla 1.9 3 │ Roe deer Artiodactyla 3.0 4 │ Sheep Artiodactyla 3.8 5 │ Cow Artiodactyla 4.0 6 │ Goat Artiodactyla 5.3 7 │ Pig Artiodactyla 9.1 8 │ Caspian seal Carnivora 3.5 9 │ Gray seal Carnivora 6.2 10 │ Genet Carnivora 6.3
The last line of the above block, first(10)
, does not have @
. This is because first
is a Julia function, not a macro, whose names always begin with @
.
Same as above, except here we filter the rows for mammals that sleep for 16 or more hours, instead of showing the head of the final data frame:
julia> @chain msleep begin @select :name :order :sleep_total @orderby :order :sleep_total @rsubset :sleep_total > 16 end
8×3 DataFrame Row │ name order sleep_total │ String31 String15 Float64 ─────┼────────────────────────────────────────────────────── 1 │ Big brown bat Chiroptera 19.7 2 │ Little brown bat Chiroptera 19.9 3 │ Long-nosed armadillo Cingulata 17.4 4 │ Giant armadillo Cingulata 18.1 5 │ North American Opossum Didelphimorphia 18.0 6 │ Thick-tailed opposum Didelphimorphia 19.4 7 │ Owl monkey Primates 17.0 8 │ Arctic ground squirrel Rodentia 16.6
Something slightly more complicated: same as above, except arrange the rows in the :sleep_total
column in a descending order. Since this column is numeric we can do it by negating it:
julia> @chain msleep begin @select :name :order :sleep_total @orderby begin :order -:sleep_total end @rsubset :sleep_total >= 16 end
8×3 DataFrame Row │ name order sleep_total │ String31 String15 Float64 ─────┼────────────────────────────────────────────────────── 1 │ Little brown bat Chiroptera 19.9 2 │ Big brown bat Chiroptera 19.7 3 │ Giant armadillo Cingulata 18.1 4 │ Long-nosed armadillo Cingulata 17.4 5 │ Thick-tailed opposum Didelphimorphia 19.4 6 │ North American Opossum Didelphimorphia 18.0 7 │ Owl monkey Primates 17.0 8 │ Arctic ground squirrel Rodentia 16.6
Alternatively, using the StatsBase.jl package you could have written ordinalrank(:sleep_total, rev=true)
. Finally, if you wanted to avoid using an additional package you could use the following expression invperm(sortperm(:sleep_total, rev=true))
.
Here is a minimal example of sorting string column in reverse:
julia> using StatsBase
julia> df = DataFrame(group=[1, 2, 1, 2, 1], name = ["Bob", "Dexter", "Alice", "Eve", "Cedric"])
5×2 DataFrame Row │ group name │ Int64 String ─────┼─────────────── 1 │ 1 Bob 2 │ 2 Dexter 3 │ 1 Alice 4 │ 2 Eve 5 │ 1 Cedric
julia> @orderby df begin :group ordinalrank(:name, rev=true) end
5×2 DataFrame Row │ group name │ Int64 String ─────┼─────────────── 1 │ 1 Cedric 2 │ 1 Bob 3 │ 1 Alice 4 │ 2 Eve 5 │ 2 Dexter
Create New Columns Using @transform
and @rtransform
The @transform
macro will add new columns to the data frame. Like with other macros, use @rtransform
to operate row-wise. Create a new column called :rem_proportion
, which is the ratio of rem sleep to total amount of sleep.
julia> @rtransform msleep :rem_proportion = :sleep_rem / :sleep_total
83×12 DataFrame Row │ name genus vore order cons ⋯ │ String31 String15 String7 String15 Stri ⋯ ─────┼────────────────────────────────────────────────────────────────────────── 1 │ Cheetah Acinonyx carni Carnivora lc ⋯ 2 │ Owl monkey Aotus omni Primates miss 3 │ Mountain beaver Aplodontia herbi Rodentia nt 4 │ Greater short-tailed shrew Blarina omni Soricomorpha lc 5 │ Cow Bos herbi Artiodactyla dome ⋯ 6 │ Three-toed sloth Bradypus herbi Pilosa miss 7 │ Northern fur seal Callorhinus carni Carnivora vu 8 │ Vesper mouse Calomys missing Rodentia miss ⋮ │ ⋮ ⋮ ⋮ ⋮ ⋱ 77 │ Brazilian tapir Tapirus herbi Perissodactyla vu ⋯ 78 │ Tenrec Tenrec omni Afrosoricida miss 79 │ Tree shrew Tupaia omni Scandentia miss 80 │ Bottle-nosed dolphin Tursiops carni Cetacea miss 81 │ Genet Genetta carni Carnivora miss ⋯ 82 │ Arctic fox Vulpes carni Carnivora miss 83 │ Red fox Vulpes carni Carnivora miss 8 columns and 68 rows omitted
You can add many columns at a time by placing multiple operations in a block.
julia> @rtransform msleep begin :rem_proportion = :sleep_rem / :sleep_total :bodywt_grams = :bodywt * 1000 end
83×13 DataFrame Row │ name genus vore order cons ⋯ │ String31 String15 String7 String15 Stri ⋯ ─────┼────────────────────────────────────────────────────────────────────────── 1 │ Cheetah Acinonyx carni Carnivora lc ⋯ 2 │ Owl monkey Aotus omni Primates miss 3 │ Mountain beaver Aplodontia herbi Rodentia nt 4 │ Greater short-tailed shrew Blarina omni Soricomorpha lc 5 │ Cow Bos herbi Artiodactyla dome ⋯ 6 │ Three-toed sloth Bradypus herbi Pilosa miss 7 │ Northern fur seal Callorhinus carni Carnivora vu 8 │ Vesper mouse Calomys missing Rodentia miss ⋮ │ ⋮ ⋮ ⋮ ⋮ ⋱ 77 │ Brazilian tapir Tapirus herbi Perissodactyla vu ⋯ 78 │ Tenrec Tenrec omni Afrosoricida miss 79 │ Tree shrew Tupaia omni Scandentia miss 80 │ Bottle-nosed dolphin Tursiops carni Cetacea miss 81 │ Genet Genetta carni Carnivora miss ⋯ 82 │ Arctic fox Vulpes carni Carnivora miss 83 │ Red fox Vulpes carni Carnivora miss 9 columns and 68 rows omitted
Using @transform
instead of @rtransform
will let us work with the column as a whole, and not a single row at a time. Let's create a new variable showing how far an animal's sleep time is from the average of all animals.
julia> @transform msleep :demeand_sleep = :sleep_total .- mean(:sleep_total)
83×12 DataFrame Row │ name genus vore order cons ⋯ │ String31 String15 String7 String15 Stri ⋯ ─────┼────────────────────────────────────────────────────────────────────────── 1 │ Cheetah Acinonyx carni Carnivora lc ⋯ 2 │ Owl monkey Aotus omni Primates miss 3 │ Mountain beaver Aplodontia herbi Rodentia nt 4 │ Greater short-tailed shrew Blarina omni Soricomorpha lc 5 │ Cow Bos herbi Artiodactyla dome ⋯ 6 │ Three-toed sloth Bradypus herbi Pilosa miss 7 │ Northern fur seal Callorhinus carni Carnivora vu 8 │ Vesper mouse Calomys missing Rodentia miss ⋮ │ ⋮ ⋮ ⋮ ⋮ ⋱ 77 │ Brazilian tapir Tapirus herbi Perissodactyla vu ⋯ 78 │ Tenrec Tenrec omni Afrosoricida miss 79 │ Tree shrew Tupaia omni Scandentia miss 80 │ Bottle-nosed dolphin Tursiops carni Cetacea miss 81 │ Genet Genetta carni Carnivora miss ⋯ 82 │ Arctic fox Vulpes carni Carnivora miss 83 │ Red fox Vulpes carni Carnivora miss 8 columns and 68 rows omitted
Finally, note that you can create a new column with the name taken from an existing variable, or a new column name with spaces in it, with $
julia> newname = :rem_proportion
:rem_proportion
julia> @rtransform msleep begin $newname = :sleep_rem / :sleep_total $"Body weight in grams" = :bodywt * 1000 end
83×13 DataFrame Row │ name genus vore order cons ⋯ │ String31 String15 String7 String15 Stri ⋯ ─────┼────────────────────────────────────────────────────────────────────────── 1 │ Cheetah Acinonyx carni Carnivora lc ⋯ 2 │ Owl monkey Aotus omni Primates miss 3 │ Mountain beaver Aplodontia herbi Rodentia nt 4 │ Greater short-tailed shrew Blarina omni Soricomorpha lc 5 │ Cow Bos herbi Artiodactyla dome ⋯ 6 │ Three-toed sloth Bradypus herbi Pilosa miss 7 │ Northern fur seal Callorhinus carni Carnivora vu 8 │ Vesper mouse Calomys missing Rodentia miss ⋮ │ ⋮ ⋮ ⋮ ⋮ ⋱ 77 │ Brazilian tapir Tapirus herbi Perissodactyla vu ⋯ 78 │ Tenrec Tenrec omni Afrosoricida miss 79 │ Tree shrew Tupaia omni Scandentia miss 80 │ Bottle-nosed dolphin Tursiops carni Cetacea miss 81 │ Genet Genetta carni Carnivora miss ⋯ 82 │ Arctic fox Vulpes carni Carnivora miss 83 │ Red fox Vulpes carni Carnivora miss 9 columns and 68 rows omitted
Create Summaries of the Data Frame using @combine
The @combine
macro will create summary statistics for a given column in the data frame, such as finding the mean. For example, to compute the average number of hours of sleep, apply the mean
function to the column :sleep_total
and call the summary value :avg_sleep
.
julia> @chain msleep @combine :avg_sleep = mean(:sleep_total)
1×1 DataFrame Row │ avg_sleep │ Float64 ─────┼─────────── 1 │ 10.4337
There are many other summary statistics you could consider such std
, minimum
, maximum
, median
, sum
, length
(returns the length of vector), first
(returns first value in vector), and last
(returns last value in vector).
julia> @combine msleep begin :avg_sleep = mean(:sleep_total) :min_sleep = minimum(:sleep_total) :max_sleep = maximum(:sleep_total) :total = length(:sleep_total) end
1×4 DataFrame Row │ avg_sleep min_sleep max_sleep total │ Float64 Float64 Float64 Int64 ─────┼──────────────────────────────────────── 1 │ 10.4337 1.9 19.9 83
DataFrames.jl also provides the function describe
which performs many of these summaries automatically.
julia> describe(msleep)
11×7 DataFrame Row │ variable mean min median max ⋯ │ Symbol Union… Any Union… Any ⋯ ─────┼────────────────────────────────────────────────────────────────────────── 1 │ name African elephant Western american ch ⋯ 2 │ genus Acinonyx Vulpes 3 │ vore carni omni 4 │ order Afrosoricida Soricomorpha 5 │ conservation cd vu ⋯ 6 │ sleep_total 10.4337 1.9 10.1 19.9 7 │ sleep_rem 1.87541 0.1 1.5 6.6 8 │ sleep_cycle 0.439583 0.116667 0.333333 1.5 9 │ awake 13.5675 4.1 13.9 22.1 ⋯ 10 │ brainwt 0.281581 0.00014 0.0124 5.712 11 │ bodywt 166.136 0.005 1.67 6654.0 3 columns omitted
Group Operations using @groupby
and @combine
The @groupby
verb is the first step in the "split-apply-combine" workflow. We literally want to split the data frame by some variable (e.g. taxonomic order), apply a function to the individual data frames and then combine the output.
Let's do that: split the msleep
data frame by the taxonomic order, then ask for the same summary statistics as above. We expect a set of summary statistics for each taxonomic order.
julia> @chain msleep begin @groupby :order @combine begin :avg_sleep = mean(:sleep_total) :min_sleep = minimum(:sleep_total) :max_sleep = maximum(:sleep_total) :total = length(:sleep_total) end end
19×5 DataFrame Row │ order avg_sleep min_sleep max_sleep total │ String15 Float64 Float64 Float64 Int64 ─────┼───────────────────────────────────────────────────────── 1 │ Carnivora 10.1167 3.5 15.8 12 2 │ Primates 10.5 8.0 17.0 12 3 │ Rodentia 12.4682 7.0 16.6 22 4 │ Soricomorpha 11.1 8.4 14.9 5 5 │ Artiodactyla 4.51667 1.9 9.1 6 6 │ Pilosa 14.4 14.4 14.4 1 7 │ Cingulata 17.75 17.4 18.1 2 8 │ Hyracoidea 5.66667 5.3 6.3 3 ⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ 13 │ Erinaceomorpha 10.2 10.1 10.3 2 14 │ Cetacea 4.5 2.7 5.6 3 15 │ Lagomorpha 8.4 8.4 8.4 1 16 │ Diprotodontia 12.4 11.1 13.7 2 17 │ Monotremata 8.6 8.6 8.6 1 18 │ Afrosoricida 15.6 15.6 15.6 1 19 │ Scandentia 8.9 8.9 8.9 1 4 rows omitted
Split-apply-combine can also be used with @transform
to add new variables to a data frame by performing operations by group. For instance, we can de-mean the total hours of sleep of an animal relative to other animals in the same genus.
julia> @chain msleep begin @groupby :order @transform :sleep_genus = :sleep_total .- mean(:sleep_total) end
83×12 DataFrame Row │ name genus vore order cons ⋯ │ String31 String15 String7 String15 Stri ⋯ ─────┼────────────────────────────────────────────────────────────────────────── 1 │ Cheetah Acinonyx carni Carnivora lc ⋯ 2 │ Owl monkey Aotus omni Primates miss 3 │ Mountain beaver Aplodontia herbi Rodentia nt 4 │ Greater short-tailed shrew Blarina omni Soricomorpha lc 5 │ Cow Bos herbi Artiodactyla dome ⋯ 6 │ Three-toed sloth Bradypus herbi Pilosa miss 7 │ Northern fur seal Callorhinus carni Carnivora vu 8 │ Vesper mouse Calomys missing Rodentia miss ⋮ │ ⋮ ⋮ ⋮ ⋮ ⋱ 77 │ Brazilian tapir Tapirus herbi Perissodactyla vu ⋯ 78 │ Tenrec Tenrec omni Afrosoricida miss 79 │ Tree shrew Tupaia omni Scandentia miss 80 │ Bottle-nosed dolphin Tursiops carni Cetacea miss 81 │ Genet Genetta carni Carnivora miss ⋯ 82 │ Arctic fox Vulpes carni Carnivora miss 83 │ Red fox Vulpes carni Carnivora miss 8 columns and 68 rows omitted
This short tutorial only touches on the wide array of features in Julia, DataFrames.jl, and DataFramesMeta.jl. Read the full documentation for more information.