DataFramesMeta.jl Tutorial

This is a port of the HarvardX series PH525x Genomics class tutorial on dplyr. View the original here and the source here.

What is DataFramesMeta.jl?

DataFramesMeta.jl is a Julia package to transform and summarize tabular data. It uses Julia macros to create a domain-specific language for convenient syntax to work with data frames from DataFrames.jl. DataFramesMeta.jl mirrors concepts in DataFrames.jl as closely as possible, without implementing new features on it's own. For a deeper explanation of DataFramesMeta.jl, see the documentation.

DataFramesMeta.jl is heavily inspired by R's dplyr. If you are familiar with dplyr this guide should get you up to speed with DataFramesMeta.jl.

However this tutorial assumes a cursory knowledge of DataFrames.jl. For instance, you should be familiar with the concept of a symbol in Julia (:x), and how it is used to index a data frame in DataFrames.jl, such as with df[:, :x].

Why Is It Useful?

Like dplyr, the DataFramesMeta.jl package contains a set of macros (or "verbs") that perform common data manipulation operations such as filtering for rows, selecting specific columns, re-ordering rows, and adding new columns.

In addition, DataFramesMeta.jl makes it easier to summarize data with the @combine verb, which implements the split-apply-combine pattern commonly seen in dplyr and other data manipulation packages.

How Does It Compare To Using Base Functions in Julia and in DataFrames.jl?

If you are familiar with Julia, you are probably familiar with base Julia functions such map, and broadcast (akin to *apply in R). These functions are convenient to use, but are designed to work with arrays, not tabular data.

DataFrames.jl provides the functions select, transform, and more to work with data frames. Unlike map and broadcast, these functions are designed to work with tabular data, but have a complicated syntax.

DataFramesMeta.jl provides a convenient syntax for working with the vectors in a data frame so that you get all the convenience of Base Julia and DataFrames combined.

How Do I Get DataFramesMeta.jl?

To install DataFramesMeta.jl, which also installs DataFrames.jl:

import Pkg
Pkg.activate(; temp=true) # activate a temporary environment for this tutorial
Pkg.add("DataFramesMeta");

To load DataFramesMeta.jl, which also loads DataFrames.jl:

julia> using DataFramesMeta

For this tutorial, we will install some additional packages as well.

Pkg.add(["CSV", "HTTP"])

Now we load them. We also load the Statistics standard library, which is shipped with Julia, so does not need to be installed.

julia> using CSV, HTTP, Statistics

We will use CSV.jl and HTTP.jl for downloading our dataset from the internet.

Data: Mammals Sleep

The msleep (mammals sleep) data set contains the sleep times and weights for a set of mammals and is available in the dagdata repository on GitHub. This data set contains 83 rows and 11 variables.

We can load the data directly into a DataFrame from the url.

julia> url = "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/msleep_ggplot2.csv""https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/msleep_ggplot2.csv"
julia> msleep = CSV.read(HTTP.get(url).body, DataFrame; missingstring="NA")83×11 DataFrame
 Row │ name                        genus         vore      order           con ⋯
     │ String31                    String15      String7?  String15        Str ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ Cheetah                     Acinonyx      carni     Carnivora       lc  ⋯
   2 │ Owl monkey                  Aotus         omni      Primates        mis
   3 │ Mountain beaver             Aplodontia    herbi     Rodentia        nt
   4 │ Greater short-tailed shrew  Blarina       omni      Soricomorpha    lc
   5 │ Cow                         Bos           herbi     Artiodactyla    dom ⋯
   6 │ Three-toed sloth            Bradypus      herbi     Pilosa          mis
   7 │ Northern fur seal           Callorhinus   carni     Carnivora       vu
   8 │ Vesper mouse                Calomys       missing   Rodentia        mis
  ⋮  │             ⋮                    ⋮           ⋮            ⋮             ⋱
  77 │ Brazilian tapir             Tapirus       herbi     Perissodactyla  vu  ⋯
  78 │ Tenrec                      Tenrec        omni      Afrosoricida    mis
  79 │ Tree shrew                  Tupaia        omni      Scandentia      mis
  80 │ Bottle-nosed dolphin        Tursiops      carni     Cetacea         mis
  81 │ Genet                       Genetta       carni     Carnivora       mis ⋯
  82 │ Arctic fox                  Vulpes        carni     Carnivora       mis
  83 │ Red fox                     Vulpes        carni     Carnivora       mis
                                                   7 columns and 68 rows omitted

The columns (in order) correspond to the following:

column name	Description
`:name`	common name
`:genus`	taxonomic rank
`:vore`	carnivore, omnivore or herbivore?
`:order`	taxonomic rank
`:conservation`	the conservation status of the mammal
`:sleep_total`	total amount of sleep, in hours
`:sleep_rem`	rem sleep, in hours
`:sleep_cycle`	length of sleep cycle, in hours
`:awake`	amount of time spent awake, in hours
`:brainwt`	brain weight in kilograms
`:bodywt`	body weight in kilograms

Important DataFramesMeta.jl Verbs To Remember

Many DataFrames.jl macros come in two forms, a version which operates on columns as a whole and a version which operations row-wise, prefixed by r.

DataFramesMeta.jl macro	By-row version	Description	`dplyr` equivalent
`@select`	`@rselect`	select columns	`select`
`@transform`	`@rtransform`	create new columns	`mutate`
`@subset`	`@rsubset`	filter rows	`filter`
`@orderby`	`@rorderby`	re-order or arrange rows	`arrange`
`@combine`		summarise values	`summarize` (but `@combine` is more flexible)
`@groupby`		allows for group operations in the "split-apply-combine" concept	`group_by`

DataFramesMeta.jl Verbs In Action

Two of the most basic functions are @select and @subset, which selects columns and filters rows respectively. To reference columns, use the Symbol of the column name, i.e. :name refers to the column msleep.name.

Selecting Columns Using `@select`

Select a set of columns: the :name and the :sleep_total columns.

julia> @select msleep :name :sleep_total83×2 DataFrame
 Row │ name                        sleep_total
     │ String31                    Float64
─────┼─────────────────────────────────────────
   1 │ Cheetah                            12.1
   2 │ Owl monkey                         17.0
   3 │ Mountain beaver                    14.4
   4 │ Greater short-tailed shrew         14.9
   5 │ Cow                                 4.0
   6 │ Three-toed sloth                   14.4
   7 │ Northern fur seal                   8.7
   8 │ Vesper mouse                        7.0
  ⋮  │             ⋮                    ⋮
  77 │ Brazilian tapir                     4.4
  78 │ Tenrec                             15.6
  79 │ Tree shrew                          8.9
  80 │ Bottle-nosed dolphin                5.2
  81 │ Genet                               6.3
  82 │ Arctic fox                         12.5
  83 │ Red fox                             9.8
                                68 rows omitted

If you have a column name stored as a variable, you can select it as a column with the syntax $.

julia> varname = :sleep_total:sleep_total
julia> @select msleep :name $varname83×2 DataFrame
 Row │ name                        sleep_total
     │ String31                    Float64
─────┼─────────────────────────────────────────
   1 │ Cheetah                            12.1
   2 │ Owl monkey                         17.0
   3 │ Mountain beaver                    14.4
   4 │ Greater short-tailed shrew         14.9
   5 │ Cow                                 4.0
   6 │ Three-toed sloth                   14.4
   7 │ Northern fur seal                   8.7
   8 │ Vesper mouse                        7.0
  ⋮  │             ⋮                    ⋮
  77 │ Brazilian tapir                     4.4
  78 │ Tenrec                             15.6
  79 │ Tree shrew                          8.9
  80 │ Bottle-nosed dolphin                5.2
  81 │ Genet                               6.3
  82 │ Arctic fox                         12.5
  83 │ Red fox                             9.8
                                68 rows omitted

The $ sign has special meaning in DataFramesMeta.jl. We use it for any column reference which is not a symbol. Without it, DataFramesMeta.jl can't tell whether varname refers to the column :sleep_total.

You can also use $ to refer to columns with strings

julia> varname = "sleep_total""sleep_total"
julia> @select msleep :name $varname83×2 DataFrame
 Row │ name                        sleep_total
     │ String31                    Float64
─────┼─────────────────────────────────────────
   1 │ Cheetah                            12.1
   2 │ Owl monkey                         17.0
   3 │ Mountain beaver                    14.4
   4 │ Greater short-tailed shrew         14.9
   5 │ Cow                                 4.0
   6 │ Three-toed sloth                   14.4
   7 │ Northern fur seal                   8.7
   8 │ Vesper mouse                        7.0
  ⋮  │             ⋮                    ⋮
  77 │ Brazilian tapir                     4.4
  78 │ Tenrec                             15.6
  79 │ Tree shrew                          8.9
  80 │ Bottle-nosed dolphin                5.2
  81 │ Genet                               6.3
  82 │ Arctic fox                         12.5
  83 │ Red fox                             9.8
                                68 rows omitted

as well as vectors of variable names

julia> varnames = ["name", "sleep_total"]2-element Vector{String}:
 "name"
 "sleep_total"
julia> @select msleep $varnames83×2 DataFrame
 Row │ name                        sleep_total
     │ String31                    Float64
─────┼─────────────────────────────────────────
   1 │ Cheetah                            12.1
   2 │ Owl monkey                         17.0
   3 │ Mountain beaver                    14.4
   4 │ Greater short-tailed shrew         14.9
   5 │ Cow                                 4.0
   6 │ Three-toed sloth                   14.4
   7 │ Northern fur seal                   8.7
   8 │ Vesper mouse                        7.0
  ⋮  │             ⋮                    ⋮
  77 │ Brazilian tapir                     4.4
  78 │ Tenrec                             15.6
  79 │ Tree shrew                          8.9
  80 │ Bottle-nosed dolphin                5.2
  81 │ Genet                               6.3
  82 │ Arctic fox                         12.5
  83 │ Red fox                             9.8
                                68 rows omitted

Similarly, to select the first column, use the syntax $1.

julia> @select msleep $183×1 DataFrame
 Row │ name
     │ String31
─────┼────────────────────────────
   1 │ Cheetah
   2 │ Owl monkey
   3 │ Mountain beaver
   4 │ Greater short-tailed shrew
   5 │ Cow
   6 │ Three-toed sloth
   7 │ Northern fur seal
   8 │ Vesper mouse
  ⋮  │             ⋮
  77 │ Brazilian tapir
  78 │ Tenrec
  79 │ Tree shrew
  80 │ Bottle-nosed dolphin
  81 │ Genet
  82 │ Arctic fox
  83 │ Red fox
                   68 rows omitted

To select all the columns except a specific column, use the Not function for inverse selection.

julia> @select msleep Not(:name)83×10 DataFrame
 Row │ genus         vore      order           conservation  sleep_total  slee ⋯
     │ String15      String7?  String15        String15?     Float64      Floa ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ Acinonyx      carni     Carnivora       lc                   12.1  miss ⋯
   2 │ Aotus         omni      Primates        missing              17.0
   3 │ Aplodontia    herbi     Rodentia        nt                   14.4
   4 │ Blarina       omni      Soricomorpha    lc                   14.9
   5 │ Bos           herbi     Artiodactyla    domesticated          4.0       ⋯
   6 │ Bradypus      herbi     Pilosa          missing              14.4
   7 │ Callorhinus   carni     Carnivora       vu                    8.7
   8 │ Calomys       missing   Rodentia        missing               7.0  miss
  ⋮  │      ⋮           ⋮            ⋮              ⋮             ⋮            ⋱
  77 │ Tapirus       herbi     Perissodactyla  vu                    4.4       ⋯
  78 │ Tenrec        omni      Afrosoricida    missing              15.6
  79 │ Tupaia        omni      Scandentia      missing               8.9
  80 │ Tursiops      carni     Cetacea         missing               5.2  miss
  81 │ Genetta       carni     Carnivora       missing               6.3       ⋯
  82 │ Vulpes        carni     Carnivora       missing              12.5  miss
  83 │ Vulpes        carni     Carnivora       missing               9.8
                                                   5 columns and 68 rows omitted

To select a range of columns by name, use the Between operator:

julia> @select msleep Between(:name, :order)83×4 DataFrame
 Row │ name                        genus         vore      order
     │ String31                    String15      String7?  String15
─────┼────────────────────────────────────────────────────────────────────
   1 │ Cheetah                     Acinonyx      carni     Carnivora
   2 │ Owl monkey                  Aotus         omni      Primates
   3 │ Mountain beaver             Aplodontia    herbi     Rodentia
   4 │ Greater short-tailed shrew  Blarina       omni      Soricomorpha
   5 │ Cow                         Bos           herbi     Artiodactyla
   6 │ Three-toed sloth            Bradypus      herbi     Pilosa
   7 │ Northern fur seal           Callorhinus   carni     Carnivora
   8 │ Vesper mouse                Calomys       missing   Rodentia
  ⋮  │             ⋮                    ⋮           ⋮            ⋮
  77 │ Brazilian tapir             Tapirus       herbi     Perissodactyla
  78 │ Tenrec                      Tenrec        omni      Afrosoricida
  79 │ Tree shrew                  Tupaia        omni      Scandentia
  80 │ Bottle-nosed dolphin        Tursiops      carni     Cetacea
  81 │ Genet                       Genetta       carni     Carnivora
  82 │ Arctic fox                  Vulpes        carni     Carnivora
  83 │ Red fox                     Vulpes        carni     Carnivora
                                                           68 rows omitted

To select all columns that start with the character string "sl" use regular expressions in conjunction with Cols.

julia> @select msleep Cols(r"^sl")83×3 DataFrame
 Row │ sleep_total  sleep_rem  sleep_cycle
     │ Float64      Float64?   Float64?
─────┼────────────────────────────────────────
   1 │        12.1  missing    missing
   2 │        17.0        1.8  missing
   3 │        14.4        2.4  missing
   4 │        14.9        2.3        0.133333
   5 │         4.0        0.7        0.666667
   6 │        14.4        2.2        0.766667
   7 │         8.7        1.4        0.383333
   8 │         7.0  missing    missing
  ⋮  │      ⋮           ⋮            ⋮
  77 │         4.4        1.0        0.9
  78 │        15.6        2.3  missing
  79 │         8.9        2.6        0.233333
  80 │         5.2  missing    missing
  81 │         6.3        1.3  missing
  82 │        12.5  missing    missing
  83 │         9.8        2.4        0.35
                               68 rows omitted

Regular expressions are powerful, but can be difficult for new users to understand. Here are some quick tips.

r"^abc" = Starts with "abc"
r"abc$" = Ends with "abc"
r"abc" = Contains "abc" anywhere.

Selecting Rows Using `@subset` and `@rsubset`

Filter the rows for mammals that sleep a total of more than 16 hours.

julia> @subset msleep :sleep_total .>= 168×11 DataFrame
 Row │ name                    genus         vore      order            conser ⋯
     │ String31                String15      String7?  String15         String ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ Owl monkey              Aotus         omni      Primates         missin ⋯
   2 │ Long-nosed armadillo    Dasypus       carni     Cingulata        lc
   3 │ North American Opossum  Didelphis     omni      Didelphimorphia  lc
   4 │ Big brown bat           Eptesicus     insecti   Chiroptera       lc
   5 │ Thick-tailed opposum    Lutreolina    carni     Didelphimorphia  lc     ⋯
   6 │ Little brown bat        Myotis        insecti   Chiroptera       missin
   7 │ Giant armadillo         Priodontes    insecti   Cingulata        en
   8 │ Arctic ground squirrel  Spermophilus  herbi     Rodentia         lc
                                                               7 columns omitted

In the above expression, the .>= means we "broadcast" the >= comparison across the whole column. We can use a simpler syntax, @rsubset which automatically broadcasts all operations.

julia> @rsubset msleep :sleep_total > 168×11 DataFrame
 Row │ name                    genus         vore      order            conser ⋯
     │ String31                String15      String7?  String15         String ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ Owl monkey              Aotus         omni      Primates         missin ⋯
   2 │ Long-nosed armadillo    Dasypus       carni     Cingulata        lc
   3 │ North American Opossum  Didelphis     omni      Didelphimorphia  lc
   4 │ Big brown bat           Eptesicus     insecti   Chiroptera       lc
   5 │ Thick-tailed opposum    Lutreolina    carni     Didelphimorphia  lc     ⋯
   6 │ Little brown bat        Myotis        insecti   Chiroptera       missin
   7 │ Giant armadillo         Priodontes    insecti   Cingulata        en
   8 │ Arctic ground squirrel  Spermophilus  herbi     Rodentia         lc
                                                               7 columns omitted

Subset the rows for mammals that sleep a total of more than 16 hours and have a body weight of greater than 1 kilogram. For this we put multiple operations on separate lines in a single block.

julia> @rsubset msleep begin
           :sleep_total >= 16
           :bodywt >= 1
       end3×11 DataFrame
 Row │ name                    genus       vore      order            conserva ⋯
     │ String31                String15    String7?  String15         String15 ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ Long-nosed armadillo    Dasypus     carni     Cingulata        lc       ⋯
   2 │ North American Opossum  Didelphis   omni      Didelphimorphia  lc
   3 │ Giant armadillo         Priodontes  insecti   Cingulata        en
                                                               7 columns omitted

If you are coming from dplyr, you can also write the above command in a way that looks more familiar.

julia> @rsubset(msleep, :sleep_total >= 16, :bodywt >= 1)3×11 DataFrame
 Row │ name                    genus       vore      order            conserva ⋯
     │ String31                String15    String7?  String15         String15 ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ Long-nosed armadillo    Dasypus     carni     Cingulata        lc       ⋯
   2 │ North American Opossum  Didelphis   omni      Didelphimorphia  lc
   3 │ Giant armadillo         Priodontes  insecti   Cingulata        en
                                                               7 columns omitted

Filter the rows for mammals in the Perissodactyla and Primates taxonomic order. We wrap code in a let block to ensure things are fast.

julia> let
           relevant_orders = Set(["Perissodactyla", "Primates"])
           @rsubset msleep :order in relevant_orders
       end15×11 DataFrame
 Row │ name             genus          vore      order           conservation  ⋯
     │ String31         String15       String7?  String15        String15?     ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ Owl monkey       Aotus          omni      Primates        missing       ⋯
   2 │ Grivet           Cercopithecus  omni      Primates        lc
   3 │ Horse            Equus          herbi     Perissodactyla  domesticated
   4 │ Donkey           Equus          herbi     Perissodactyla  domesticated
   5 │ Patas monkey     Erythrocebus   omni      Primates        lc            ⋯
   6 │ Galago           Galago         omni      Primates        missing
   7 │ Human            Homo           omni      Primates        missing
   8 │ Mongoose lemur   Lemur          herbi     Primates        vu
   9 │ Macaque          Macaca         omni      Primates        missing       ⋯
  10 │ Slow loris       Nyctibeus      carni     Primates        missing
  11 │ Chimpanzee       Pan            omni      Primates        missing
  12 │ Baboon           Papio          omni      Primates        missing
  13 │ Potto            Perodicticus   omni      Primates        lc            ⋯
  14 │ Squirrel monkey  Saimiri        omni      Primates        missing
  15 │ Brazilian tapir  Tapirus        herbi     Perissodactyla  vu
                                                               6 columns omitted

You can use the boolean operators (e.g. >, <, >=, <=, !=, in) to create the logical tests.

Chain.jl

Before we go any further, let's introduce the @chain macro from Chain.jl, which is re-exported by DataFramesMeta.jl. @chain allows for you to pipe the output from one operation into the input of another operation. The idea of piping is to read the functions from left to right. The syntax and design of @chain is very similar to %>% which users of dplyr are familiar with.

To show it's usefulness, let's use @select and @rsubset one after the other.

msleep_1 = @select msleep :name :sleep_total
msleep_2 = @rsubset msleep_1 :sleep_total > 16

Now in this case, we will pipe the msleep data frame to the function that will select two columns (:name and :sleep_total) and then pipe the new data frame to the @rsubset operation. This method involves awkwardly creating and naming temporary data frames. We can avoid this with @chain.

julia> @chain msleep begin
           @select :name :sleep_total
           @rsubset :sleep_total > 16
       end8×2 DataFrame
 Row │ name                    sleep_total
     │ String31                Float64
─────┼─────────────────────────────────────
   1 │ Owl monkey                     17.0
   2 │ Long-nosed armadillo           17.4
   3 │ North American Opossum         18.0
   4 │ Big brown bat                  19.7
   5 │ Thick-tailed opposum           19.4
   6 │ Little brown bat               19.9
   7 │ Giant armadillo                18.1
   8 │ Arctic ground squirrel         16.6

You will soon see how useful the @chain macro is when we start to combine many functions.

Back To dplyr Verbs In Action

Now that you know about the @chain macro, we will use it throughout the rest of this tutorial.

Arrange Or Re-order Rows Using `@orderby`

To arrange (or re-order) rows by a particular column, such as the taxonomic order, list the name of the column you want to arrange the rows by:

julia> @orderby msleep :order83×11 DataFrame
 Row │ name                            genus         vore      order         c ⋯
     │ String31                        String15      String7?  String15      S ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ Tenrec                          Tenrec        omni      Afrosoricida  m ⋯
   2 │ Cow                             Bos           herbi     Artiodactyla  d
   3 │ Roe deer                        Capreolus     herbi     Artiodactyla  l
   4 │ Goat                            Capri         herbi     Artiodactyla  l
   5 │ Giraffe                         Giraffa       herbi     Artiodactyla  c ⋯
   6 │ Sheep                           Ovis          herbi     Artiodactyla  d
   7 │ Pig                             Sus           omni      Artiodactyla  d
   8 │ Cheetah                         Acinonyx      carni     Carnivora     l
  ⋮  │               ⋮                      ⋮           ⋮           ⋮          ⋱
  77 │ Eastern american chipmunk       Tamias        herbi     Rodentia      m ⋯
  78 │ Tree shrew                      Tupaia        omni      Scandentia    m
  79 │ Greater short-tailed shrew      Blarina       omni      Soricomorpha  l
  80 │ Star-nosed mole                 Condylura     omni      Soricomorpha  l
  81 │ Lesser short-tailed shrew       Cryptotis     omni      Soricomorpha  l ⋯
  82 │ Eastern american mole           Scalopus      insecti   Soricomorpha  l
  83 │ Musk shrew                      Suncus        missing   Soricomorpha  m
                                                   7 columns and 68 rows omitted

Now we will select three columns from msleep, arrange the rows by the taxonomic order and then arrange the rows by :sleep_total. Finally, keep the first 10 rows of the data frame.

julia> @chain msleep begin
           @select :name :order :sleep_total
           @orderby :order :sleep_total
           first(10)
       end10×3 DataFrame
 Row │ name          order         sleep_total
     │ String31      String15      Float64
─────┼─────────────────────────────────────────
   1 │ Tenrec        Afrosoricida         15.6
   2 │ Giraffe       Artiodactyla          1.9
   3 │ Roe deer      Artiodactyla          3.0
   4 │ Sheep         Artiodactyla          3.8
   5 │ Cow           Artiodactyla          4.0
   6 │ Goat          Artiodactyla          5.3
   7 │ Pig           Artiodactyla          9.1
   8 │ Caspian seal  Carnivora             3.5
   9 │ Gray seal     Carnivora             6.2
  10 │ Genet         Carnivora             6.3

The last line of the above block, first(10), does not have @. This is because first is a Julia function, not a macro, whose names always begin with @.

Same as above, except here we filter the rows for mammals that sleep for 16 or more hours, instead of showing the head of the final data frame:

julia> @chain msleep begin
           @select :name :order :sleep_total
           @orderby :order :sleep_total
           @rsubset :sleep_total > 16
       end8×3 DataFrame
 Row │ name                    order            sleep_total
     │ String31                String15         Float64
─────┼──────────────────────────────────────────────────────
   1 │ Big brown bat           Chiroptera              19.7
   2 │ Little brown bat        Chiroptera              19.9
   3 │ Long-nosed armadillo    Cingulata               17.4
   4 │ Giant armadillo         Cingulata               18.1
   5 │ North American Opossum  Didelphimorphia         18.0
   6 │ Thick-tailed opposum    Didelphimorphia         19.4
   7 │ Owl monkey              Primates                17.0
   8 │ Arctic ground squirrel  Rodentia                16.6

Something slightly more complicated: same as above, except arrange the rows in the :sleep_total column in a descending order. Since this column is numeric we can do it by negating it:

julia> @chain msleep begin
           @select :name :order :sleep_total
           @orderby begin
               :order
               -:sleep_total
           end
           @rsubset :sleep_total >= 16
       end8×3 DataFrame
 Row │ name                    order            sleep_total
     │ String31                String15         Float64
─────┼──────────────────────────────────────────────────────
   1 │ Little brown bat        Chiroptera              19.9
   2 │ Big brown bat           Chiroptera              19.7
   3 │ Giant armadillo         Cingulata               18.1
   4 │ Long-nosed armadillo    Cingulata               17.4
   5 │ Thick-tailed opposum    Didelphimorphia         19.4
   6 │ North American Opossum  Didelphimorphia         18.0
   7 │ Owl monkey              Primates                17.0
   8 │ Arctic ground squirrel  Rodentia                16.6

Alternatively, using the StatsBase.jl package you could have written ordinalrank(:sleep_total, rev=true). Finally, if you wanted to avoid using an additional package you could use the following expression invperm(sortperm(:sleep_total, rev=true)).

Here is a minimal example of sorting string column in reverse:

julia> using StatsBase
julia> df = DataFrame(group=[1, 2, 1, 2, 1], name = ["Bob", "Dexter", "Alice", "Eve", "Cedric"])5×2 DataFrame
 Row │ group  name
     │ Int64  String
─────┼───────────────
   1 │     1  Bob
   2 │     2  Dexter
   3 │     1  Alice
   4 │     2  Eve
   5 │     1  Cedric
julia> @orderby df begin
           :group
           ordinalrank(:name, rev=true)
       end5×2 DataFrame
 Row │ group  name
     │ Int64  String
─────┼───────────────
   1 │     1  Cedric
   2 │     1  Bob
   3 │     1  Alice
   4 │     2  Eve
   5 │     2  Dexter

Create New Columns Using `@transform` and `@rtransform`

The @transform macro will add new columns to the data frame. Like with other macros, use @rtransform to operate row-wise. Create a new column called :rem_proportion, which is the ratio of rem sleep to total amount of sleep.

julia> @rtransform msleep :rem_proportion = :sleep_rem / :sleep_total83×12 DataFrame
 Row │ name                        genus         vore      order           con ⋯
     │ String31                    String15      String7?  String15        Str ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ Cheetah                     Acinonyx      carni     Carnivora       lc  ⋯
   2 │ Owl monkey                  Aotus         omni      Primates        mis
   3 │ Mountain beaver             Aplodontia    herbi     Rodentia        nt
   4 │ Greater short-tailed shrew  Blarina       omni      Soricomorpha    lc
   5 │ Cow                         Bos           herbi     Artiodactyla    dom ⋯
   6 │ Three-toed sloth            Bradypus      herbi     Pilosa          mis
   7 │ Northern fur seal           Callorhinus   carni     Carnivora       vu
   8 │ Vesper mouse                Calomys       missing   Rodentia        mis
  ⋮  │             ⋮                    ⋮           ⋮            ⋮             ⋱
  77 │ Brazilian tapir             Tapirus       herbi     Perissodactyla  vu  ⋯
  78 │ Tenrec                      Tenrec        omni      Afrosoricida    mis
  79 │ Tree shrew                  Tupaia        omni      Scandentia      mis
  80 │ Bottle-nosed dolphin        Tursiops      carni     Cetacea         mis
  81 │ Genet                       Genetta       carni     Carnivora       mis ⋯
  82 │ Arctic fox                  Vulpes        carni     Carnivora       mis
  83 │ Red fox                     Vulpes        carni     Carnivora       mis
                                                   8 columns and 68 rows omitted

You can add many columns at a time by placing multiple operations in a block.

julia> @rtransform msleep begin
           :rem_proportion = :sleep_rem / :sleep_total
           :bodywt_grams = :bodywt * 1000
       end83×13 DataFrame
 Row │ name                        genus         vore      order           con ⋯
     │ String31                    String15      String7?  String15        Str ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ Cheetah                     Acinonyx      carni     Carnivora       lc  ⋯
   2 │ Owl monkey                  Aotus         omni      Primates        mis
   3 │ Mountain beaver             Aplodontia    herbi     Rodentia        nt
   4 │ Greater short-tailed shrew  Blarina       omni      Soricomorpha    lc
   5 │ Cow                         Bos           herbi     Artiodactyla    dom ⋯
   6 │ Three-toed sloth            Bradypus      herbi     Pilosa          mis
   7 │ Northern fur seal           Callorhinus   carni     Carnivora       vu
   8 │ Vesper mouse                Calomys       missing   Rodentia        mis
  ⋮  │             ⋮                    ⋮           ⋮            ⋮             ⋱
  77 │ Brazilian tapir             Tapirus       herbi     Perissodactyla  vu  ⋯
  78 │ Tenrec                      Tenrec        omni      Afrosoricida    mis
  79 │ Tree shrew                  Tupaia        omni      Scandentia      mis
  80 │ Bottle-nosed dolphin        Tursiops      carni     Cetacea         mis
  81 │ Genet                       Genetta       carni     Carnivora       mis ⋯
  82 │ Arctic fox                  Vulpes        carni     Carnivora       mis
  83 │ Red fox                     Vulpes        carni     Carnivora       mis
                                                   9 columns and 68 rows omitted

Using @transform instead of @rtransform will let us work with the column as a whole, and not a single row at a time. Let's create a new variable showing how far an animal's sleep time is from the average of all animals.

julia> @transform msleep :demeand_sleep = :sleep_total .- mean(:sleep_total)83×12 DataFrame
 Row │ name                        genus         vore      order           con ⋯
     │ String31                    String15      String7?  String15        Str ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ Cheetah                     Acinonyx      carni     Carnivora       lc  ⋯
   2 │ Owl monkey                  Aotus         omni      Primates        mis
   3 │ Mountain beaver             Aplodontia    herbi     Rodentia        nt
   4 │ Greater short-tailed shrew  Blarina       omni      Soricomorpha    lc
   5 │ Cow                         Bos           herbi     Artiodactyla    dom ⋯
   6 │ Three-toed sloth            Bradypus      herbi     Pilosa          mis
   7 │ Northern fur seal           Callorhinus   carni     Carnivora       vu
   8 │ Vesper mouse                Calomys       missing   Rodentia        mis
  ⋮  │             ⋮                    ⋮           ⋮            ⋮             ⋱
  77 │ Brazilian tapir             Tapirus       herbi     Perissodactyla  vu  ⋯
  78 │ Tenrec                      Tenrec        omni      Afrosoricida    mis
  79 │ Tree shrew                  Tupaia        omni      Scandentia      mis
  80 │ Bottle-nosed dolphin        Tursiops      carni     Cetacea         mis
  81 │ Genet                       Genetta       carni     Carnivora       mis ⋯
  82 │ Arctic fox                  Vulpes        carni     Carnivora       mis
  83 │ Red fox                     Vulpes        carni     Carnivora       mis
                                                   8 columns and 68 rows omitted

Finally, note that you can create a new column with the name taken from an existing variable, or a new column name with spaces in it, with $

julia> newname = :rem_proportion:rem_proportion
julia> @rtransform msleep begin
           $newname = :sleep_rem / :sleep_total
           $"Body weight in grams" = :bodywt * 1000
       end83×13 DataFrame
 Row │ name                        genus         vore      order           con ⋯
     │ String31                    String15      String7?  String15        Str ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ Cheetah                     Acinonyx      carni     Carnivora       lc  ⋯
   2 │ Owl monkey                  Aotus         omni      Primates        mis
   3 │ Mountain beaver             Aplodontia    herbi     Rodentia        nt
   4 │ Greater short-tailed shrew  Blarina       omni      Soricomorpha    lc
   5 │ Cow                         Bos           herbi     Artiodactyla    dom ⋯
   6 │ Three-toed sloth            Bradypus      herbi     Pilosa          mis
   7 │ Northern fur seal           Callorhinus   carni     Carnivora       vu
   8 │ Vesper mouse                Calomys       missing   Rodentia        mis
  ⋮  │             ⋮                    ⋮           ⋮            ⋮             ⋱
  77 │ Brazilian tapir             Tapirus       herbi     Perissodactyla  vu  ⋯
  78 │ Tenrec                      Tenrec        omni      Afrosoricida    mis
  79 │ Tree shrew                  Tupaia        omni      Scandentia      mis
  80 │ Bottle-nosed dolphin        Tursiops      carni     Cetacea         mis
  81 │ Genet                       Genetta       carni     Carnivora       mis ⋯
  82 │ Arctic fox                  Vulpes        carni     Carnivora       mis
  83 │ Red fox                     Vulpes        carni     Carnivora       mis
                                                   9 columns and 68 rows omitted

Create Summaries of the Data Frame using `@combine`

The @combine macro will create summary statistics for a given column in the data frame, such as finding the mean. For example, to compute the average number of hours of sleep, apply the mean function to the column :sleep_total and call the summary value :avg_sleep.

julia> @chain msleep @combine :avg_sleep = mean(:sleep_total)1×1 DataFrame
 Row │ avg_sleep
     │ Float64
─────┼───────────
   1 │   10.4337

There are many other summary statistics you could consider such std, minimum, maximum, median, sum, length (returns the length of vector), first (returns first value in vector), and last (returns last value in vector).

julia> @combine msleep begin
           :avg_sleep = mean(:sleep_total)
           :min_sleep = minimum(:sleep_total)
           :max_sleep = maximum(:sleep_total)
           :total = length(:sleep_total)
       end1×4 DataFrame
 Row │ avg_sleep  min_sleep  max_sleep  total
     │ Float64    Float64    Float64    Int64
─────┼────────────────────────────────────────
   1 │   10.4337        1.9       19.9     83

DataFrames.jl also provides the function describe which performs many of these summaries automatically.

julia> describe(msleep)11×7 DataFrame
 Row │ variable      mean      min               median    max                 ⋯
     │ Symbol        Union…    Any               Union…    Any                 ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ name                    African elephant            Western american ch ⋯
   2 │ genus                   Acinonyx                    Vulpes
   3 │ vore                    carni                       omni
   4 │ order                   Afrosoricida                Soricomorpha
   5 │ conservation            cd                          vu                  ⋯
   6 │ sleep_total   10.4337   1.9               10.1      19.9
   7 │ sleep_rem     1.87541   0.1               1.5       6.6
   8 │ sleep_cycle   0.439583  0.116667          0.333333  1.5
   9 │ awake         13.5675   4.1               13.9      22.1                ⋯
  10 │ brainwt       0.281581  0.00014           0.0124    5.712
  11 │ bodywt        166.136   0.005             1.67      6654.0
                                                               3 columns omitted

Group Operations using `@groupby` and `@combine`

The @groupby verb is the first step in the "split-apply-combine" workflow. We literally want to split the data frame by some variable (e.g. taxonomic order), apply a function to the individual data frames and then combine the output.

Let's do that: split the msleep data frame by the taxonomic order, then ask for the same summary statistics as above. We expect a set of summary statistics for each taxonomic order.

julia> @chain msleep begin
           @groupby :order
           @combine begin
               :avg_sleep = mean(:sleep_total)
               :min_sleep = minimum(:sleep_total)
               :max_sleep = maximum(:sleep_total)
               :total = length(:sleep_total)
           end
       end19×5 DataFrame
 Row │ order            avg_sleep  min_sleep  max_sleep  total
     │ String15         Float64    Float64    Float64    Int64
─────┼─────────────────────────────────────────────────────────
   1 │ Carnivora         10.1167         3.5       15.8     12
   2 │ Primates          10.5            8.0       17.0     12
   3 │ Rodentia          12.4682         7.0       16.6     22
   4 │ Soricomorpha      11.1            8.4       14.9      5
   5 │ Artiodactyla       4.51667        1.9        9.1      6
   6 │ Pilosa            14.4           14.4       14.4      1
   7 │ Cingulata         17.75          17.4       18.1      2
   8 │ Hyracoidea         5.66667        5.3        6.3      3
  ⋮  │        ⋮             ⋮          ⋮          ⋮        ⋮
  13 │ Erinaceomorpha    10.2           10.1       10.3      2
  14 │ Cetacea            4.5            2.7        5.6      3
  15 │ Lagomorpha         8.4            8.4        8.4      1
  16 │ Diprotodontia     12.4           11.1       13.7      2
  17 │ Monotremata        8.6            8.6        8.6      1
  18 │ Afrosoricida      15.6           15.6       15.6      1
  19 │ Scandentia         8.9            8.9        8.9      1
                                                 4 rows omitted

Split-apply-combine can also be used with @transform to add new variables to a data frame by performing operations by group. For instance, we can de-mean the total hours of sleep of an animal relative to other animals in the same genus.

julia> @chain msleep begin
           @groupby :order
           @transform :sleep_genus = :sleep_total .- mean(:sleep_total)
       end83×12 DataFrame
 Row │ name                        genus         vore      order           con ⋯
     │ String31                    String15      String7?  String15        Str ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ Cheetah                     Acinonyx      carni     Carnivora       lc  ⋯
   2 │ Owl monkey                  Aotus         omni      Primates        mis
   3 │ Mountain beaver             Aplodontia    herbi     Rodentia        nt
   4 │ Greater short-tailed shrew  Blarina       omni      Soricomorpha    lc
   5 │ Cow                         Bos           herbi     Artiodactyla    dom ⋯
   6 │ Three-toed sloth            Bradypus      herbi     Pilosa          mis
   7 │ Northern fur seal           Callorhinus   carni     Carnivora       vu
   8 │ Vesper mouse                Calomys       missing   Rodentia        mis
  ⋮  │             ⋮                    ⋮           ⋮            ⋮             ⋱
  77 │ Brazilian tapir             Tapirus       herbi     Perissodactyla  vu  ⋯
  78 │ Tenrec                      Tenrec        omni      Afrosoricida    mis
  79 │ Tree shrew                  Tupaia        omni      Scandentia      mis
  80 │ Bottle-nosed dolphin        Tursiops      carni     Cetacea         mis
  81 │ Genet                       Genetta       carni     Carnivora       mis ⋯
  82 │ Arctic fox                  Vulpes        carni     Carnivora       mis
  83 │ Red fox                     Vulpes        carni     Carnivora       mis
                                                   8 columns and 68 rows omitted

This short tutorial only touches on the wide array of features in Julia, DataFrames.jl, and DataFramesMeta.jl. Read the full documentation for more information.