DataFramesMeta.jl Tutorial

This is a port of the HarvardX series PH525x Genomics class tutorial on dplyr. View the original here and the source here.

What is DataFramesMeta.jl?

DataFramesMeta.jl is a Julia package to transform and summarize tabular data. It uses Julia macros to create a domain-specific language for convenient syntax to work with data frames from DataFrames.jl. DataFramesMeta.jl mirrors concepts in DataFrames.jl as closely as possible, without implementing new features on it's own. For a deeper explanation of DataFramesMeta.jl, see the documentation.

DataFramesMeta.jl is heavily inspired by R's dplyr. If you are familiar with dplyr this guide should get you up to speed with DataFramesMeta.jl.

However this tutorial assumes a cursory knowledge of DataFrames.jl. For instance, you should be familiar with the concept of a symbol in Julia (:x), and how it is used to index a data frame in DataFrames.jl, such as with df[:, :x].

Why Is It Useful?

Like dplyr, the DataFramesMeta.jl package contains a set of macros (or "verbs") that perform common data manipulation operations such as filtering for rows, selecting specific columns, re-ordering rows, and adding new columns.

In addition, DataFramesMeta.jl makes it easier to summarize data with the @combine verb, which implements the split-apply-combine pattern commonly seen in dplyr and other data manipulation packages.

How Does It Compare To Using Base Functions in Julia and in DataFrames.jl?

If you are familiar with Julia, you are probably familiar with base Julia functions such map, and broadcast (akin to *apply in R). These functions are convenient to use, but are designed to work with arrays, not tabular data.

DataFrames.jl provides the functions select, transform, and more to work with data frames. Unlike map and broadcast, these functions are designed to work with tabular data, but have a complicated syntax.

DataFramesMeta.jl provides a convenient syntax for working with the vectors in a data frame so that you get all the convenience of Base Julia and DataFrames combined.

How Do I Get DataFramesMeta.jl?

First, we will install DataFramesMeta.jl, which also installs DataFrames.jl. We will also install some additional packages

julia> ] # press ] to drop into pkg-mode

pkg> activate --temp # activate a temporary environment for this tutorial

pkg> add DataFramesMeta

pkg> add CSV HTTP

To load DataFramesMeta.jl, which also loads DataFrames.jl:

julia> using DataFramesMeta

Now we load them. We also load the Statistics standard library, which is shipped with Julia, so does not need to be installed.

julia> using CSV, HTTP, Statistics

We will use CSV.jl and HTTP.jl for downloading our dataset from the internet.

Data: Mammals Sleep

The msleep (mammals sleep) data set contains the sleep times and weights for a set of mammals and is available in the dagdata repository on GitHub. This data set contains 83 rows and 11 variables.

We can load the data directly into a DataFrame from the url.

julia> url = "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/msleep_ggplot2.csv";
julia> msleep = CSV.read(HTTP.get(url).body, DataFrame; missingstring="NA")83×11 DataFrame Row name genus vore order cons String31 String15 String7? String15 Stri ⋯ ─────┼────────────────────────────────────────────────────────────────────────── 1 │ Cheetah Acinonyx carni Carnivora lc ⋯ 2 │ Owl monkey Aotus omni Primates miss 3 │ Mountain beaver Aplodontia herbi Rodentia nt 4 │ Greater short-tailed shrew Blarina omni Soricomorpha lc 5 │ Cow Bos herbi Artiodactyla dome ⋯ 6 │ Three-toed sloth Bradypus herbi Pilosa miss 7 │ Northern fur seal Callorhinus carni Carnivora vu 8 │ Vesper mouse Calomys missing Rodentia miss ⋮ │ ⋮ ⋮ ⋮ ⋮ ⋱ 77 │ Brazilian tapir Tapirus herbi Perissodactyla vu ⋯ 78 │ Tenrec Tenrec omni Afrosoricida miss 79 │ Tree shrew Tupaia omni Scandentia miss 80 │ Bottle-nosed dolphin Tursiops carni Cetacea miss 81 │ Genet Genetta carni Carnivora miss ⋯ 82 │ Arctic fox Vulpes carni Carnivora miss 83 │ Red fox Vulpes carni Carnivora miss 7 columns and 68 rows omitted

The columns (in order) correspond to the following:

column nameDescription
:namecommon name
:genustaxonomic rank
:vorecarnivore, omnivore or herbivore?
:ordertaxonomic rank
:conservationthe conservation status of the mammal
:sleep_totaltotal amount of sleep, in hours
:sleep_remrem sleep, in hours
:sleep_cyclelength of sleep cycle, in hours
:awakeamount of time spent awake, in hours
:brainwtbrain weight in kilograms
:bodywtbody weight in kilograms

Important DataFramesMeta.jl Verbs To Remember

Many DataFrames.jl macros come in two forms, a version which operates on columns as a whole and a version which operations row-wise, prefixed by r.

DataFramesMeta.jl macroBy-row versionDescriptiondplyr equivalent
@select@rselectselect columnsselect
@transform@rtransformcreate new columnsmutate
@subset@rsubsetfilter rowsfilter
@orderby@rorderbyre-order or arrange rowsarrange
@combinesummarise valuessummarize (but @combine is more flexible)
@groupbyallows for group operations in the "split-apply-combine" conceptgroup_by

DataFramesMeta.jl Verbs In Action

Two of the most basic functions are @select and @subset, which selects columns and filters rows respectively. To reference columns, use the Symbol of the column name, i.e. :name refers to the column msleep.name.

Selecting Columns Using @select

Select a set of columns: the :name and the :sleep_total columns.

julia> @select msleep :name :sleep_total83×2 DataFrame
 Row  name                        sleep_total 
      String31                    Float64     
─────┼─────────────────────────────────────────
   1 │ Cheetah                            12.1
   2 │ Owl monkey                         17.0
   3 │ Mountain beaver                    14.4
   4 │ Greater short-tailed shrew         14.9
   5 │ Cow                                 4.0
   6 │ Three-toed sloth                   14.4
   7 │ Northern fur seal                   8.7
   8 │ Vesper mouse                        7.0
  ⋮  │             ⋮                    ⋮
  77 │ Brazilian tapir                     4.4
  78 │ Tenrec                             15.6
  79 │ Tree shrew                          8.9
  80 │ Bottle-nosed dolphin                5.2
  81 │ Genet                               6.3
  82 │ Arctic fox                         12.5
  83 │ Red fox                             9.8
                                68 rows omitted

If you have a column name stored as a variable, you can select it as a column with the syntax $.

julia> varname = :sleep_total:sleep_total
julia> @select msleep :name $varname83×2 DataFrame Row name sleep_total String31 Float64 ─────┼───────────────────────────────────────── 1 │ Cheetah 12.1 2 │ Owl monkey 17.0 3 │ Mountain beaver 14.4 4 │ Greater short-tailed shrew 14.9 5 │ Cow 4.0 6 │ Three-toed sloth 14.4 7 │ Northern fur seal 8.7 8 │ Vesper mouse 7.0 ⋮ │ ⋮ ⋮ 77 │ Brazilian tapir 4.4 78 │ Tenrec 15.6 79 │ Tree shrew 8.9 80 │ Bottle-nosed dolphin 5.2 81 │ Genet 6.3 82 │ Arctic fox 12.5 83 │ Red fox 9.8 68 rows omitted

The $ sign has special meaning in DataFramesMeta.jl. We use it for any column reference which is not a symbol. Without it, DataFramesMeta.jl can't tell whether varname refers to the column :sleep_total.

You can also use $ to refer to columns with strings

julia> varname = "sleep_total""sleep_total"
julia> @select msleep :name $varname83×2 DataFrame Row name sleep_total String31 Float64 ─────┼───────────────────────────────────────── 1 │ Cheetah 12.1 2 │ Owl monkey 17.0 3 │ Mountain beaver 14.4 4 │ Greater short-tailed shrew 14.9 5 │ Cow 4.0 6 │ Three-toed sloth 14.4 7 │ Northern fur seal 8.7 8 │ Vesper mouse 7.0 ⋮ │ ⋮ ⋮ 77 │ Brazilian tapir 4.4 78 │ Tenrec 15.6 79 │ Tree shrew 8.9 80 │ Bottle-nosed dolphin 5.2 81 │ Genet 6.3 82 │ Arctic fox 12.5 83 │ Red fox 9.8 68 rows omitted

as well as vectors of variable names

julia> varnames = ["name", "sleep_total"]2-element Vector{String}:
 "name"
 "sleep_total"
julia> @select msleep $varnames83×2 DataFrame Row name sleep_total String31 Float64 ─────┼───────────────────────────────────────── 1 │ Cheetah 12.1 2 │ Owl monkey 17.0 3 │ Mountain beaver 14.4 4 │ Greater short-tailed shrew 14.9 5 │ Cow 4.0 6 │ Three-toed sloth 14.4 7 │ Northern fur seal 8.7 8 │ Vesper mouse 7.0 ⋮ │ ⋮ ⋮ 77 │ Brazilian tapir 4.4 78 │ Tenrec 15.6 79 │ Tree shrew 8.9 80 │ Bottle-nosed dolphin 5.2 81 │ Genet 6.3 82 │ Arctic fox 12.5 83 │ Red fox 9.8 68 rows omitted

Similarly, to select the first column, use the syntax $1.

julia> @select msleep $183×1 DataFrame
 Row  name                       
      String31                   
─────┼────────────────────────────
   1 │ Cheetah
   2 │ Owl monkey
   3 │ Mountain beaver
   4 │ Greater short-tailed shrew
   5 │ Cow
   6 │ Three-toed sloth
   7 │ Northern fur seal
   8 │ Vesper mouse
  ⋮  │             ⋮
  77 │ Brazilian tapir
  78 │ Tenrec
  79 │ Tree shrew
  80 │ Bottle-nosed dolphin
  81 │ Genet
  82 │ Arctic fox
  83 │ Red fox
                   68 rows omitted

To select all the columns except a specific column, use the Not function for inverse selection.

julia> @select msleep Not(:name)83×10 DataFrame
 Row  genus        vore      order           conservation  sleep_total  sleep      String15     String7?  String15        String15?     Float64      Float ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ Acinonyx     carni     Carnivora       lc                   12.1  missi ⋯
   2 │ Aotus        omni      Primates        missing              17.0
   3 │ Aplodontia   herbi     Rodentia        nt                   14.4
   4 │ Blarina      omni      Soricomorpha    lc                   14.9
   5 │ Bos          herbi     Artiodactyla    domesticated          4.0        ⋯
   6 │ Bradypus     herbi     Pilosa          missing              14.4
   7 │ Callorhinus  carni     Carnivora       vu                    8.7
   8 │ Calomys      missing   Rodentia        missing               7.0  missi
  ⋮  │      ⋮          ⋮            ⋮              ⋮             ⋮           ⋮ ⋱
  77 │ Tapirus      herbi     Perissodactyla  vu                    4.4        ⋯
  78 │ Tenrec       omni      Afrosoricida    missing              15.6
  79 │ Tupaia       omni      Scandentia      missing               8.9
  80 │ Tursiops     carni     Cetacea         missing               5.2  missi
  81 │ Genetta      carni     Carnivora       missing               6.3        ⋯
  82 │ Vulpes       carni     Carnivora       missing              12.5  missi
  83 │ Vulpes       carni     Carnivora       missing               9.8
                                                   5 columns and 68 rows omitted

To select a range of columns by name, use the Between operator:

julia> @select msleep Between(:name, :order)83×4 DataFrame
 Row  name                        genus        vore      order          
      String31                    String15     String7?  String15       
─────┼───────────────────────────────────────────────────────────────────
   1 │ Cheetah                     Acinonyx     carni     Carnivora
   2 │ Owl monkey                  Aotus        omni      Primates
   3 │ Mountain beaver             Aplodontia   herbi     Rodentia
   4 │ Greater short-tailed shrew  Blarina      omni      Soricomorpha
   5 │ Cow                         Bos          herbi     Artiodactyla
   6 │ Three-toed sloth            Bradypus     herbi     Pilosa
   7 │ Northern fur seal           Callorhinus  carni     Carnivora
   8 │ Vesper mouse                Calomys      missing   Rodentia
  ⋮  │             ⋮                    ⋮          ⋮            ⋮
  77 │ Brazilian tapir             Tapirus      herbi     Perissodactyla
  78 │ Tenrec                      Tenrec       omni      Afrosoricida
  79 │ Tree shrew                  Tupaia       omni      Scandentia
  80 │ Bottle-nosed dolphin        Tursiops     carni     Cetacea
  81 │ Genet                       Genetta      carni     Carnivora
  82 │ Arctic fox                  Vulpes       carni     Carnivora
  83 │ Red fox                     Vulpes       carni     Carnivora
                                                          68 rows omitted

To select all columns that start with the character string "sl" use regular expressions in conjunction with Cols.

julia> @select msleep Cols(r"^sl")83×3 DataFrame
 Row  sleep_total  sleep_rem  sleep_cycle    
      Float64      Float64?   Float64?       
─────┼────────────────────────────────────────
   1 │        12.1  missing    missing        
   2 │        17.0        1.8  missing        
   3 │        14.4        2.4  missing        
   4 │        14.9        2.3        0.133333
   5 │         4.0        0.7        0.666667
   6 │        14.4        2.2        0.766667
   7 │         8.7        1.4        0.383333
   8 │         7.0  missing    missing        
  ⋮  │      ⋮           ⋮            ⋮
  77 │         4.4        1.0        0.9
  78 │        15.6        2.3  missing        
  79 │         8.9        2.6        0.233333
  80 │         5.2  missing    missing        
  81 │         6.3        1.3  missing        
  82 │        12.5  missing    missing        
  83 │         9.8        2.4        0.35
                               68 rows omitted

Regular expressions are powerful, but can be difficult for new users to understand. Here are some quick tips.

  1. r"^abc" = Starts with "abc"
  2. r"abc$" = Ends with "abc"
  3. r"abc" = Contains "abc" anywhere.

Selecting Rows Using @subset and @rsubset

Filter the rows for mammals that sleep a total of more than 16 hours.

julia> @subset msleep :sleep_total .>= 168×11 DataFrame
 Row  name                    genus         vore      order            conser      String31                String15      String7?  String15         String ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ Owl monkey              Aotus         omni      Primates         missin ⋯
   2 │ Long-nosed armadillo    Dasypus       carni     Cingulata        lc
   3 │ North American Opossum  Didelphis     omni      Didelphimorphia  lc
   4 │ Big brown bat           Eptesicus     insecti   Chiroptera       lc
   5 │ Thick-tailed opposum    Lutreolina    carni     Didelphimorphia  lc     ⋯
   6 │ Little brown bat        Myotis        insecti   Chiroptera       missin
   7 │ Giant armadillo         Priodontes    insecti   Cingulata        en
   8 │ Arctic ground squirrel  Spermophilus  herbi     Rodentia         lc
                                                               7 columns omitted

In the above expression, the .>= means we "broadcast" the >= comparison across the whole column. We can use a simpler syntax, @rsubset which automatically broadcasts all operations.

julia> @rsubset msleep :sleep_total > 168×11 DataFrame
 Row  name                    genus         vore      order            conser      String31                String15      String7?  String15         String ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ Owl monkey              Aotus         omni      Primates         missin ⋯
   2 │ Long-nosed armadillo    Dasypus       carni     Cingulata        lc
   3 │ North American Opossum  Didelphis     omni      Didelphimorphia  lc
   4 │ Big brown bat           Eptesicus     insecti   Chiroptera       lc
   5 │ Thick-tailed opposum    Lutreolina    carni     Didelphimorphia  lc     ⋯
   6 │ Little brown bat        Myotis        insecti   Chiroptera       missin
   7 │ Giant armadillo         Priodontes    insecti   Cingulata        en
   8 │ Arctic ground squirrel  Spermophilus  herbi     Rodentia         lc
                                                               7 columns omitted

Subset the rows for mammals that sleep a total of more than 16 hours and have a body weight of greater than 1 kilogram. For this we put multiple operations on separate lines in a single block.

julia> @rsubset msleep begin
           :sleep_total >= 16
           :bodywt >= 1
       end3×11 DataFrame
 Row  name                    genus       vore      order            conserva      String31                String15    String7?  String15         String15 ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ Long-nosed armadillo    Dasypus     carni     Cingulata        lc       ⋯
   2 │ North American Opossum  Didelphis   omni      Didelphimorphia  lc
   3 │ Giant armadillo         Priodontes  insecti   Cingulata        en
                                                               7 columns omitted

If you are coming from dplyr, you can also write the above command in a way that looks more familiar.

julia> @rsubset(msleep, :sleep_total >= 16, :bodywt >= 1)3×11 DataFrame
 Row  name                    genus       vore      order            conserva      String31                String15    String7?  String15         String15 ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ Long-nosed armadillo    Dasypus     carni     Cingulata        lc       ⋯
   2 │ North American Opossum  Didelphis   omni      Didelphimorphia  lc
   3 │ Giant armadillo         Priodontes  insecti   Cingulata        en
                                                               7 columns omitted

Filter the rows for mammals in the Perissodactyla and Primates taxonomic order. We wrap code in a let block to ensure things are fast.

julia> let
           relevant_orders = Set(["Perissodactyla", "Primates"])
           @rsubset msleep :order in relevant_orders
       end15×11 DataFrame
 Row  name             genus          vore      order           conservation       String31         String15       String7?  String15        String15?     ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ Owl monkey       Aotus          omni      Primates        missing       ⋯
   2 │ Grivet           Cercopithecus  omni      Primates        lc
   3 │ Horse            Equus          herbi     Perissodactyla  domesticated
   4 │ Donkey           Equus          herbi     Perissodactyla  domesticated
   5 │ Patas monkey     Erythrocebus   omni      Primates        lc            ⋯
   6 │ Galago           Galago         omni      Primates        missing      
   7 │ Human            Homo           omni      Primates        missing      
   8 │ Mongoose lemur   Lemur          herbi     Primates        vu
   9 │ Macaque          Macaca         omni      Primates        missing       ⋯
  10 │ Slow loris       Nyctibeus      carni     Primates        missing      
  11 │ Chimpanzee       Pan            omni      Primates        missing      
  12 │ Baboon           Papio          omni      Primates        missing      
  13 │ Potto            Perodicticus   omni      Primates        lc            ⋯
  14 │ Squirrel monkey  Saimiri        omni      Primates        missing      
  15 │ Brazilian tapir  Tapirus        herbi     Perissodactyla  vu
                                                               6 columns omitted

You can use the boolean operators (e.g. >, <, >=, <=, !=, in) to create the logical tests.

Chain.jl

Before we go any further, let's introduce the @chain macro from Chain.jl, which is re-exported by DataFramesMeta.jl. @chain allows for you to pipe the output from one operation into the input of another operation. The idea of piping is to read the functions from left to right. The syntax and design of @chain is very similar to %>% which users of dplyr are familiar with.

To show it's usefulness, let's use @select and @rsubset one after the other.

msleep_1 = @select msleep :name :sleep_total
msleep_2 = @rsubset msleep_1 :sleep_total > 16

Now in this case, we will pipe the msleep data frame to the function that will select two columns (:name and :sleep_total) and then pipe the new data frame to the @rsubset operation. This method involves awkwardly creating and naming temporary data frames. We can avoid this with @chain.

julia> @chain msleep begin
           @select :name :sleep_total
           @rsubset :sleep_total > 16
       end8×2 DataFrame
 Row  name                    sleep_total 
      String31                Float64     
─────┼─────────────────────────────────────
   1 │ Owl monkey                     17.0
   2 │ Long-nosed armadillo           17.4
   3 │ North American Opossum         18.0
   4 │ Big brown bat                  19.7
   5 │ Thick-tailed opposum           19.4
   6 │ Little brown bat               19.9
   7 │ Giant armadillo                18.1
   8 │ Arctic ground squirrel         16.6

You will soon see how useful the @chain macro is when we start to combine many functions.

Back To dplyr Verbs In Action

Now that you know about the @chain macro, we will use it throughout the rest of this tutorial.

Arrange Or Re-order Rows Using @orderby

To arrange (or re-order) rows by a particular column, such as the taxonomic order, list the name of the column you want to arrange the rows by:

julia> @orderby msleep :order83×11 DataFrame
 Row  name                        genus      vore      order         conserva      String31                    String15   String7?  String15      String15 ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ Tenrec                      Tenrec     omni      Afrosoricida  missing  ⋯
   2 │ Cow                         Bos        herbi     Artiodactyla  domestic
   3 │ Roe deer                    Capreolus  herbi     Artiodactyla  lc
   4 │ Goat                        Capri      herbi     Artiodactyla  lc
   5 │ Giraffe                     Giraffa    herbi     Artiodactyla  cd       ⋯
   6 │ Sheep                       Ovis       herbi     Artiodactyla  domestic
   7 │ Pig                         Sus        omni      Artiodactyla  domestic
   8 │ Cheetah                     Acinonyx   carni     Carnivora     lc
  ⋮  │             ⋮                   ⋮         ⋮           ⋮             ⋮   ⋱
  77 │ Eastern american chipmunk   Tamias     herbi     Rodentia      missing  ⋯
  78 │ Tree shrew                  Tupaia     omni      Scandentia    missing 
  79 │ Greater short-tailed shrew  Blarina    omni      Soricomorpha  lc
  80 │ Star-nosed mole             Condylura  omni      Soricomorpha  lc
  81 │ Lesser short-tailed shrew   Cryptotis  omni      Soricomorpha  lc       ⋯
  82 │ Eastern american mole       Scalopus   insecti   Soricomorpha  lc
  83 │ Musk shrew                  Suncus     missing   Soricomorpha  missing 
                                                   7 columns and 68 rows omitted

Now we will select three columns from msleep, arrange the rows by the taxonomic order and then arrange the rows by :sleep_total. Finally, keep the first 10 rows of the data frame.

julia> @chain msleep begin
           @select :name :order :sleep_total
           @orderby :order :sleep_total
           first(10)
       end10×3 DataFrame
 Row  name          order         sleep_total 
      String31      String15      Float64     
─────┼─────────────────────────────────────────
   1 │ Tenrec        Afrosoricida         15.6
   2 │ Giraffe       Artiodactyla          1.9
   3 │ Roe deer      Artiodactyla          3.0
   4 │ Sheep         Artiodactyla          3.8
   5 │ Cow           Artiodactyla          4.0
   6 │ Goat          Artiodactyla          5.3
   7 │ Pig           Artiodactyla          9.1
   8 │ Caspian seal  Carnivora             3.5
   9 │ Gray seal     Carnivora             6.2
  10 │ Genet         Carnivora             6.3

The last line of the above block, first(10), does not have @. This is because first is a Julia function, not a macro, whose names always begin with @.

Same as above, except here we filter the rows for mammals that sleep for 16 or more hours, instead of showing the head of the final data frame:

julia> @chain msleep begin
           @select :name :order :sleep_total
           @orderby :order :sleep_total
           @rsubset :sleep_total > 16
       end8×3 DataFrame
 Row  name                    order            sleep_total 
      String31                String15         Float64     
─────┼──────────────────────────────────────────────────────
   1 │ Big brown bat           Chiroptera              19.7
   2 │ Little brown bat        Chiroptera              19.9
   3 │ Long-nosed armadillo    Cingulata               17.4
   4 │ Giant armadillo         Cingulata               18.1
   5 │ North American Opossum  Didelphimorphia         18.0
   6 │ Thick-tailed opposum    Didelphimorphia         19.4
   7 │ Owl monkey              Primates                17.0
   8 │ Arctic ground squirrel  Rodentia                16.6

Something slightly more complicated: same as above, except arrange the rows in the :sleep_total column in a descending order. Since this column is numeric we can do it by negating it:

julia> @chain msleep begin
           @select :name :order :sleep_total
           @orderby begin
               :order
               -:sleep_total
           end
           @rsubset :sleep_total >= 16
       end8×3 DataFrame
 Row  name                    order            sleep_total 
      String31                String15         Float64     
─────┼──────────────────────────────────────────────────────
   1 │ Little brown bat        Chiroptera              19.9
   2 │ Big brown bat           Chiroptera              19.7
   3 │ Giant armadillo         Cingulata               18.1
   4 │ Long-nosed armadillo    Cingulata               17.4
   5 │ Thick-tailed opposum    Didelphimorphia         19.4
   6 │ North American Opossum  Didelphimorphia         18.0
   7 │ Owl monkey              Primates                17.0
   8 │ Arctic ground squirrel  Rodentia                16.6

Alternatively, using the StatsBase.jl package you could have written ordinalrank(:sleep_total, rev=true). Finally, if you wanted to avoid using an additional package you could use the following expression invperm(sortperm(:sleep_total, rev=true)).

Here is a minimal example of sorting string column in reverse:

julia> using StatsBase
julia> df = DataFrame(group=[1, 2, 1, 2, 1], name = ["Bob", "Dexter", "Alice", "Eve", "Cedric"])5×2 DataFrame Row group name Int64 String ─────┼─────────────── 1 │ 1 Bob 2 │ 2 Dexter 3 │ 1 Alice 4 │ 2 Eve 5 │ 1 Cedric
julia> @orderby df begin :group ordinalrank(:name, rev=true) end5×2 DataFrame Row group name Int64 String ─────┼─────────────── 1 │ 1 Cedric 2 │ 1 Bob 3 │ 1 Alice 4 │ 2 Eve 5 │ 2 Dexter

Create New Columns Using @transform and @rtransform

The @transform macro will add new columns to the data frame. Like with other macros, use @rtransform to operate row-wise. Create a new column called :rem_proportion, which is the ratio of rem sleep to total amount of sleep.

julia> @rtransform msleep :rem_proportion = :sleep_rem / :sleep_total83×12 DataFrame
 Row  name                        genus        vore      order           cons      String31                    String15     String7?  String15        Stri ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ Cheetah                     Acinonyx     carni     Carnivora       lc   ⋯
   2 │ Owl monkey                  Aotus        omni      Primates        miss
   3 │ Mountain beaver             Aplodontia   herbi     Rodentia        nt
   4 │ Greater short-tailed shrew  Blarina      omni      Soricomorpha    lc
   5 │ Cow                         Bos          herbi     Artiodactyla    dome ⋯
   6 │ Three-toed sloth            Bradypus     herbi     Pilosa          miss
   7 │ Northern fur seal           Callorhinus  carni     Carnivora       vu
   8 │ Vesper mouse                Calomys      missing   Rodentia        miss
  ⋮  │             ⋮                    ⋮          ⋮            ⋮              ⋱
  77 │ Brazilian tapir             Tapirus      herbi     Perissodactyla  vu   ⋯
  78 │ Tenrec                      Tenrec       omni      Afrosoricida    miss
  79 │ Tree shrew                  Tupaia       omni      Scandentia      miss
  80 │ Bottle-nosed dolphin        Tursiops     carni     Cetacea         miss
  81 │ Genet                       Genetta      carni     Carnivora       miss ⋯
  82 │ Arctic fox                  Vulpes       carni     Carnivora       miss
  83 │ Red fox                     Vulpes       carni     Carnivora       miss
                                                   8 columns and 68 rows omitted

You can add many columns at a time by placing multiple operations in a block.

julia> @rtransform msleep begin
           :rem_proportion = :sleep_rem / :sleep_total
           :bodywt_grams = :bodywt * 1000
       end83×13 DataFrame
 Row  name                        genus        vore      order           cons      String31                    String15     String7?  String15        Stri ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ Cheetah                     Acinonyx     carni     Carnivora       lc   ⋯
   2 │ Owl monkey                  Aotus        omni      Primates        miss
   3 │ Mountain beaver             Aplodontia   herbi     Rodentia        nt
   4 │ Greater short-tailed shrew  Blarina      omni      Soricomorpha    lc
   5 │ Cow                         Bos          herbi     Artiodactyla    dome ⋯
   6 │ Three-toed sloth            Bradypus     herbi     Pilosa          miss
   7 │ Northern fur seal           Callorhinus  carni     Carnivora       vu
   8 │ Vesper mouse                Calomys      missing   Rodentia        miss
  ⋮  │             ⋮                    ⋮          ⋮            ⋮              ⋱
  77 │ Brazilian tapir             Tapirus      herbi     Perissodactyla  vu   ⋯
  78 │ Tenrec                      Tenrec       omni      Afrosoricida    miss
  79 │ Tree shrew                  Tupaia       omni      Scandentia      miss
  80 │ Bottle-nosed dolphin        Tursiops     carni     Cetacea         miss
  81 │ Genet                       Genetta      carni     Carnivora       miss ⋯
  82 │ Arctic fox                  Vulpes       carni     Carnivora       miss
  83 │ Red fox                     Vulpes       carni     Carnivora       miss
                                                   9 columns and 68 rows omitted

Using @transform instead of @rtransform will let us work with the column as a whole, and not a single row at a time. Let's create a new variable showing how far an animal's sleep time is from the average of all animals.

julia> @transform msleep :demeand_sleep = :sleep_total .- mean(:sleep_total)83×12 DataFrame
 Row  name                        genus        vore      order           cons      String31                    String15     String7?  String15        Stri ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ Cheetah                     Acinonyx     carni     Carnivora       lc   ⋯
   2 │ Owl monkey                  Aotus        omni      Primates        miss
   3 │ Mountain beaver             Aplodontia   herbi     Rodentia        nt
   4 │ Greater short-tailed shrew  Blarina      omni      Soricomorpha    lc
   5 │ Cow                         Bos          herbi     Artiodactyla    dome ⋯
   6 │ Three-toed sloth            Bradypus     herbi     Pilosa          miss
   7 │ Northern fur seal           Callorhinus  carni     Carnivora       vu
   8 │ Vesper mouse                Calomys      missing   Rodentia        miss
  ⋮  │             ⋮                    ⋮          ⋮            ⋮              ⋱
  77 │ Brazilian tapir             Tapirus      herbi     Perissodactyla  vu   ⋯
  78 │ Tenrec                      Tenrec       omni      Afrosoricida    miss
  79 │ Tree shrew                  Tupaia       omni      Scandentia      miss
  80 │ Bottle-nosed dolphin        Tursiops     carni     Cetacea         miss
  81 │ Genet                       Genetta      carni     Carnivora       miss ⋯
  82 │ Arctic fox                  Vulpes       carni     Carnivora       miss
  83 │ Red fox                     Vulpes       carni     Carnivora       miss
                                                   8 columns and 68 rows omitted

Finally, note that you can create a new column with the name taken from an existing variable, or a new column name with spaces in it, with $

julia> newname = :rem_proportion:rem_proportion
julia> @rtransform msleep begin $newname = :sleep_rem / :sleep_total $"Body weight in grams" = :bodywt * 1000 end83×13 DataFrame Row name genus vore order cons String31 String15 String7? String15 Stri ⋯ ─────┼────────────────────────────────────────────────────────────────────────── 1 │ Cheetah Acinonyx carni Carnivora lc ⋯ 2 │ Owl monkey Aotus omni Primates miss 3 │ Mountain beaver Aplodontia herbi Rodentia nt 4 │ Greater short-tailed shrew Blarina omni Soricomorpha lc 5 │ Cow Bos herbi Artiodactyla dome ⋯ 6 │ Three-toed sloth Bradypus herbi Pilosa miss 7 │ Northern fur seal Callorhinus carni Carnivora vu 8 │ Vesper mouse Calomys missing Rodentia miss ⋮ │ ⋮ ⋮ ⋮ ⋮ ⋱ 77 │ Brazilian tapir Tapirus herbi Perissodactyla vu ⋯ 78 │ Tenrec Tenrec omni Afrosoricida miss 79 │ Tree shrew Tupaia omni Scandentia miss 80 │ Bottle-nosed dolphin Tursiops carni Cetacea miss 81 │ Genet Genetta carni Carnivora miss ⋯ 82 │ Arctic fox Vulpes carni Carnivora miss 83 │ Red fox Vulpes carni Carnivora miss 9 columns and 68 rows omitted

Create Summaries of the Data Frame using @combine

The @combine macro will create summary statistics for a given column in the data frame, such as finding the mean. For example, to compute the average number of hours of sleep, apply the mean function to the column :sleep_total and call the summary value :avg_sleep.

julia> @chain msleep @combine :avg_sleep = mean(:sleep_total)1×1 DataFrame
 Row  avg_sleep 
      Float64   
─────┼───────────
   1 │   10.4337

There are many other summary statistics you could consider such std, minimum, maximum, median, sum, length (returns the length of vector), first (returns first value in vector), and last (returns last value in vector).

julia> @combine msleep begin
           :avg_sleep = mean(:sleep_total)
           :min_sleep = minimum(:sleep_total)
           :max_sleep = maximum(:sleep_total)
           :total = length(:sleep_total)
       end1×4 DataFrame
 Row  avg_sleep  min_sleep  max_sleep  total 
      Float64    Float64    Float64    Int64 
─────┼────────────────────────────────────────
   1 │   10.4337        1.9       19.9     83

DataFrames.jl also provides the function describe which performs many of these summaries automatically.

julia> describe(msleep)11×7 DataFrame
 Row  variable      mean      min               median    max                      Symbol        Union…    Any               Union…    Any                 ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ name                    African elephant            Western american ch ⋯
   2 │ genus                   Acinonyx                    Vulpes
   3 │ vore                    carni                       omni
   4 │ order                   Afrosoricida                Soricomorpha
   5 │ conservation            cd                          vu                  ⋯
   6 │ sleep_total   10.4337   1.9               10.1      19.9
   7 │ sleep_rem     1.87541   0.1               1.5       6.6
   8 │ sleep_cycle   0.439583  0.116667          0.333333  1.5
   9 │ awake         13.5675   4.1               13.9      22.1                ⋯
  10 │ brainwt       0.281581  0.00014           0.0124    5.712
  11 │ bodywt        166.136   0.005             1.67      6654.0
                                                               3 columns omitted

Group Operations using @groupby and @combine

The @groupby verb is the first step in the "split-apply-combine" workflow. We literally want to split the data frame by some variable (e.g. taxonomic order), apply a function to the individual data frames and then combine the output.

Let's do that: split the msleep data frame by the taxonomic order, then ask for the same summary statistics as above. We expect a set of summary statistics for each taxonomic order.

julia> @chain msleep begin
           @groupby :order
           @combine begin
               :avg_sleep = mean(:sleep_total)
               :min_sleep = minimum(:sleep_total)
               :max_sleep = maximum(:sleep_total)
               :total = length(:sleep_total)
           end
       end19×5 DataFrame
 Row  order           avg_sleep  min_sleep  max_sleep  total 
      String15        Float64    Float64    Float64    Int64 
─────┼────────────────────────────────────────────────────────
   1 │ Carnivora        10.1167         3.5       15.8     12
   2 │ Primates         10.5            8.0       17.0     12
   3 │ Rodentia         12.4682         7.0       16.6     22
   4 │ Soricomorpha     11.1            8.4       14.9      5
   5 │ Artiodactyla      4.51667        1.9        9.1      6
   6 │ Pilosa           14.4           14.4       14.4      1
   7 │ Cingulata        17.75          17.4       18.1      2
   8 │ Hyracoidea        5.66667        5.3        6.3      3
  ⋮  │       ⋮             ⋮          ⋮          ⋮        ⋮
  13 │ Erinaceomorpha   10.2           10.1       10.3      2
  14 │ Cetacea           4.5            2.7        5.6      3
  15 │ Lagomorpha        8.4            8.4        8.4      1
  16 │ Diprotodontia    12.4           11.1       13.7      2
  17 │ Monotremata       8.6            8.6        8.6      1
  18 │ Afrosoricida     15.6           15.6       15.6      1
  19 │ Scandentia        8.9            8.9        8.9      1
                                                4 rows omitted

Split-apply-combine can also be used with @transform to add new variables to a data frame by performing operations by group. For instance, we can de-mean the total hours of sleep of an animal relative to other animals in the same genus.

julia> @chain msleep begin
           @groupby :order
           @transform :sleep_genus = :sleep_total .- mean(:sleep_total)
       end83×12 DataFrame
 Row  name                        genus        vore      order           cons      String31                    String15     String7?  String15        Stri ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ Cheetah                     Acinonyx     carni     Carnivora       lc   ⋯
   2 │ Owl monkey                  Aotus        omni      Primates        miss
   3 │ Mountain beaver             Aplodontia   herbi     Rodentia        nt
   4 │ Greater short-tailed shrew  Blarina      omni      Soricomorpha    lc
   5 │ Cow                         Bos          herbi     Artiodactyla    dome ⋯
   6 │ Three-toed sloth            Bradypus     herbi     Pilosa          miss
   7 │ Northern fur seal           Callorhinus  carni     Carnivora       vu
   8 │ Vesper mouse                Calomys      missing   Rodentia        miss
  ⋮  │             ⋮                    ⋮          ⋮            ⋮              ⋱
  77 │ Brazilian tapir             Tapirus      herbi     Perissodactyla  vu   ⋯
  78 │ Tenrec                      Tenrec       omni      Afrosoricida    miss
  79 │ Tree shrew                  Tupaia       omni      Scandentia      miss
  80 │ Bottle-nosed dolphin        Tursiops     carni     Cetacea         miss
  81 │ Genet                       Genetta      carni     Carnivora       miss ⋯
  82 │ Arctic fox                  Vulpes       carni     Carnivora       miss
  83 │ Red fox                     Vulpes       carni     Carnivora       miss
                                                   8 columns and 68 rows omitted

This short tutorial only touches on the wide array of features in Julia, DataFrames.jl, and DataFramesMeta.jl. Read the full documentation for more information.