Avro
Documentation for Avro.
Avro.Avro
— Module
The Avro.jl package provides a pure Julia implementation for reading and writing data in the avro format.
Implementation status
It currently supports:
- All primitive types
- All nested/complex types
- Logical types listed in the spec (Decimal, UUID, Date, Time, Timestamps, Duration)
- Binary encoding/decoding
- Reading/writing object container files via the Tables.jl interface
- The xz, zstd, deflate, and bzip2 compression codecs for object container files
Currently not supported are:
- JSON encoding/decoding of objects
- Single object encoding or schema fingerprints
- Schema resolution
- Protocol messages, calls, handshakes
- Snappy compression
Package motivation
Why use the avro format vs. other data formats? Some benefits include:
- Very concise binary encoding, especially object container files with compression
- Very fast reading/writing
- Objects/data must have a well-defined schema
- One of the few "row-oriented" binary data formats
Getting started
The Avro.jl package supports two main APIs to interact with avro data. The first is similar to the JSON3.jl struct API for interacting with json data, in large part due to the similarities between the avro format and json. This looks like:
buf = Avro.write(obj)
obj = Avro.read(buf, typeof(obj))
In short, we use Avro.write and provide an object obj to write out in the avro format. We can optionally provide a filename or IO as a first argument to write the data to. We can then read the data back in using Avro.read, where the first argument must be a filename, IO, or any AbstractVector{UInt8} byte buffer containing avro data. The 2nd argument is required and is the type of data to be read. This type can be provided as a simple Julia type (like Avro.read(buf, Vector{String})), or as a parsed avro schema, like Avro.read(buf, Avro.parseschema("schema.avsc")). Avro.parseschema takes a filename or json string representing the avro schema of the data to read and returns a "schema type" that can be passed to Avro.read.
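For example, a minimal round-trip sketch with a vector of strings (the file name here is illustrative):
# write a Vector{String} to an in-memory byte buffer, then read it back
buf = Avro.write(["hello", "world"])
obj = Avro.read(buf, Vector{String})
# or write directly to a file and read it back by providing the file name
Avro.write("strings.bin", ["hello", "world"])
obj2 = Avro.read("strings.bin", Vector{String})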
The second alternative API allows "packaging" the data's schema with the data in what the avro spec calls an "object container" file. While Avro.read/Avro.write require the user to already know or pass the schema externally, Avro.writetable and Avro.readtable can write/read avro object container files, and will take care of any schema writing/reading, compression, etc. automatically. These table functions unsurprisingly utilize the ubiquitous Tables.jl interface to facilitate integrations with other formats.
# write our input_table out to a file named "data.avro" using the zstd compression codec
# input_table can be any Tables.jl-compatible source, like CSV.File, Arrow.Table, DataFrame, etc.
Avro.writetable("data.avro", input_table; compress=:zstd)
# we can also read avro data from object container files
# if file uses compression, it will be decompressed automatically
# the schema of the data is packaged in the object container file itself
# and will be parsed before constructing the file table
tbl = Avro.readtable("data.avro")
# the returned type is `Avro.Table`, which satisfies the Tables.jl interface
# which means it can be sent to any valid sink function, like
# Arrow.write("data.arrow", tbl), CSV.write("data.csv", tbl), or DataFrame(tbl)
Avro.Table
— Type
Avro.Table
A Tables.jl-compatible source returned from Avro.readtable. Conceptually, it can be thought of as an AbstractVector{Record}, where Record is the avro version of a "row" or NamedTuple. Thus, Avro.Table supports indexing/iteration like an AbstractVector.
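For example (a sketch, assuming "data.avro" is an existing object container file):
tbl = Avro.readtable("data.avro")
nrows = length(tbl)       # behaves like an AbstractVector of records
first_rec = tbl[1]        # index to get a single record
for rec in tbl            # or iterate record by record
    # process each record here
end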
Avro.parseschema
— Method
Avro.parseschema(file_or_jsonstring)
Parse the avro schema in a file or raw json string. The schema is expected to follow the format as described in the official spec. Returns a "schema type" that can be passed to Avro.read(buf, sch) as the 2nd argument.
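For example, parsing a schema from a raw json string (the schema and buffer here are illustrative):
# an avro schema describing an array of strings
sch = Avro.parseschema("""{"type": "array", "items": "string"}""")
obj = Avro.read(buf, sch)   # buf is assumed to contain matching avro-encoded data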
Avro.read
— Function
Avro.read(source, T_or_sch) => T
Read an avro-encoded object of type T or avro schema sch from source, which can be a byte buffer (AbstractVector{UInt8}), a file name (String), or an IO. The data in source must be avro-formatted data, as no schema verification can be done. Note that "avro object container files" should be processed using Avro.readtable instead, where the data schema is encoded in the file itself. Also note that the 2nd argument can be a Julia type like Vector{String}, or a valid Avro.Schema type object, like that returned from Avro.parseschema(src).
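A usage sketch (the file name, buffer, and types are illustrative):
# read from a byte buffer using a plain Julia type
x = Avro.read(buf, Vector{Float64})
# or read from a file using a parsed schema
y = Avro.read("data.bin", Avro.parseschema("schema.avsc"))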
Avro.readtable
— Function
Avro.readtable(file_or_io) => Avro.Table
Read an avro object container file, returning an Avro.Table type, which is like an array of records, where each record follows the schema encoded with the file. Any compression will be detected and decompressed automatically when reading.
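A usage sketch (assuming DataFrames.jl is available as a Tables.jl sink):
using Avro, DataFrames
tbl = Avro.readtable("data.avro")   # compression, if any, is handled automatically
df = DataFrame(tbl)                 # materialize via the Tables.jl interface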
Avro.tobuffer
— Method
Avro.tobuffer(tbl; kw...)
Take a Tables.jl-compatible input tbl and call Avro.writetable with an IOBuffer, which is returned with its position at the beginning.
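For example (a sketch; tbl here is any Tables.jl-compatible table):
buf = Avro.tobuffer(tbl)      # IOBuffer positioned at the start of the avro data
tbl2 = Avro.readtable(buf)    # readtable accepts an IO argument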
Avro.write
— Function
Avro.write([filename|io,] x::T; kw...)
Write an object x of avro-supported type T in the avro format. If a file name is provided as a String as the 1st argument, the avro data will be written out to disk. Similarly, an IO argument can be provided as the 1st argument. If no destination is provided as the 1st argument, a byte buffer Vector{UInt8} will be returned with the avro data written to it. Supported keyword arguments include:
- schema: the type that should be used when encoding the object in the avro format; most common is providing a Union{...} type to write the data out specifically as a "union type" instead of only the type of the object; alternatively, a valid Avro.Schema can be passed, like the result of Avro.parseschema(src)
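For example, to encode a value with a union schema rather than its concrete type (a sketch; the particular union is illustrative):
# encode the Int64 value 1, but as the union type Union{Int64, String}
buf = Avro.write(1; schema=Union{Int64, String})
x = Avro.read(buf, Union{Int64, String})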
Avro.writetable
— Function
Avro.writetable(io_or_file, tbl; kw...)
Write an input Tables.jl-compatible source table tbl out as an avro object container file. io_or_file can be a file name as a String or an IO argument. If the input table supports Tables.partitions, each partition will be written as a separate "block" in the container file.
Because avro data is strictly typed, if the input table doesn't have a well-defined schema (i.e. Tables.schema(Tables.rows(tbl)) === nothing), then Tables.dictrowtable(Tables.rows(tbl)) will be called, which scans the input table, "building up" the schema based on the types of values found in each row.
Compression is supported via the compress keyword argument, which can currently be one of :zstd, :deflate, :bzip2, or :xz.
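For example (a sketch; the table here is a NamedTuple of vectors, which is a valid Tables.jl table):
tbl = (a = [1, 2, 3], b = ["x", "y", "z"])
Avro.writetable("data.avro", tbl; compress=:zstd)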