Most of the data formats that Athena supports have support for complex types in the form of lists and maps. It even supports lists and maps in CSV files, if you really want it to.
The relational model wasn’t made with complex types in mind, but modern data is very rarely flat. Developers like data formats like JSON because they allow for expressing things like sone-to-many relationships naturally and in a self-contained manner. When you try to describe the world you often end up with lists of things, the one-or-more type of relationship is very common, and things like lists of tags, or key/value pairs of metadata is more or less standard.
Before Athena I worked a lot with Redshift, and was often frustrated with the lack of complex types. The data I worked with almost always contained lists of strings and other similarly “simple” complex types. With Athena this is almost never an issue.
In this article I’m going to cover how to use complex types in three different contexts: how to create tables when you have complex types in data, how to work with complex types in queries, and how to deal with complex types in results.
First, let’s define complex types so that we’re on the same page.
Complex types defined
Simple types, or “scalars”, are things like
timestamp. Complex types are types that refer to other types – or more concretely in the case of Athena,
array (lists of elements),
map (key/value associations), and
struct (key/value associations with a fixed schema).
Complex types can be for example
array<string> (an array of strings),
map<string,boolean> (a map with string keys and boolean values), or
struct<name:string,age:tinyint> (a record with a string property called “name” and an integer property called “age”). Complex types can have arbitrarily complex structure, for example
array<struct<title:string,author:struct<name:string,email:string>,tags:map<string,string>>> could be a way to describe a list of article metadata.
Complex types in data
If you are using Athena to query JSON data you have most likely already worked with complex types in your data in the form of an array property or an object property. Being able to describe most JSON data in table form is one of the most powerful features of Athena.
In my experience, most JSON data isn’t very hierarchical. Most of the properties are usually scalar, with one or two lists of either strings or fairly simple objects with one or two properties each.
The official Athena documentation makes a fairly good job at describing how to create tables for JSON data.
What’s good to know is that Athena is farily forgiving when it comes to describing complex types. If you define a column as
array<string> and the JSON document has a list of numbers, that works too, as long as type coercion works there is usually no problem. More importantly, if you define a struct column, the fields of the struct work just like the columns of a table: if they’re not found in the data the value defaults to
NULL, and if there are more properties in the data than in the table definition, those properties are ignored.
Schemaless or schemafree?
JSON doesn’t require a schema, although most JSON data probably has some kind of informal schema defined by the code that produces it. This is roughly what the term “schemaless” means; there’s no formal schema, and it’s possible for documents to have slightly different schemas, but when zooming out and looking at all documents together they look more alike than not.
If you don’t control the code that produces a data set it can take some time to figure out what its schema is. Experimenting with different table designs in Athena is cheap, though. If you get it wrong you can just drop the table and recreate it with a new schema that captures some aspect you had previously missed. There’s very rarely need to fix the data, Athena is all about making it possible to read the data you have.
Lots of tools exist that are aimed at helping you figure out the schema of JSON data. Glue Crawlers, for example, will read a sample of your data and figure out which properties exist, and their types, and at re:Invent 2019, AWS launched a feature that detects the schemas of events sent through EventBridge. In some situations tools like these might be helpful, but in other situations they might just make it worse. Glue Crawlers can, for example, flip-flop the schema of a table from run to run if the sample they look at each time is different enough. Say your documents have a property that contains arbitary key/value pairs from your users. Glue won’t figure out that there is really no schema there, and instead find different schemas on different days.
One exampe of free-form properties like these can be found in CloudTrail logs. There are properties like
additionalEventData that are service specific. Even if these properties could potentially be described using the union of what all AWS services put into them today, we can’t tell how future services will use them. These properties both have and don’t have schemas, depending on your point of view. From a global perspective where you want to consume events from all services, and future services, you’ll have to think of them as free-form structures.
Luckily, there’s a solution for that, and I’m going to describe it in detail when discussing how to work with complex types in queries and free form structures. For now, you can do as the Athena documentation about CloudTrail suggests and use the type
string for columns where the type can’t be pinned down. You can also use
map<string,string> when you know a property is an object, but not what the values are.
Parquet, ORC, and Avro
Parquet, ORC, and Avro are also data formats that support hierarchical structures with lists, maps, and/or structs. One important difference between these and JSON is that these formats have formal schemas, either embedded or in side-cars. Each document in a file will conform to the same schema, and creating a table is often just a matter of translating this schema to Athena compatible types.
JSON and Avro both require keys in objects/maps to be strings, but Parquet and ORC have are more relaxed and allow keys to be of other types. Athena supports this fully and even though your data is probably the limiting factor, you can declare tables with maps where the keys are any scalar type. Complex type keys are not supported, though.
Struct or map?
From the table modelling perspective the struct and map types overlap. In many situations where you have an object property in JSON, for example, you could use either. My general advice is that if you know the names of the fields of the object property you should use a struct in your table. However, if property is free form, as in the CloudTrail example above, a map is more suitable.
Map functions won’t work on struct columns, so if you are planning to use map functions a lot then using the map type will make that easier.
Again, experimenting with table design in Athena is cheap, don’t worry about getting it right the first time. You can always drop the table and create a new one that works better with no or minimal downtime.
Complex types in queries
SQL wasn’t designed with complex types in mind, and most of the support in Athena comes in the form of aggregate functions, and what’s called “lambda expressions”. These are pieces of functional code that can look very out of place in a SQL statement, but can be very powerful when working with complex types. I’ll show you some examples of lambda expressions below.
The complex type that looks the least out of place in SQL is probably
struct. Struct fields are accessed using dot notation, so if there’s a column
author defined as
struct<name:string,email:string> you would access them something like this:
SELECT author.name, author.email FROM my_table
Array access looks like it does in most programming languages with literal lists, you use square brackets to access the elements by index – but in SQL indexes start at 1!
The first element of a column called
tags defined as
array<string> can be accessed like this:
SELECT tags AS first_tag, tags AS second_tag FROM my_table
Elements can also be accessed with
element_at(tags, 1). This variant also allows access from the end with indexes from 0 and negative numbers, e.g.
element_at(tags, 0) is the last element of the array.
Be aware that trying to access an element that doesn’t exist results in a runtime error. The query above would only succeed if every row had two or more elements in their
tags array. If you access array elements by index you can wrap the expression in
try to avoid the error, e.g.
try(tags), or select only rows that have enough elemenents using something like
WHERE cardinality(tags) >= 2.
Accessing array elements by index is rare, though. More common is to flatten the array into a single value, similar to what you do in a
GROUP BY query. There are in fact a whole set of functions that operate on arrays with names similar to the common aggregate functions:
array_distinct, as well as
join to create a string, and
cardinality to count the numer of elements. When a simple aggregate function isn’t enough there’s also
reduce which can be used together with lambda expressions to implement almost any aggregation over an array – more on that below.
There are also functions that produce new arrays, like
concat for combining arrays (which can also be done with
||, just as with strings), or through set operations like
array_intersect, as well as
zip, which combines pairs of elements of multiple arrays – all that can be used in arbitrary combinations to create new arrays, either before aggregation, or as means to produce new arrays for subsequent processing steps.
Sometimes you don’t have an array but want to create one. One way is with the
ARRAY constructor which can create arrays from literals or column references, e.g.
SELECT ARRAY[col1, 'hello', col2] – just make sure that all elements are of the same type.
A really powerful way to create arrays is the
array_agg aggregate function. It can be used to create an array from a group of rows in a
GROUP BY. Say you have a data set created in a database without support for arrays, where you would model a list as a table with an ID column and a value column. Using
array_agg you can create a relation with the values as an array, which can then be processed further as needed:
SELECT article_id, ARRAY_AGG(tag) AS tags FROM article_tags GROUP BY article_id
Other times you want to go in the other direction, you have an array but you want a row per element in the array. For this you can use the
UNNEST operation, which I have have written a separate article about. In many ways you can think of
UNNEST as the reverse of
An column called
params with type
map<string,string> can be accessed like this:
SELECT params['id'] AS id, params['value'] AS value FROM my_table
element_at function also works with maps, e.g.
element_at(tags, 'id'). However, both square brackets and
NULL when an element is not found, in contrast to the array versions that result in an error.
Although less common, maps can be used with
UNNEST to pivot them into rows. See the article
UNNEST for more information.
Many times you want to process the elements of arrays and maps. In many programming languages there is the concept of lambda functions, anonymous functions, or blocks, that are often used to process lists in very compact ways. Athena has something similar, and many functions that process complex types support them.
Say you have an array of strings and want to uppercase them, this can be done like this:
SELECT transform(tags, tag -> upper(tag)) FROM my_table
The lambda expression is in this case
tag -> upper(tag). The symbol before
-> is the argument and the expression after is the body. The body can use almost all functions, and in the example above I use the
upper string function.
Functions that take lambda expressions act each row individually, they are not aggregate functions in the SQL sense. It helps me to think of a regular aggregate function like
max as operating vertically on a table, and an array function like
array_max as operating horizontally on the elements of a column of a single row.
Functions that operate on arrays usually have one argument, the element, and functions that operate on maps two, the key and the value. When there is more than one argument the argument list must be enclosed in parenthesis, with one argument the parenthesis are optional. This is, for example, how you create a new map with all values in upper case:
SELECT transform_values(params, (key, value) -> upper(value)) FROM my_table
Working with free form structures
Sometimes it’s not possible to describe the shape of a data set definitely. There may be parts of it that is free form, like the CloudTrail event properties I discussed above. For these situations, and others, Athena has a type that is only available at runtime called
JSON. You can’t declare a column to be of type
JSON, you have to use the
string type and cast to
JSON in queries, or use one of Athena’s JSON functions.
requestParameters property of CloudTrail events is almost always a map with string keys and complex values, though sometimes it’s
NULL and sometimes it’s the string “null”. Because of the occational string value you have to use the type
string for the column, otherwise
map<string,string> could have worked too.
When something decrypts using KMS it generates a CloudTrail event with these request parameters:
Say you want to do some statistics on this, for example count the number of decryptions with different encryption algorithms, or which Lambda functions are doing the most decryptions. Athena comes with a library of JSON functions that let you write JSONPath expressions to extract values out of JSON strings. To extract a scalar value you use the
SELECT json_extract_scalar(requestparameters, '$.encryptionAlgorithm') AS encryption_algorithm, COUNT(*) AS "count" FROM cloudtrail_events
SELECT json_extract_scalar(requestparameters, '$.encryptionContext["aws:lambda:FunctionArn"]') AS encryption_algorithm, COUNT(*) AS "count" FROM cloudtrail_events
In these cases the value extracted was a scalar, but you can extract arrays and objects too, using
json_extract. The type of the result of this function is JSON.
The JSON type and functions are useful for working with arbitrary structures and columns where the schema is different from row to row. I’ve also come across situations where casting to the JSON type can be a way to get around tricky situations where types don’t exactly match. It’s sort of a wildcard type, and can both be used and abused.
Hide the sausage making with views
When your queries become really complicated due to long expressions that extract values from deeply nested structures, or multiple levels of aggregation and flattening, it’s a good idea to create a view to hide all the messy sausage making. It’s going to me much easier for the people and code that will query the data set if it’s clean and neat.
Complex types in results
So far I’ve described how to work with complex types in the data, how you can work with them in queries, and now it’s time to discuss how to deal with them in the result of a query.
Athena stores query results as CSV files on S3. Regardless of whether you use the console, the API, or the JDBC driver, the results end up as CSV on S3.
CSV is not exactly known for it’s great support for complex types, so what do you do when you want to return an array or a map from a query?
Athena won’t stop you from having arrays and maps in the result, it will dutifully serialize these values into CSV – and make a proper mess out of things. It’s serialization format for lists and maps does not quote the elements, keys, or values, which means that it’s very easy to produce output that is ambiguous. If you see
"[hello, world]" in an Athena output file there is no way to tell if the value was an array with one element (“hello, world”) or two elements (“hello” and “world”). You can also not tell numbers and strings apart, and Athena’s query metadata also doesn’t contain that information, it only specifies if a column is an array or map, not the types of the elements, keys, or values.
Because of this you should never return raw arrays or maps from queries. It may appear to work for some test cases, but in general it is completely unsafe.
So what to do instead? The answer is is the
JSON type. The serialization format for that type is, you guessed it, JSON. JSON serialization is unambigious and retains enough type information for most situations. It’s not perfect, but it’s a lot better than the alternative.
What you do is that in any query that returns complex types, you wrap those expressions in
CAST(… AS JSON), for example:
SELECT article_id, CAST(array_agg(tag) AS JSON) AS tags FROM my_table
Athena’s result metadata will indicate that the
tags column is a string, and you will have to parse it in the code that reads the result data – but in contrast to returning a raw array you will be able to parse it!
You can model very elaborate complex types in Athena tables, just look at the CloudTrail schema, with it’s arrays of structs, and structs within structs. Even when properties are completely free form you won’t get stuck because there’s the JSON type and functions that let you unpack and work with them at query time.
When queries get too complicated you can create views to hide the details of how to extract and flatten the values from hierarchical and messy data models.
Even when you don’t have complex types in the source data you can benefit a lot from Athena’s support for complex types. Being able to aggregate rows into arrays can make things that have always been clunky in the relational model much more elegant.
The only problem with Athena and complex types is really how to deal with them in results, and that the consumers of the results may not expect or know how to deal with complex types. It’s unfortunate that the Athena team didn’t think through the consequences of using CSV as an output format when it comes to complex types, but luckily we can work around it using the JSON data type.