Redshift vs. Athena vs. EMR - AWS’ big data solutions explained

AWS offers a range of big data solutions, but for most teams, the choice comes down to just three: Amazon Redshift, Amazon Athena and Amazon EMR.

But which one’s the right fit?

Luckily, these three services differ enough that there’s a clear winner for most use cases. And if you keep reading, you’ll find out which tool’s the one for the job.

1. Amazon Redshift

What is Amazon Redshift/Amazon Redshift Serverless?

Amazon Redshift is a fully managed cloud data warehouse that runs on provisioned clusters, while Amazon Redshift Serverless offers the same warehouse without any clusters to size or manage.

For this reason, Amazon Redshift can be used to store and analyse data on provisioned clusters or, with Redshift Serverless, to analyse data (including data stored in an S3 bucket) without managing any infrastructure.

You can think of Redshift as the provisioned or serverless option for performing complex queries on structured data.

What kinds of data queries can you run with Redshift/Redshift Serverless?

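Amazon Redshift supports SQL queries, using a dialect based on PostgreSQL. It is suited to structured data and high-performance queries. Redshift achieves this performance in part through columnar storage: values from the same column are stored together, so a query reads only the columns it needs.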
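To see why columnar storage helps analytic queries, here is a toy Python sketch. It is not how Redshift is implemented; the sample rows and the helper name to_columns are made up for illustration. It only shows that summing a single column touches far fewer values when data is laid out by column rather than by row.

```python
# Toy illustration of row-oriented vs. column-oriented layouts.
# The data and helper names are hypothetical, chosen for this example only.

rows = [
    {"order_id": 1, "region": "EU", "amount": 120.0},
    {"order_id": 2, "region": "US", "amount": 80.0},
    {"order_id": 3, "region": "EU", "amount": 50.0},
]


def to_columns(records):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: a row store reads whole rows, so we count every field
# in every row as scanned, even though only "amount" is needed.
row_fields_scanned = sum(len(record) for record in rows)
row_total = sum(record["amount"] for record in rows)

# Column layout: only the values in the "amount" column are read.
column_values_scanned = len(columns["amount"])
column_total = sum(columns["amount"])

print(row_total, row_fields_scanned)        # same total, more values scanned
print(column_total, column_values_scanned)  # same total, fewer values scanned
```

Both layouts give the same answer; the difference is how much data has to be read to get there, which is what matters on large tables.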

When to use Redshift/Redshift Serverless

Because it’s suited to high-performance queries on structured data, Redshift might be used for business intelligence queries, where the data set is large and structured, as in the case of sales and customer data reports.

If the BI data analysis in the above example were routine and predictable, you might use Amazon Redshift.

But if there were a less defined process for analysing this large set of structured data, and you didn’t want to expend resources managing the Redshift data warehouse, you could store the data in S3 and have Redshift Serverless analyse it as and when needed.
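To make the provisioned/serverless split concrete, here is a minimal Python sketch of how a request to the Redshift Data API might differ between the two. The helper name build_statement_request and the example cluster, workgroup, database, user and table names are assumptions made for illustration. The parameter names (ClusterIdentifier, WorkgroupName, Database, DbUser, Sql) are those of the Data API's ExecuteStatement operation.

```python
def build_statement_request(sql, database, *, cluster_id=None,
                            db_user=None, workgroup=None):
    """Build keyword arguments for a Redshift Data API ExecuteStatement call.

    Pass cluster_id for a provisioned cluster, or workgroup for
    Redshift Serverless. This helper is hypothetical, not part of boto3.
    """
    if (cluster_id is None) == (workgroup is None):
        raise ValueError("pass exactly one of cluster_id or workgroup")

    request = {"Sql": sql, "Database": database}
    if workgroup is not None:
        # Serverless: the query is addressed to a workgroup, not a cluster.
        request["WorkgroupName"] = workgroup
    else:
        # Provisioned: the query is addressed to a named cluster.
        request["ClusterIdentifier"] = cluster_id
        if db_user is not None:
            request["DbUser"] = db_user
    return request


# Hypothetical names throughout: "sales", "analytics", "bi-cluster", etc.
REPORT_SQL = "SELECT region, SUM(amount) AS revenue FROM sales GROUP BY region"

provisioned_request = build_statement_request(
    REPORT_SQL, "analytics", cluster_id="bi-cluster", db_user="report_user"
)
serverless_request = build_statement_request(
    REPORT_SQL, "analytics", workgroup="adhoc-workgroup"
)

# With boto3 installed and AWS credentials configured, either request
# could then be submitted like this:
#   import boto3
#   client = boto3.client("redshift-data")
#   response = client.execute_statement(**serverless_request)
```

The SQL is identical in both cases; only the target (a cluster you manage versus a serverless workgroup) changes.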

2. Amazon Athena

What is Amazon Athena?

Amazon Athena is a serverless query service suited to ad hoc tasks on unstructured or semi-structured data, such as log, CSV and JSON files, stored in S3.

Unlike Redshift, Amazon Athena is serverless only.

You can think of Athena as the quick, low-cost option for unstructured data.

What kind of data queries can Athena run?

Athena runs SQL queries. Because it’s schema-on-read, Athena doesn’t require data to be loaded into a predefined schema before it can be analysed. Instead, a table definition is applied to the files in S3 as the query runs, making it ideal for ad hoc analysis.
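The schema-on-read idea can be sketched in a few lines of plain Python. This is a conceptual illustration only, not Athena's internals: the raw text, the SCHEMA list and the read_with_schema helper are all made up for the example. The point is that the stored data stays as raw text, and column names and types are applied only when it is read.

```python
import csv
import io

# Raw text as it might sit in storage: no column names, no types.
RAW_OBJECT = (
    "2024-01-01,GET,/home,200\n"
    "2024-01-01,GET,/missing,404\n"
    "2024-01-02,POST,/login,200\n"
)

# The schema lives apart from the data and is applied only at read time.
SCHEMA = [("day", str), ("method", str), ("path", str), ("status", int)]


def read_with_schema(raw_text, schema):
    """Yield one typed dict per line, applying the schema while reading."""
    for values in csv.reader(io.StringIO(raw_text)):
        yield {name: cast(value) for (name, cast), value in zip(schema, values)}


# An ad hoc "query": which paths returned an error status?
error_paths = [
    record["path"]
    for record in read_with_schema(RAW_OBJECT, SCHEMA)
    if record["status"] >= 400
]
print(error_paths)
```

Changing the schema means changing SCHEMA, not rewriting or reloading the stored data, which is what makes this approach convenient for exploratory work.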

When to use Athena

Athena can be used to quickly and cheaply run analysis on unstructured data.

For example, a team managing a web application may want to analyse its log data. This information usually exists in the team’s S3 bucket and isn’t structured or moved anywhere else.

With Athena, the team can quickly gain insights into patterns and performance without having to build a schema for their usually unattended logs or store the data anywhere else.
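As a rough sketch of what that log analysis might look like from Python, the function below submits a query through the Athena StartQueryExecution operation, as exposed by boto3. Several things here are assumptions for illustration: the function name start_log_query, the table name access_logs and its status column, and the database and bucket names in the usage comment. It also assumes a table has already been defined over the log files in the data catalog.

```python
def start_log_query(athena_client, database, output_location):
    """Submit a query counting requests per HTTP status code.

    athena_client is expected to behave like boto3.client("athena").
    The table "access_logs" and its "status" column are hypothetical.
    Returns the ID Athena assigns to the query execution.
    """
    sql = (
        "SELECT status, COUNT(*) AS requests "
        "FROM access_logs "
        "GROUP BY status "
        "ORDER BY requests DESC"
    )
    response = athena_client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        # Athena writes query results to an S3 location you choose.
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]


# With boto3 installed and AWS credentials configured, usage might be:
#   import boto3
#   query_id = start_log_query(
#       boto3.client("athena"),
#       database="weblogs",                                   # hypothetical
#       output_location="s3://example-bucket/athena-results/",  # hypothetical
#   )
```

The call returns straight away with an ID; the query itself runs asynchronously, and its results land in the S3 output location.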

3. Amazon EMR

What is Amazon EMR/Amazon EMR Serverless?

Like Redshift, Amazon EMR can be run as serverless or provisioned.

Unlike Redshift and Athena, however, Amazon EMR can run types of query and data transformation beyond SQL. EMR enables users to run open-source big data frameworks such as Apache Hadoop and Apache Spark on AWS.

What kind of data queries can Amazon EMR run?

Amazon EMR can process vast amounts of data using a variety of open-source frameworks.

These frameworks enable distributed data processing, which means you can use a cluster of VMs to process data in parallel. This makes EMR well suited to processing extremely large data sets.
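The pattern behind distributed processing (split the data, process each part in parallel, combine the partial results) can be sketched on a single machine. The Python example below uses a thread pool as a stand-in for the nodes of a cluster; it is not EMR, Hadoop or Spark code, and the function names and sample lines are made up for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_words(partition):
    """'Map' step: count words within a single partition of lines."""
    counts = Counter()
    for line in partition:
        counts.update(line.lower().split())
    return counts


def parallel_word_count(lines, workers=3):
    """Split lines into partitions, count each in parallel, then merge."""
    # Deal the lines out round-robin so each worker gets a share.
    partitions = [lines[i::workers] for i in range(workers)]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(count_words, partitions))

    # 'Reduce' step: merge the per-partition counts into one result.
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


sample_lines = ["great brand", "great service", "bad service"]
word_totals = parallel_word_count(sample_lines)
print(word_totals.most_common())
```

On a real cluster, the partitions would live on different machines and a framework such as Spark would handle the splitting, scheduling and merging for you.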

EMR is more suited to unstructured data and uses a schema-on-read approach similar to Athena.

When to use EMR

A company might use EMR to process huge swathes of social media data. For example, a global B2C brand might use EMR to interpret all interactions with, and mentions of, its brand, using NLP to extract customer sentiment from that data.

This large-scale data analysis is made possible by EMR’s distributed computing approach, using Apache Spark, for instance, or another open-source big data framework.

The takeaways

Amazon Redshift is the serverless or provisioned service for high-performance queries on structured data
Amazon Athena is the serverless service for ad hoc queries on unstructured data
Amazon EMR is the serverless or provisioned service for large-scale data queries and transformations using open-source frameworks

How we can help

We’re AWS Advanced Consulting and Well-Architected Review Partners – which means we can help out with any AWS solution.

We specialise in 24/7 support of your infrastructure and applications, as well as architecting new solutions and migrating or optimising existing workloads for cost, performance and more.

Plus, if your infrastructure or app is all about delivering great customer experiences – we’re AWS Digital Customer Experience Partners, too.

So if you need help with your AWS data, or with anything else, just get in touch.

To learn more about AWS, check out our pieces on EBS vs. EFS vs S3 and RDS vs. Redshift vs. DynamoDB vs. Aurora.

by Ned Hallett
