How We Migrated to Apache Iceberg Using Athena, Trino, and Spark

Sharon Lavie
6 min read · Feb 24, 2025


In the previous story, I wrote about the decision-making process and why we migrated from Redshift to Iceberg.

This time I’m going to describe the migration itself.

The migration in numbers: a team of 4 people (1 data engineer, 2 analytics engineers, and 1 team lead) migrating ~70TB of data and roughly 600 SQL files.

Main principles

  1. Minimize data availability disruption
  2. Ensure data consistency and comparison between the old and new systems
  3. Automate processes wherever possible

What did we have to migrate?

  1. The data itself (~70TB on Redshift)
  2. Airflow DAGs (while replacing some of them with Airbyte)
  3. Data transformations, done mainly with SQL statements (executed by Airflow)

Infrastructure

  1. Airflow (on K8s): To run both systems in parallel, we deployed a second Airflow instance on Kubernetes.
  2. Trino (on K8s): Initially not part of the plan, Trino was added midway through the migration as we discovered that certain tables would benefit significantly from sorting.
  3. Spark (Using AWS Glue Jobs): Similar to Trino, Spark was introduced later when we encountered limitations with Athena while running complex, resource-heavy queries.
  4. Airbyte (on K8s): We also integrated Airbyte to streamline data ingestion and reduce reliance on numerous scripts and DAGs pulling data from various Postgres tables and APIs.
  5. Glue as the Data Catalog: In the next story, we’ll explore data catalogs for Apache Iceberg and why we chose Glue as our Data Catalog.
  6. S3: Given that Iceberg requires an object storage platform, we chose S3 as it aligns with our AWS infrastructure.

Data Migration

Our first step was to identify the Redshift tables that needed migration. We compiled a list of these tables and converted them into Iceberg tables using Athena. While doing this, we ensured that essential metadata (such as Distribution and Sort Keys), column names, and data types were preserved.

To automate this process, we developed a Python script (a simplified sketch follows the list) that:

  1. Iterates through all the tables to be migrated (validated against Airflow and Looker code to avoid dropping essential tables).
  2. Unloads data from Redshift to S3.
  3. Creates a temporary Athena table.
  4. Creates an Iceberg table.
  5. Merges data from the temporary Athena table into the Iceberg table.
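
Stripped down, the script looked roughly like the sketch below. Everything identifying is a placeholder: the bucket names, IAM role, schemas ("tmp", "analytics"), and the "id" merge key are illustrative, and the real script generated the column DDL from the Redshift catalog rather than taking it as an argument.

```python
import time

import boto3

# Assumed setup: boto3's Athena client plus an open Redshift cursor
# (psycopg2 or redshift_connector). All names below are placeholders.
athena = boto3.client("athena")
RESULTS = "s3://my-athena-results/"


def run_athena(sql: str, database: str) -> None:
    """Submit a statement to Athena and block until it completes."""
    qid = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": RESULTS},
    )["QueryExecutionId"]
    while True:
        status = athena.get_query_execution(QueryExecutionId=qid)
        state = status["QueryExecution"]["Status"]["State"]
        if state not in ("QUEUED", "RUNNING"):
            break
        time.sleep(5)
    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query {qid} ended in state {state}")


def migrate_table(redshift_cursor, table: str, columns_ddl: str) -> None:
    """Steps 2-5 for one table; step 1 is simply a loop over this function."""
    # 2. Unload the Redshift data to S3 as Parquet.
    redshift_cursor.execute(
        f"UNLOAD ('SELECT * FROM {table}') "
        f"TO 's3://my-migration-bucket/unload/{table}/' "
        f"IAM_ROLE 'arn:aws:iam::111122223333:role/redshift-unload' "
        f"FORMAT AS PARQUET"
    )
    # 3. Create a temporary Athena table over the unloaded files.
    run_athena(
        f"CREATE EXTERNAL TABLE tmp_{table} ({columns_ddl}) "
        f"STORED AS PARQUET "
        f"LOCATION 's3://my-migration-bucket/unload/{table}/'",
        database="tmp",
    )
    # 4. Create the target Iceberg table with the same schema.
    run_athena(
        f"CREATE TABLE IF NOT EXISTS {table} ({columns_ddl}) "
        f"LOCATION 's3://my-lake/{table}/' "
        f"TBLPROPERTIES ('table_type' = 'ICEBERG')",
        database="analytics",
    )
    # 5. Merge the temporary table into the Iceberg table. Naive DDL parsing:
    #    breaks on types like decimal(10,2); fine for a sketch.
    columns = [c.strip().split()[0] for c in columns_ddl.split(",")]
    insert_cols = ", ".join(columns)
    insert_vals = ", ".join(f"s.{c}" for c in columns)
    run_athena(
        f"MERGE INTO analytics.{table} t USING tmp.tmp_{table} s "
        f"ON t.id = s.id "
        f"WHEN NOT MATCHED THEN INSERT ({insert_cols}) VALUES ({insert_vals})",
        database="analytics",
    )
```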

Airflow DAGs Migration

For migrating data pipelines, we followed three core principles:

  1. Use Athena whenever possible to optimize query execution and reduce costs.
  2. Use Trino when Athena doesn't support certain table features (mainly sorting), or when a statement would open more than 100 partitions while writing (more details in the AWS documentation).
  3. Use Spark when Athena cannot run the required DML statements.

This process was mostly manual. We created a basic DAG template that included the necessary Airflow operators, such as AthenaOperator, and defined essential variables; a stripped-down version is sketched below.
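
This sketch uses the Amazon provider's AthenaOperator; the DAG id, schedule, and S3 locations are placeholders, not our production values.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.athena import AthenaOperator

# In the real template these came from Airflow Variables.
DATABASE = "analytics"
OUTPUT_LOCATION = "s3://my-athena-results/"

with DAG(
    dag_id="daily_sales",  # hypothetical pipeline name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    transform = AthenaOperator(
        task_id="transform_daily_sales",
        query="sql/daily_sales.sql",  # .sql files are rendered as templates
        database=DATABASE,
        output_location=OUTPUT_LOCATION,
    )
```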

Additionally, we developed a custom AthenaOperator capable of running multiple queries from a single SQL file. This functionality was originally available for Redshift, and we wanted to maintain it without breaking our existing SQL files into multiple smaller ones.
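We haven't open-sourced that operator, but the core idea is simple enough to sketch: render the SQL file, split it on semicolons, and run each statement through AthenaHook in order. This is an illustrative reimplementation, not our exact code (and the naive split breaks if a statement embeds a literal semicolon):

```python
from airflow.models import BaseOperator
from airflow.providers.amazon.aws.hooks.athena import AthenaHook


class MultiQueryAthenaOperator(BaseOperator):
    """Run every statement in a single SQL file, one after another."""

    template_fields = ("sql",)
    template_ext = (".sql",)  # lets Airflow render the file's contents

    def __init__(self, *, sql, database, output_location,
                 aws_conn_id="aws_default", **kwargs):
        super().__init__(**kwargs)
        self.sql = sql
        self.database = database
        self.output_location = output_location
        self.aws_conn_id = aws_conn_id

    def execute(self, context):
        hook = AthenaHook(aws_conn_id=self.aws_conn_id)
        statements = [s.strip() for s in self.sql.split(";") if s.strip()]
        for statement in statements:
            query_id = hook.run_query(
                query=statement,
                query_context={"Database": self.database},
                result_configuration={"OutputLocation": self.output_location},
            )
            state = hook.poll_query_status(query_id)
            if state != "SUCCEEDED":
                raise RuntimeError(f"Statement failed ({state}): {statement[:100]}")
```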

SQL Migration

A significant part of the migration involved converting approximately 600 SQL files from Redshift to Athena. We initially attempted to automate this process using various AI tools and SQLGlot. However, both approaches had limitations that resulted in inconsistencies and errors. As a result, we had to adopt a hybrid approach, combining automation with manual adjustments to ensure accuracy.
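For the automated part, SQLGlot's transpile call did the heavy lifting on straightforward statements. A minimal example (recent SQLGlot versions ship an "athena" dialect; older ones can target "trino" instead):

```python
import sqlglot

# A Redshift-ism: GETDATE() is not valid in Athena's Trino-based engine.
redshift_sql = "SELECT user_id, GETDATE() AS loaded_at FROM events"

athena_sql = sqlglot.transpile(redshift_sql, read="redshift", write="athena")[0]
print(athena_sql)  # GETDATE() is rewritten to a portable equivalent
```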

The Process

As part of our strategy, we aimed to have two Airflow instances running in parallel by the end of the migration.

Once a DAG migration was completed, we re-migrated the relevant tables from Redshift to ensure we had the most up-to-date data. We then executed the migrated Airflow DAG alongside its Redshift counterpart to verify correctness and stability.

By the end of the process, we maintained two synchronized copies of the data — one in Redshift and one in Iceberg — both refreshed daily to ensure consistency.

Looker

Looker is our primary data visualization tool, and its migration was relatively seamless.

Setting up a new database connection to Athena was straightforward. However, we encountered a limitation: Looker required AWS Secret and Access Keys instead of allowing AssumeRole, which would have been the preferred approach.

To migrate, we followed these steps:

  1. Created a new GitHub branch in Looker based on our main branch.
  2. Updated the database connection to point to Athena.
  3. Verified that 90% of our existing Looks and Dashboards functioned correctly.
  4. Addressed the remaining 10% of issues, which primarily stemmed from how Looker handles timestamps in Athena.

This step had to be handled quickly: we had decided on a "code freeze" (barring anything truly urgent) so that we wouldn't have to make changes on both branches.

Validation

To ensure data integrity, we implemented two types of validation:

  1. Database Table-Level Validation: We migrated a sample of each table (typically a few days' worth of data) to a temporary schema in Athena. We then performed a direct comparison between the temporary table and the final destination table to verify consistency (a sketch of the comparison follows this list).
  2. Looker Dashboard Validation: We compared the final results in Looker dashboards to ensure the numbers aligned correctly with the original data.
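
In practice the table-level comparison meant running the same aggregates against both tables and diffing the output. A minimal sketch using awswrangler (the AWS SDK for pandas); the table names, date range, and checksum columns are illustrative:

```python
import awswrangler as wr  # handles Athena polling and result parsing for us

CHECKS = """
    SELECT COUNT(*)                AS row_count,
           SUM(amount)             AS amount_sum,    -- illustrative columns
           COUNT(DISTINCT user_id) AS distinct_users
    FROM {table}
    WHERE event_date BETWEEN DATE '2025-01-01' AND DATE '2025-01-07'
"""

sample = wr.athena.read_sql_query(
    CHECKS.format(table="tmp_validation.orders"), database="tmp_validation")
target = wr.athena.read_sql_query(
    CHECKS.format(table="analytics.orders"), database="analytics")

# Any drift in the aggregates points at a pipeline bug to chase down.
if not sample.equals(target):
    raise ValueError(f"Validation mismatch:\n{sample}\nvs\n{target}")
```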

This validation process was crucial in identifying and resolving several data pipeline issues before fully transitioning to the new system.

Challenges

Athena

Athena performed well overall but had some unexpected limitations:

  1. Resource Exhaustion: At times, queries failed with errors such as "Query exhausted resources at this scale factor" when they exceeded Athena's capacity.
  2. Partition Limits: Athena enforces a restriction of 100 open partitions, often leading to HIVE_TOO_MANY_OPEN_PARTITIONS errors.

Spark

Our team had limited experience with Spark, which added complexity when we realized Athena could not handle all required DML statements. While Spark proved useful, debugging and troubleshooting were more challenging due to this learning curve.

Airbyte

One major challenge was the lack of support for Iceberg tables when using Glue as the catalog and Athena as the query engine. To address this, we developed a custom Airbyte destination. While it functions, it still has some limitations. We are currently awaiting an official Airbyte destination that meets production standards so we can replace our custom implementation.

Lessons Learned: What I Would Have Done Differently

Airbyte

While Airbyte is a powerful tool, introducing it during the migration added complexity, especially given its lack of support for Iceberg with Glue as the catalog. In hindsight, it would have been better to postpone its adoption until after the migration was completed.

Apache Iceberg: Table Optimization

A deeper understanding of Iceberg’s table optimization features — especially regarding compaction — would have been beneficial. Leaving compaction as a post-migration task resulted in additional work that could have been mitigated with better planning upfront.
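
For reference, Athena exposes Iceberg compaction and snapshot cleanup as plain SQL statements, so this work is scriptable. A sketch via awswrangler (the table name and predicate are placeholders):

```python
import awswrangler as wr

# Bin-pack small files one partition at a time to stay within
# Athena's per-query limits.
wr.athena.start_query_execution(
    sql="OPTIMIZE analytics.orders REWRITE DATA USING BIN_PACK "
        "WHERE event_date = DATE '2025-01-01'",
    database="analytics",
    wait=True,
)

# Expire old snapshots and delete files no longer referenced by the table.
wr.athena.start_query_execution(
    sql="VACUUM analytics.orders", database="analytics", wait=True)
```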

Athena

Discovering Athena’s limitations earlier in the process would have allowed us to proactively incorporate alternative solutions like Spark and Trino from the start. This would have reduced stress (and sleepless nights) and minimized disruptions during the migration.

Final Thoughts

In the end, the migration went smoothly, and we were pleasantly surprised by how quickly things progressed. We successfully adhered to our three core principles: there was no disruption to data availability; both systems ran concurrently, enabling effective comparison; and we automated where it paid off, accepting that automating more might have extended the timeline.

Looking Ahead

In the next and final post in this series, we’ll dive into the technical challenges we faced, focusing on Apache Iceberg’s table compaction and optimization strategies. We’ll also cover the key learnings that helped streamline our migration and performance improvements. Stay tuned for more insights!
