Snowflake external functions with AWS Lambda and Serverless framework
Background:
I'm a Data Engineer working at a small start-up in lovely Berlin (ok, in winter it's less nice, like most of northern Europe).
Almost 2 years ago I made one of my best decisions and moved our data-warehouse infrastructure to Snowflake (the only real cloud-based data warehouse). It was actually "built for the cloud", as they say.
Motivation:
One of the challenges I was facing was extracting the Adjust (our marketing attribution tool) device id, based on a field called AID (advertiser id) that lives in our backend system (and is loaded into our DWH).
Adjust provides an API endpoint that returns the basic device data: https://help.adjust.com/en/article/device-api
What are we doing here?
I'll guide you step by step (6 steps in total) through using the Serverless CLI and AWS Lambda + API Gateway to create an external function in Snowflake.
Prerequisite:
- AWS account with admin permissions.
- Serverless CLI installed as explained here.
- Snowflake account with ACCOUNTADMIN role.
Possible solutions:
- A very simple, though tedious and inefficient, solution was to write a Python script that reaches our backend, gets all users' AIDs, calls the API, writes a csv file as output and then loads the file to Snowflake. (This is all done outside Snowflake, which is less secure and much less convenient to use.)
- Use the great feature (which I found out about not long ago) called "external functions" in Snowflake, which allows calling AWS Lambda functions from your SQL and returning the response from Lambda as a column, as part of the SQL result. Something like this:
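For illustration only (the table, column and function names below are made up; we will only create the real function in step 6), such a query could look roughly like this:
select
    aid,
    get_adjust_device(aid) as adjust_device_id  -- external function call, handled by Snowflake via API Gateway + Lambda
from backend_users;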
Looks much better, right? So I've chosen this solution (obviously!).
How (do we create an external function in Snowflake):
That's the obvious question for any technical solution: I know how it works in theory, but how do I make it work in practice?
So I started with a very nice tutorial by Snowflake: https://www.youtube.com/watch?v=qangh4oM_zs
Everything worked fine and as expected, but! I don't and won't do anything manually on AWS (ok, maybe only creating IAM roles or other very simple, specific things).
Although DevOps is not my main role, when you work at a startup you have to learn some DevOps to make things work smoothly.
Therefore I’ve decided to go with Serverless, which is a framework that helps you define and deploy serverless infrastructure.
Step 1 — Serverless definition:
The Serverless definition is done using a yaml file:
** make sure to replace the information in the angle brackets <>
# serverless.yml
service: snowflake-function-<your snowflake function name>

plugins:
  - serverless-python-requirements
  - serverless-offline

custom:
  stage: ${opt:stage, self:provider.stage}
  settings:
    dev:
      STAGE: ${self:custom.stage}
    prod:
      STAGE: ${self:custom.stage}
  pythonRequirements:
    dockerizePip: non-linux
  resourcePolicies:
    dev:
      - Effect: Allow
        Principal: "*"
        Action: execute-api:Invoke
        Resource:
          - execute-api:/*/*/*
    prod:
      - Effect: Allow
        Principal: "*"
        Action: execute-api:Invoke
        Resource:
          - execute-api:/*/*/*

provider:
  name: aws
  region: <your AWS region>
  environment: ${self:custom.settings.${self:custom.stage}}
  profile: <your AWS profile>
  runtime: python3.8
  timeout: 30
  role: '<an AWS IAM role that has the relevant permissions>'
  resourcePolicy: ${self:custom.resourcePolicies.${opt:stage, self:provider.stage}}

functions:
  <your function name>:
    handler: lambda_function.lambda_handler
    events:
      - http:
          path: /
          method: post
          cors: true
          authorizer:
            type: aws_iam
Step 2 — The Lambda function code:
(taken from the Snowflake documentation)
import json

def lambda_handler(event, context):
    # 200 is the HTTP status code for "ok".
    status_code = 200

    # The return value will contain an array of arrays (one inner array per input row).
    array_of_rows_to_return = [ ]

    try:
        # From the input parameter named "event", get the body, which contains
        # the input rows.
        event_body = event["body"]

        # Convert the input from a JSON string into a JSON object.
        payload = json.loads(event_body)
        # This is basically an array of arrays. The inner array contains the
        # row number, and a value for each parameter passed to the function.
        rows = payload["data"]

        # For each input row in the JSON object...
        for row in rows:
            # Read the input row number (the output row number will be the same).
            row_number = row[0]

            # Read the first input parameter's value. For example, this can be a
            # numeric value or a string, or it can be a compound value such as
            # a JSON structure.
            input_value_1 = row[1]

            # Read the second input parameter's value.
            input_value_2 = row[2]

            # Compose the output based on the input. This simple example
            # merely echoes the input by collecting the values into an array that
            # will be treated as a single VARIANT value.
            output_value = ["Echoing inputs:", input_value_1, input_value_2]

            # Put the returned row number and the returned value into an array.
            row_to_return = [row_number, output_value]

            # ... and add that array to the main array.
            array_of_rows_to_return.append(row_to_return)

        json_compatible_string_to_return = json.dumps({"data" : array_of_rows_to_return})

    except Exception as err:
        # 400 implies some type of error.
        status_code = 400
        # Tell caller what this function could not handle.
        json_compatible_string_to_return = event_body

    # Return the return value and HTTP status code.
    return {
        'statusCode': status_code,
        'body': json_compatible_string_to_return
    }
Things to notice in the Lambda function:
- input_value_1 - the first Snowflake external function argument.
- input_value_2 - the second Snowflake external function argument.
You can add or remove parameters based on your needs.
Deploy the Lambda function to AWS by typing:
serverless deploy --stage dev (or prod)
When the deployment is done, you'll see the API Gateway URL in the terminal. Keep it for later.
Step 3 — Create a role in AWS IAM:
Create an IAM role called "Cross account" that gives permissions to another account to operate in yours (basically giving your Snowflake instance the option to execute your Lambda function, using a trust relationship that we will add later).
In the Account ID field put your own account ID (this is a bit confusing, it will be cleared up later on).
When the role is created, copy the Role ARN and keep it for later use.
Step 4 — Create a Snowflake API integration:
**make sure to run all the commands in Snowflake with an ACCOUNTADMIN role (or with a role that has full permissions for creating integrations and external functions)
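For example, assuming your user has been granted ACCOUNTADMIN, you can switch roles before running the statements below:
use role accountadmin;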
create or replace api integration my_api_integration_01
api_provider = aws_api_gateway
api_aws_role_arn = '<AWS role ARN from step 3>'
enabled = true
api_allowed_prefixes = ('<api gateway url from step 1/2>')
;
Now we need to get 2 properties that will help us create the "trust relationship" between our AWS account and our Snowflake account.
Execute the following (in Snowflake):
describe integration my_api_integration_01
The result would look as follows:
Keep “API_AWS_IAM_USER_ARN” and “API_AWS_EXTERNAL_ID” for later
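If you prefer to grab just these two values with SQL instead of copying them from the result grid, something like this should work, as long as you run it right after the describe statement (result_scan reads the previous query's result):
select "property", "property_value"
from table(result_scan(last_query_id()))
where "property" in ('API_AWS_IAM_USER_ARN', 'API_AWS_EXTERNAL_ID');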
Step 5 — Create the "trust relationship" in AWS IAM:
Search for the role you’ve created earlier and click on the “Trust relationships” tab, then click on “edit trust relationship”.
Paste the following into the policy document text box:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "<API_AWS_IAM_USER_ARN>"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": [
            "<API_AWS_EXTERNAL_ID>"
          ]
        }
      }
    }
  ]
}
** replace the values in the angle brackets with the ones from step 4
Step 6 — Create the external function in Snowflake:
Execute the following statement in Snowflake:
create external function my_external_function(n integer, v varchar)
returns variant
api_integration = <api_integration_name>
as '<api gateway url from step 1/2>'
;
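Since everything so far was run as ACCOUNTADMIN, other roles may not be able to call the function yet (depending on your privilege setup). Assuming you want, say, an analyst role to use it (the role name below is just a placeholder), you can grant usage on it:
grant usage on function my_external_function(integer, varchar) to role <your analyst role>;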
How to use:
Once everything is in place, you can run the following:
select my_external_function(99, 'Luftballoons');
The result is a VARIANT field that you can parse.
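For example, since the sample Lambda returns an array per row, you could pull individual elements out of the VARIANT with bracket notation and casts (a quick sketch):
select f[0]::varchar, f[1]::integer, f[2]::varchar
from (select my_external_function(99, 'Luftballoons') as f);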
Conclusions:
- The process of creating and understanding the relationships between all the moving parts is not simple using the documentation alone, especially if you don't want to create everything manually in AWS.
- Using Serverless saves a lot of time.
- I love Snowflake and the flexibility, full control and transparency over what's happening.
Please comment here if you have questions and I'll do my best to respond.