Securely Loading a CSV File from S3 to Snowflake Table Using AWS IAM Role and External ID

Introduction

In the world of big data, a common task is loading CSV files from AWS S3 storage into Snowflake tables. These CSV files can vary in size from a few kilobytes to several gigabytes, and in some cases they may be encrypted to protect the data. In this article, we will focus on securely loading a CSV file from AWS S3 into a Snowflake table using an AWS Identity and Access Management (IAM) role and an External ID.

AWS S3 and Snowflake: An Overview

AWS S3 (Simple Storage Service) is a widely used scalable object storage service from Amazon Web Services (AWS). It enables users to store and retrieve any amount of data at any time from around the globe.

On the other hand, Snowflake is a popular cloud-based data warehousing platform that offers data storage, processing, and analytical solutions. Snowflake's unique architecture allows for easy scaling, high performance, and data-sharing capabilities.

AWS IAM Role and External ID: Ensuring Security

AWS IAM (Identity and Access Management) roles are a secure way to grant permissions to entities that you trust, such as AWS services, users in other AWS accounts, or applications, so that they can carry out actions on your behalf.

The External ID is an AWS feature used in third-party scenarios where you need to grant cross-account access. It protects against the "confused deputy" problem by ensuring that the role can only be assumed by the specific third party (in this case, Snowflake) that supplies the agreed-upon External ID.

Steps to Load CSV File from S3 to Snowflake

Step 1: Create a storage integration - a Snowflake object that provides a secure and easy way to access external cloud storage, such as Amazon S3.

Note: All substitution variables inside {} need to be replaced with the specific names from your Snowflake and AWS accounts. Using the USE command, select the database and schema you want to work in, then create the storage integration as follows:

USE DATABASE {DATABASE_NAME};
USE SCHEMA {SCHEMA_NAME};

CREATE OR REPLACE STORAGE INTEGRATION {storage-integration-name}
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'S3'
  ENABLED = TRUE
  STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::{account-id}:role/{AWS-IAM-rolename}'
  STORAGE_ALLOWED_LOCATIONS = ('s3://{s3-bucket-name}');

Step 2: Create an IAM policy and IAM Role

First, create an IAM policy that grants the necessary permissions on the S3 bucket, and attach it to an IAM role. Ensure that the trust policy of the IAM role references the IAM user ARN that Snowflake uses for your account and the external ID provided by Snowflake.
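Snowflake generates these two values when the storage integration is created. Assuming the integration from Step 1 exists, you can retrieve them with a DESCRIBE command and note the STORAGE_AWS_IAM_USER_ARN and STORAGE_AWS_EXTERNAL_ID properties in the output:

-- Retrieve the IAM user ARN and external ID that Snowflake generated for this integration
DESC INTEGRATION {storage-integration-name};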

-- IAM policy granting the role access to the S3 bucket
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:GetObjectVersion",
                "s3:DeleteObject",
                "s3:DeleteObjectVersion"
            ],
            "Resource": "arn:aws:s3:::{s3bucketname}/*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:GetBucketLocation"
            ],
            "Resource": "arn:aws:s3:::{s3bucketname}",
            "Condition": {
                "StringLike": {
                    "s3:prefix": [
                        "*"
                    ]
                }
            }
        }
    ]
}
-- Trust policy for the IAM role named {AWS-IAM-rolename}, referencing the Snowflake IAM user ARN and the Snowflake external ID
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "{STORAGE_AWS_IAM_USER_ARN FROM SNOWFLAKE STORAGE INTEGRATION}"
            },
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {
                    "sts:ExternalId": "{STORAGE_AWS_EXTERNAL_ID FROM SNOWFLAKE STORAGE INTEGRATION}"
                }
            }
        }
    ]
}

Step 3: Grant Access to Snowflake

Next, grant Snowflake access to the S3 bucket from within your Snowflake account. Create a Snowflake role with the privileges to use the storage integration, then create an external stage that references the S3 bucket through that integration.

--// Creating a Snowflake role with permissions to create stages and use the storage integration
-- Note: this is a Snowflake role, distinct from the AWS IAM role created in Step 2

CREATE OR REPLACE ROLE {snowflake-role-name};

GRANT CREATE STAGE ON SCHEMA {SCHEMA_NAME} TO ROLE {snowflake-role-name};

GRANT USAGE ON INTEGRATION {storage-integration-name} TO ROLE {snowflake-role-name};

--// Creating the external stage

CREATE STAGE aws_stage
  STORAGE_INTEGRATION = {storage-integration-name}
  URL = 's3://{s3bucketname}';
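Before loading any data, it is worth confirming that the integration and stage can actually reach the bucket. A quick sanity check, assuming the stage above was created successfully, is to list the stage contents:

-- List the files visible through the stage; an error here usually points to the IAM policy or trust relationship
LIST @aws_stage;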

Step 4: Create a sample Snowflake table and copy data into it from S3

Now you're all set to load the data into Snowflake. Use the COPY INTO <table> command to load the CSV file from the stage into the Snowflake table, and use the FILE_FORMAT option to specify that the data is in CSV format.

CREATE OR REPLACE TABLE {table_name}
(
    ID INT,
    name VARCHAR(30)
);

-- Assuming there is one CSV file in the S3 bucket with two fields named id and name, and the first row is a header.
-- If there are more files, use the PATTERN option to limit which CSV files are picked up.
COPY INTO {table_name}
  FROM @aws_stage
  -- PATTERN = '{pattern}'
  FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = ',' SKIP_HEADER = 1);
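If the same CSV layout is loaded repeatedly, a reusable named file format keeps the COPY statements shorter. A minimal sketch, assuming a format name of {csv-format-name} chosen here purely for illustration:

-- Reusable file format describing the CSV layout
CREATE OR REPLACE FILE FORMAT {csv-format-name}
  TYPE = CSV
  FIELD_DELIMITER = ','
  SKIP_HEADER = 1;

-- The COPY command can then reference the named format
COPY INTO {table_name}
  FROM @aws_stage
  FILE_FORMAT = (FORMAT_NAME = '{csv-format-name}');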

--// Selecting the data to confirm the load

SELECT * FROM {table_name};
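Beyond the simple SELECT, Snowflake also records load metadata. As an optional check (assuming the load ran within the last hour), the COPY_HISTORY table function shows which files were loaded and their status:

-- Review recent load activity for the table
SELECT *
FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(
  TABLE_NAME => '{table_name}',
  START_TIME => DATEADD(hour, -1, CURRENT_TIMESTAMP())));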

Conclusion

The process of loading a CSV file from AWS S3 into a Snowflake table can seem daunting due to the numerous security concerns involved. However, by leveraging the AWS IAM role and External ID, you can maintain the security of your data while efficiently performing the task.

One of the significant advantages of this process is that sensitive credentials such as AWS Access Key ID and Secret Access Key are not exposed anywhere in the code. This characteristic is essential because it prevents the exposure of credentials that could potentially be misused if they fell into the wrong hands. Thus, this solution ensures the secure handling of data without compromising functionality.

Remember, while this article offers a general guide, always tailor these steps to the specific security requirements and policies of your organization. By following these steps, you can achieve a robust, efficient, and secure data pipeline from AWS S3 to Snowflake.