Using AWS temporary credentials with Hadoop S3 Connector

BQ Qiu
4 min read · Mar 17, 2021


A few days ago I was trying to get a Spark app to access another AWS account by calling the AWS STS AssumeRole API. Let’s say there are environments A and B. The app resides in environment B and wants to get access to AWS S3 data in environment A.
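Concretely, the cross-account call looks something like the following boto3 sketch (the role ARN and session name are placeholders, not the actual values from my setup):

import boto3

# Minimal sketch: assume the cross-account role in environment A from the app
# running in environment B. The role ARN and session name are placeholders.
sts = boto3.client("sts")
response = sts.assume_role(
    RoleArn="arn:aws:iam::<ENV_A_ACCOUNT_ID>:role/cross-account-s3-reader",
    RoleSessionName="spark-app-session",
)
credentials = response["Credentials"]  # AccessKeyId, SecretAccessKey, SessionToken, Expiration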

To get this working, I had set up the following:

1. An IAM role with the necessary permissions in environment A.

2. The necessary S3 bucket policies in environment A.

3. A trust relationship allowing the original IAM profile of the app to assume the role in (1) (a sketch of this follows below).

4. The necessary permissions for the EKS pods in which the Spark workers reside, as we deploy Spark on AWS EKS (Elastic Kubernetes Service).

Here is a good guide to the steps to follow.
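For step (3), the trust policy on the environment-A role looks roughly like this sketch (account IDs and role names are made up for illustration; update_assume_role_policy is just one way to attach it):

import json
import boto3

# Hypothetical trust policy letting the app's IAM principal in environment B
# assume the role created in environment A. All names and IDs are placeholders.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::<ENV_B_ACCOUNT_ID>:role/spark-app-role"},
            "Action": "sts:AssumeRole",
        }
    ],
}

iam = boto3.client("iam")
iam.update_assume_role_policy(
    RoleName="cross-account-s3-reader",
    PolicyDocument=json.dumps(trust_policy),
)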

However, I was still getting access denied errors when trying to access data with Spark, for example using spark.read.parquet. Something like this:

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: java.nio.file.AccessDeniedException: <file_path>: getFileStatus on <file_path>: com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden; Request ID: xxxxxxx; S3 Extended Request ID: xxxxxxx
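For concreteness, the kind of read that triggered the error above is sketched below (the bucket and prefix are placeholders):

from pyspark.sql import SparkSession

# Hypothetical reproduction of the failing call; the S3 path is a placeholder.
spark = SparkSession.builder.appName("cross-account-read").getOrCreate()
df = spark.read.parquet("s3a://env-a-bucket/some/prefix/")  # raises the 403 above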

I tried calling aws s3 ls and aws s3 cp in the app, outside of the Spark job, and the data could be accessed, confirming my suspicion that access was being denied to Spark only.

Then I exec'd into a running Kubernetes pod, following the instructions here, and found that I could still access the data from inside the pod. Since the Spark workers had no access issues, the problem had to be with the Spark driver.

This brought me to the Hadoop S3A client, which we use to give Spark high-performance I/O against S3. For authentication, the documentation has this to say:

By default, the S3A client follows the following authentication chain:

1. The options fs.s3a.access.key, fs.s3a.secret.key and fs.s3a.session.token are looked for in the Hadoop XML configuration/Hadoop credential providers, returning a set of session credentials if all three are defined.

2. The fs.s3a.access.key and fs.s3a.secret.key are looked for in the Hadoop XML configuration/Hadoop credential providers, returning a set of long-lived credentials if they are defined.

3. The AWS environment variables are then looked for: these will return session or full credentials depending on which values are set.

4. An attempt is made to query the Amazon EC2 Instance Metadata Service to retrieve credentials published to EC2 VMs.
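As a side note, option 1 of that chain can be supplied from PySpark roughly as follows (a sketch, assuming a Hadoop version whose chain honours fs.s3a.session.token; the credential values are placeholders):

# Sketch: set session credentials directly on the Hadoop configuration of an
# existing SparkSession (option 1 of the authentication chain above).
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", "<ACCESS_KEY_ID>")
hadoop_conf.set("fs.s3a.secret.key", "<SECRET_ACCESS_KEY>")
hadoop_conf.set("fs.s3a.session.token", "<SESSION_TOKEN>")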

I checked, and there were no keys defined in the Hadoop XML configuration. We had been using option 3, AWS environment variables, for authentication: after calling AssumeRole, we set three environment variables, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and AWS_SESSION_TOKEN. Following the authentication chain, Hadoop should reach step 3 and be able to use those environment variables as credentials.
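In code, that step is roughly this (a minimal sketch, assuming credentials is the "Credentials" dict returned by the AssumeRole call sketched earlier):

import os

# Export the temporary credentials as the three AWS environment variables so
# that downstream AWS clients (and, in theory, Hadoop) can pick them up.
os.environ["AWS_ACCESS_KEY_ID"] = credentials["AccessKeyId"]
os.environ["AWS_SECRET_ACCESS_KEY"] = credentials["SecretAccessKey"]
os.environ["AWS_SESSION_TOKEN"] = credentials["SessionToken"]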

What was missing? After many searches, I stumbled on this guide, which points out the need to set fs.s3a.aws.credentials.provider to com.amazonaws.auth.DefaultAWSCredentialsProviderChain so that Hadoop will work with AWS credentials set in a credentials file, i.e. in ~/.aws/config and ~/.aws/credentials. Note that this is NOT the same as what I was trying to do. However, I noticed this setting was missing from our Spark config, and after adding it, the problem was resolved.
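For reference, the fix can be applied roughly like this (a sketch, assuming the setting is supplied at SparkSession build time via the spark.hadoop. prefix; the rest of the builder configuration is omitted):

from pyspark.sql import SparkSession

# Point the S3A connector at the AWS SDK's default credentials provider chain.
spark = (
    SparkSession.builder
    .appName("cross-account-read")
    .config(
        "spark.hadoop.fs.s3a.aws.credentials.provider",
        "com.amazonaws.auth.DefaultAWSCredentialsProviderChain",
    )
    .getOrCreate()
)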

The fs.s3a.aws.credentials.provider setting is documented as follows:

If unspecified, then the default list of credential provider classes, queried in sequence, is:

1. org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider: Uses the values of fs.s3a.access.key and fs.s3a.secret.key.

2. com.amazonaws.auth.EnvironmentVariableCredentialsProvider: supports configuration of AWS access key ID and secret access key in environment variables named AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, as documented in the AWS SDK.

3. com.amazonaws.auth.InstanceProfileCredentialsProvider: supports use of instance profile credentials if running in an EC2 VM.

Looking up the documentation and code for com.amazonaws.auth.DefaultAWSCredentialsProviderChain, I see that it chains together several AWS authentication methods, the first of which is EnvironmentVariableCredentialsProvider, which checks the AWS environment variables (documentation here). In particular, if AWS_SESSION_TOKEN is specified, temporary credentials will be used, which is exactly what we want.

On paper, this is the same as option 2 in the default list from the Hadoop documentation, which also names com.amazonaws.auth.EnvironmentVariableCredentialsProvider when the credentials provider is unspecified. However, the outcome is different.

Tracing the Hadoop code path for the case where no AWS credentials provider is specified, I got to the EnvironmentVariableCredentialsProvider implementation and found that it only checks for secretId and secretKey, not the session token. This may be the crux of the problem: we need the environment variable credentials provider to also pick up the session token, because temporary credentials without the correct session token are invalid.

However, what I discovered differs from what is stated in the documentation, namely that com.amazonaws.auth.EnvironmentVariableCredentialsProvider will be used. It is possible I am mistaken about the cause of the issue; I will update this post if I find that to be the case.

P.S. This is the first of a series of notes I intend to publish once in a while, adding to the online store of knowledge for software engineers around the world in the hope of reducing time spent troubleshooting and debugging. I have personally found such guides hugely beneficial in my own work, and I want to contribute back as well.
