Unzipping S3 files back to S3 without uncompressing entire file (Streaming)
I work as a consultant for The College Board. The College Board uses AWS cloud technologies. It is a very interesting place to work, and I have learned a lot of new technologies there.
My task was to unzip a file in an S3 bucket and upload the result back to S3. Unfortunately, S3 does not have an unzip feature, so I decided to use a Lambda function to unzip the file and upload it back to S3. The problem with Lambda is its memory limit of 3008MB. A 500MB zipped file unzips to a 5GB file, so Lambda does not have enough memory to unzip it in one go. Another option was a dedicated EC2 instance that unzips files. That would solve the space issue, but the EC2 approach requires more maintenance. The decision was to process the file in chunks, so that Lambda only ever holds a small part of the file and does not run out of memory.
Here is the flow:
- User puts an object into an AWS S3 bucket
- S3 bucket triggers AWS Lambda
- AWS Lambda gets a chunk of the file
- AWS Lambda unzips that chunk of the file
- AWS Lambda uploads that part of the file
- AWS Lambda repeats the previous three steps until the entire file is unzipped and uploaded
- An email is sent after the file is unzipped
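For the second step of this flow, the Lambda function has to find out which object triggered it. The sketch below shows one way to read the bucket and key from the S3 event. The function name getUploadedObject is made up for this example, and it assumes the event carries a single record.

```javascript
// Pull the bucket name and object key out of an S3 "object created" event.
// Assumption: only the first record is of interest.
function getUploadedObject(event) {
  const record = event.Records[0];
  return {
    Bucket: record.s3.bucket.name,
    // S3 URL-encodes keys in event payloads and uses "+" for spaces.
    Key: decodeURIComponent(record.s3.object.key.replace(/\+/g, " ")),
  };
}
```

Inside the handler, the result already has the Bucket and Key fields that s3.getObject() expects, so it can serve as the params object used in Step 1.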
Step 1: Get file in chunks
The AWS SDK has a function that streams a file from S3. createReadStream() returns the object as a readable stream that delivers the file in chunks.
const AWS = require("aws-sdk");
const s3 = new AWS.S3();
const body = s3
  .getObject(params)
  .createReadStream(); // this will get the file in chunks
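To see what "in chunks" means in practice, the sketch below consumes a readable stream piece by piece and keeps only counters, never the whole body. It does not talk to S3: fakeObjectStream is a hypothetical in-memory stand-in for the stream returned by createReadStream(), and the sizes are arbitrary.

```javascript
const { Readable } = require("stream");

// Hypothetical stand-in for s3.getObject(params).createReadStream():
// emits `totalBytes` of data in pieces of at most `chunkSize` bytes.
function fakeObjectStream(totalBytes, chunkSize) {
  let sent = 0;
  return new Readable({
    read() {
      if (sent >= totalBytes) {
        this.push(null); // signal the end of the stream
        return;
      }
      const size = Math.min(chunkSize, totalBytes - sent);
      sent += size;
      this.push(Buffer.alloc(size));
    },
  });
}

// Read a stream chunk by chunk and report what was seen.
function measureStream(stream) {
  return new Promise((resolve, reject) => {
    let chunks = 0;
    let bytes = 0;
    let largest = 0;
    stream.on("data", (chunk) => {
      chunks += 1;
      bytes += chunk.length;
      largest = Math.max(largest, chunk.length);
    });
    stream.on("end", () => resolve({ chunks, bytes, largest }));
    stream.on("error", reject);
  });
}

measureStream(fakeObjectStream(1000000, 64 * 1024)).then((stats) => {
  console.log(stats); // total bytes seen, and the largest piece held at once
});
```

With a real S3 object the chunk sizes depend on the SDK and the network rather than on a chunkSize parameter, but the consuming side looks the same.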
Step 2: Unzip streamed chunk of a file
This was the hardest part to implement. I searched for Node.js unzip plugins. I prefer using the most downloaded plugins, because they have more support and are less likely to be abandoned. When a lot of developers use the same plugin, it tends to have fewer errors and there are more examples to copy from.
The most popular unzip plugin is YAUZL (Yet Another Unzip Library). It has 6 million downloads per week. The problem with YAUZL is that it doesn't support streaming a file. In its README, YAUZL states:
Due to the design of the .zip file format, it’s impossible to interpret a .zip file from start to finish (such as from a readable stream) without sacrificing correctness. https://www.npmjs.com/package/yauzl#no-streaming-unzip-api
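The reason behind this statement is that the authoritative list of files in a .zip archive, the central directory, sits at the end of the file. The sketch below illustrates that by locating the End of Central Directory record in a buffer that holds a whole archive. It is a simplified illustration and not how YAUZL or any other library is implemented: the function name is made up, and ZIP64 archives and archive comments that happen to contain the signature bytes are ignored.

```javascript
const EOCD_SIGNATURE = 0x06054b50; // the bytes "PK\x05\x06" read as a little-endian uint32
const EOCD_MIN_SIZE = 22; // size of the record when the archive comment is empty

// Scan backwards from the end of the archive for the End of Central Directory record.
function findEndOfCentralDirectory(zipBytes) {
  for (let pos = zipBytes.length - EOCD_MIN_SIZE; pos >= 0; pos--) {
    if (zipBytes.readUInt32LE(pos) === EOCD_SIGNATURE) {
      return {
        position: pos,
        totalEntries: zipBytes.readUInt16LE(pos + 10),
        centralDirectoryOffset: zipBytes.readUInt32LE(pos + 16),
      };
    }
  }
  return null; // not a zip archive, or the end of the file is missing
}

// An empty zip archive consists of nothing but this 22-byte record.
const emptyZip = Buffer.alloc(EOCD_MIN_SIZE);
emptyZip.writeUInt32LE(EOCD_SIGNATURE, 0);
console.log(findEndOfCentralDirectory(emptyZip));
```

A reader that has only seen the first bytes of a stream cannot run this lookup, so a streaming parser has to rely on the local file headers it meets along the way.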
I tried a lot of other plugins, but most of them cannot unzip a streamed file and stream the result back.
The plugin that can unzip a file from a stream and stream the result back is Unzipper: https://github.com/ZJONSSON/node-unzipper
A code example shown on the plugin page fits my task perfectly:
fs.createReadStream('path/to/archive.zip') // stream the zip file
  .pipe(unzipper.ParseOne()) // unzip the first file in the archive
  .pipe(fs.createWriteStream('firstFile.txt')); // stream the unzipped data out
In the code above I needed to replace “fs” with “s3”. It took me two days to come up with the code, because I had not worked with Node streams before [https://nodejs.org/api/stream.html]. Even after writing the code, I was not sure whether the data was really streamed or was buffered in memory before being passed on. After talking to Ziggy Jonsson, the author of the plugin, I was able to resolve the issue. I truly appreciate his help.
Here is the unzipping part of the code, including the streaming from s3.
const body = s3
  .getObject(params)
  .createReadStream()
  .pipe(unzipper.ParseOne(filePath, { forceStream: true }))
  .on("end", () => {
    console.log("end");
  });
The code above streams a single file from S3 and unzips it as the data arrives. The memory usage is less than 200MB, which fits well within the Lambda memory limit.
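One way to convince yourself that a pipeline streams instead of buffering is to push data through a decompression step and only count what comes out. The sketch below does that with gzip from Node's built-in zlib module. This is an assumption made to keep the example free of dependencies: gzip is only a stand-in for the zip parser, and the helper names are made up.

```javascript
const { Readable, Writable, pipeline } = require("stream");
const zlib = require("zlib");

// Produce `totalBytes` of zeros in 64 KiB pieces (stand-in for a large object).
function zeros(totalBytes) {
  let sent = 0;
  return new Readable({
    read() {
      if (sent >= totalBytes) {
        this.push(null);
        return;
      }
      const size = Math.min(64 * 1024, totalBytes - sent);
      sent += size;
      this.push(Buffer.alloc(size));
    },
  });
}

// Compress, immediately decompress, and count the output without storing it.
function roundTrip(totalBytes) {
  return new Promise((resolve, reject) => {
    let outputBytes = 0;
    const counter = new Writable({
      write(chunk, _encoding, callback) {
        outputBytes += chunk.length; // the chunk is dropped after counting
        callback();
      },
    });
    pipeline(zeros(totalBytes), zlib.createGzip(), zlib.createGunzip(), counter, (err) =>
      err ? reject(err) : resolve(outputBytes)
    );
  });
}

roundTrip(5 * 1024 * 1024).then((bytes) => console.log("bytes out:", bytes));
```

Each stage only holds the chunks currently in flight, and backpressure slows the source down when a later stage falls behind. Logging process.memoryUsage().rss while a much larger input runs is a simple way to watch this.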
Step 3: Stream an unzipped chunk of a file back to S3
This part is also supported by AWS SDK natively.
s3.upload({ Bucket: bucketName, Key: filePath, Body: body }).promise()
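Steps 1 to 3 can be wired together in one function. The sketch below is one possible arrangement, not the code from my project: the function name and its option names are made up, `s3` is assumed to be an AWS SDK v2 S3 client, and `parseOne` is assumed to be unzipper.ParseOne. Both are passed in as arguments so the function can be tried out with fakes.

```javascript
// Stream one entry of a zip file stored in S3 back to S3 without holding it in memory.
// Assumptions: `s3` is an AWS SDK v2 S3 client, `parseOne` is unzipper.ParseOne.
function unzipEntryToS3(s3, parseOne, { bucket, zipKey, entryPath, targetKey }) {
  const source = s3.getObject({ Bucket: bucket, Key: zipKey }).createReadStream();
  const body = source.pipe(parseOne(entryPath));
  // .pipe() does not forward errors, so hand read failures over to the upload body.
  source.on("error", (err) => body.emit("error", err));
  // s3.upload() accepts a stream as Body and resolves when the upload is complete.
  return s3.upload({ Bucket: bucket, Key: targetKey, Body: body }).promise();
}
```

In the Lambda handler this would be called roughly as unzipEntryToS3(s3, unzipper.ParseOne, { bucket, zipKey, entryPath, targetKey }), and the returned promise awaited before the notification in Step 4 is sent.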
Step 4: Send an email when the file is unzipped
This is a straightforward script from the AWS SDK example pages:
AWS.config.update({ region: "us-east-1" });

let message = {
  Message: JSON.stringify({ Bucket: bucketName, Key: filePath }),
  TopicArn: "<YOUR-TOPIC-ARN>"
};
var publishTextPromise = new AWS.SNS({ apiVersion: "2010-03-31" })
  .publish(message)
  .promise();
<YOUR-TOPIC-ARN> is an ARN of an SNS topic. I will show how to create and get an ARN of a topic below.
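If the notification is sent from more than one place, it may help to build the publish parameters in a small function. This is only a sketch: the function name, the Subject text, and the ARN in the usage line are made up for illustration.

```javascript
// Build the parameters for SNS publish() describing one unzipped file.
function buildUnzipNotification(topicArn, bucketName, filePath) {
  return {
    TopicArn: topicArn,
    Subject: "File unzipped", // shown as the subject line of email notifications
    Message: JSON.stringify({ Bucket: bucketName, Key: filePath }),
  };
}

// Hypothetical ARN, bucket, and key, for illustration only.
const exampleParams = buildUnzipNotification(
  "arn:aws:sns:us-east-1:123456789012:example-topic",
  "example-bucket",
  "data/report.csv"
);
console.log(exampleParams.Message);
```

The returned object can be passed to publish() in place of the message object shown above.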
Step 4.1: Create an SNS Topic
- Log in to the AWS Console and go to the SNS Topics tab: https://console.aws.amazon.com/sns/v3/home#/topics
- Click “Create topic” in the top right corner of the page and fill in the Name field
- On the created topic, click “Create subscription”
- You can choose any of the notification types from the drop-down list. You can send an email or an SMS, trigger a Lambda that continues working with the extracted file, or create an SQS message. If you want to invoke a Lambda function, I suggest using SQS. SQS ensures that the message is not lost if a directly invoked Lambda throws an error in the middle of processing.
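If you go the SQS route, the Lambda that reads from the queue receives the SNS notification wrapped in the SQS message body. The sketch below unwraps it. The function name is made up, and it assumes the subscription uses the default setting, with raw message delivery turned off, so that each body is a JSON envelope whose Message field holds the string published in Step 4.

```javascript
// Turn an SQS-triggered Lambda event into a list of { Bucket, Key } objects.
// Assumption: raw message delivery is disabled on the SNS subscription.
function readUnzipNotifications(sqsEvent) {
  return sqsEvent.Records.map((record) => {
    const envelope = JSON.parse(record.body); // the SNS notification wrapper
    return JSON.parse(envelope.Message); // the { Bucket, Key } payload from Step 4
  });
}
```

If the follow-up Lambda throws while handling a record, the message returns to the queue after its visibility timeout and is delivered again.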
Github link
Here is the link to a working GitHub file:
PS: I would appreciate it if you left a comment. This is my first story on Medium.