This blog details my architecture and design choices for a system that automates tasks like processing Excel files and web scraping. The system lets its users run jobs on demand, fully automated, in the cloud.
Architecture
Key Components
Web Application
The web application is built with Next.js and deployed to AWS using sst.dev, which serves static assets and pages from S3 and runs anything dynamic (API routes, middleware, etc.) on Lambda. The website is continuously deployed using GitHub Actions, and all logs are available in CloudWatch in case anything goes wrong. The web application has two functions: managing users and managing operations.
EC2 Worker
An EC2 worker is dynamically spun up for each new operation by the web application backend using the AWS CDK (why a new instance for each operation? I'll answer that later). The backend server pulls the user-data file from S3 and feeds it to the EC2 instance, which takes over from there. This allows the user data to be updated to change the instance's behavior or to incorporate temporary changes until a new AMI can be deployed.
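To make the launch step concrete, here is a rough sketch of "pull the user data from S3 and start a worker". In practice this goes through the AWS CDK, so treat the boto3 version below as an illustration only; the bucket, key, AMI ID, instance type, and subnet ID are all placeholders.

```python
import boto3

s3 = boto3.client("s3")
ec2 = boto3.client("ec2")

def launch_worker(operation_id: str) -> str:
    # Pull the latest user-data script from S3 (bucket and key are placeholders)
    obj = s3.get_object(Bucket="automation-config", Key="worker/user-data.sh")
    user_data = obj["Body"].read().decode("utf-8")

    # Launch one dedicated worker for this operation from the custom AMI
    resp = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",       # placeholder custom AMI baked with Ansible
        InstanceType="c6i.2xlarge",            # placeholder size for the CPU-heavy workload
        MinCount=1,
        MaxCount=1,
        SubnetId="subnet-0123456789abcdef0",   # placeholder private VPC subnet
        UserData=user_data,
        TagSpecifications=[{
            "ResourceType": "instance",
            "Tags": [{"Key": "operation_id", "Value": operation_id}],
        }],
    )
    return resp["Instances"][0]["InstanceId"]
```

Because the user-data script is fetched fresh on every launch, updating the object in S3 is enough to change the worker's startup behavior without touching the AMI.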
Custom AMI
The custom AMI is created using Ansible so that changes can be made and audited easily. It uses Amazon Linux 2023 as its base. It is configured with access to the git repository, secrets access via Doppler, and the ability to sync dependencies that may be added or changed using pipenv (pip libraries and Python versions). It is also configured to ship logs to Grafana Loki via Promtail, which can then be viewed from (in our case) Grafana Cloud.
The Automation Program
This is the program, written in Python, that runs on the EC2 instance and does all the heavy lifting. It is responsible for downloading the input file from S3, performing various operations, and uploading the resulting file back to S3. What kind of operations? Mainly two types: Excel file operations and web scraping. The details of how these operations work are not really important; just know that some of them can end up using 100-200 threads.
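The real Excel and scraping logic is not important here, but the overall shape of the program is simple enough to sketch: download the input, fan the work out across a thread pool, upload the result. The snippet below is a simplified stand-in (plain text instead of Excel, and a dummy per-item function), not the actual implementation.

```python
import boto3
from concurrent.futures import ThreadPoolExecutor

s3 = boto3.client("s3")

def process_item(item: str) -> str:
    # Placeholder for the real Excel / web-scraping work on one unit of input
    return item.upper()

def run_operation(bucket: str, input_key: str, output_key: str) -> None:
    # Fetch the input file that the web application uploaded
    s3.download_file(bucket, input_key, "/tmp/input.txt")
    with open("/tmp/input.txt") as f:
        items = [line.strip() for line in f]

    # Fan the work out across many threads (the real operations use 100-200)
    with ThreadPoolExecutor(max_workers=150) as pool:
        results = list(pool.map(process_item, items))

    # Upload the result back to S3 for the dashboard to serve
    with open("/tmp/output.txt", "w") as f:
        f.write("\n".join(results))
    s3.upload_file("/tmp/output.txt", bucket, output_key)
```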
Workflow
The user logs into the web interface with one of the IdPs (e.g. Google), which allows authorized users to add users or manage operations.
The user creates an operation and uploads the input file to S3.
The backend server downloads the user-data script from S3 and uses it to create an EC2 instance with the AWS CDK, using the custom AMI and deploying into a private VPC managed with Terraform.
Once the EC2 instance starts up, the user-data script pulls any changes from the git repository, syncs dependencies, and starts the program.
The program pulls the required files from S3, sends a confirmation email to the user via SNS (see the sketch after this list), and starts the operation.
While the program runs, progress is updated in real time on the user dashboard, and logs are shipped to Grafana Loki.
After the operation completes successfully, the output file is uploaded to S3 and a completion email is sent to the user (if any error occurs, an error email is sent instead, along with the error details).
The instance is then terminated.
The user can now download the result from the dashboard.
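The notification steps are simple SNS publishes. Here is a rough sketch; the topic ARN, subject line, and wording are placeholders rather than the exact ones used.

```python
import boto3

sns = boto3.client("sns")

def notify(topic_arn: str, operation_id: str, status: str, detail: str = "") -> None:
    # Publish a status update; email subscribers on the topic receive it
    sns.publish(
        TopicArn=topic_arn,
        Subject=f"Operation {operation_id}: {status}",
        Message=detail or f"Operation {operation_id} is now {status}.",
    )

# Hypothetical usage at the start and end of an operation:
# notify(TOPIC_ARN, "op-42", "started")
# notify(TOPIC_ARN, "op-42", "completed", "The output file is ready on the dashboard.")
```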
Benefits of this approach
Adaptability
Everything from the EC2 startup behavior (via the dynamic user-data script) to the dependencies (via pipenv) and the secrets management (via Doppler) can be adjusted on the fly. This lets me change the entire business logic and approach if needed without rebuilding the AMI, bound only by the language chosen. And since the custom AMI is created with Ansible, making changes to the AMI itself is also easy.
Observability
Each actor can see the stats that are useful to them: the end user can watch the operation's progress update in real time (via Postgres and polling for changes), while administrators can see the application logs in real time in Grafana and, with a little LogQL, easily find anything specific they are looking for. The system also notifies users by email when an operation starts or completes, and when an error occurs (incorrect input format, etc.), so the user always knows the state of the system.
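The progress updates are just writes to Postgres that the dashboard polls. Roughly something like the sketch below; the table and column names are hypothetical, and psycopg2 is just one possible client.

```python
import psycopg2

def report_progress(dsn: str, operation_id: str, done: int, total: int) -> None:
    """Write the worker's progress; the dashboard polls this row for changes."""
    conn = psycopg2.connect(dsn)
    try:
        with conn, conn.cursor() as cur:   # the outer `with` commits the transaction
            cur.execute(
                "UPDATE operations SET progress = %s WHERE id = %s",
                (round(100 * done / total), operation_id),
            )
    finally:
        conn.close()
```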
Why...
create a new instance for every operation?
The program uses multi-threading (upwards of 100-200 threads) and maxes out the CPU of the EC2 instance (around 90% utilization, after testing different instance sizes) for almost every operation type, so running multiple operations on a single instance would bring no benefit. Secondly, the client's priority was completing operations quickly, and they were willing to pay a little extra for the "extra" performance.
not use Moby Dock?
Docker simplifies dependency management and prevents dependency hell, which is useful in a lot of scenarios, including mine. So why not use it? Well... because I didn't have to. I found an Ansible role that correctly installs pipenv, the pyenv install was just two steps, and a simple pipenv sync took care of all my dependency management. If this had caused any trouble, I would have turned to Docker, but I didn't need to.
Improvements
This system was not designed for the cloud from the ground up; it is a port of an existing application, and as a result it does not use the cloud to its full potential.
Fault Tolerance
Right now, if an operation fails for external reasons (e.g. an external API failure), the user has to restart it manually. To address this, I could introduce SQS and automatically re-queue the operation when it fails.
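A rough sketch of what that retry path could look like with boto3 and SQS; the queue URL, delay, and error type are hypothetical.

```python
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/retry-operations"  # placeholder

class ExternalServiceError(Exception):
    """Hypothetical error raised when an external API the operation depends on fails."""

def run_with_retry(operation_id: str, run) -> None:
    try:
        run(operation_id)
    except ExternalServiceError:
        # Instead of failing the operation outright, put it back on the queue
        # so a fresh worker can pick it up and retry after a short delay.
        sqs.send_message(
            QueueUrl=QUEUE_URL,
            MessageBody=operation_id,
            DelaySeconds=300,
        )
```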
Cost Optimizations
Each operation can be divided into smaller operations, which can then be grouped into batches of 5-6 and given their own Spot Instance, which can be up to 90% cheaper than regular EC2 instances; I could fall back to regular instances for operations that take too long or are not feasible on Spot. Enabling this strategy would require making the system fault tolerant first, but it would greatly decrease costs and increase speed and performance.
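Requesting Spot capacity would be a small change to the launch call shown earlier. A hedged sketch of the idea, with the same placeholder IDs:

```python
import boto3

ec2 = boto3.client("ec2")

def launch_spot_workers(user_data: str, count: int) -> list[str]:
    # Same launch call as the on-demand worker, but requesting Spot capacity
    resp = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",    # placeholder custom AMI
        InstanceType="c6i.2xlarge",         # placeholder size
        MinCount=count,
        MaxCount=count,
        UserData=user_data,
        InstanceMarketOptions={
            "MarketType": "spot",
            "SpotOptions": {"SpotInstanceType": "one-time"},
        },
    )
    return [i["InstanceId"] for i in resp["Instances"]]
```

Spot capacity can be reclaimed by AWS at any time, which is exactly why this only becomes viable once the fault-tolerance work above is in place.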
So why not implement these improvements now? For one, they would have added a lot of development time and possibly meant redeveloping the application from the ground up. Secondly, there is always something better: a better tool, a better framework. I have been stuck in that cycle before, and it led to a project never getting built, so I tried to build a system that best fulfilled my client's needs within the given time constraints. That being said, these improvements are highly beneficial and may be implemented in the next iteration.
Wrapping up!
This system freed up a lot of time for the company's internal users and also removed the chance of human error (I actually caught errors in multiple operations that had been done manually). It helped me realize the redundancies that have to be built into a production system, and the importance of monitoring your applications and setting up alerts to catch anything that goes wrong (especially when you are dealing with the cloud).
I am still a beginner, and I'm sure there are things I could have done differently and more efficiently. If you have any improvements in mind, please drop them in the comments; I am always ready to learn and improve my solutions. If you want to discuss this with me, reach out on LinkedIn. Thank you for reading my blog!