Company Name | CallEvo |
Case Study Title | Custom Software Serverless Deployment and DevOps |
Case Study Short Description | How Triumph Tech helped CallEvo, a communications company for the political space, use DevOps and Serverless to run and update its custom software. |
Problem / Statement Definition | CallEvo was looking to both deploy and set up DevOps for its custom software. The application was initially developed to run within a monolithic framework. Triumph Tech was brought in to help "decouple" the architecture and speed up application delivery. |
Proposed Solution and Architecture | Triumph chose AWS to provide cost-efficient resources to run CallEvo's custom software and to speed up the delivery of application updates.
· We used a Lambda function to run the backend "API layer," processing requests in response to events from the front end.
· We used S3 and CloudFront to serve static assets for the front end.
· We used Amazon Aurora Serverless (PostgreSQL) as our data layer.
· We used API Gateway to send and receive dynamic content from the Lambda backend.
· We used ElastiCache for Redis to provide a caching layer for requests made to the backend.
· Auth0 was used to register and authenticate users to the application.
· CodePipeline, CodeBuild, and CloudFormation were used to rapidly deploy updates to the application.
· Systems Manager was used to securely store parameters required by the application.
· We used SQS to increase application performance and efficiently process transactions sent and received by the software. |
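As a sketch of the API-layer pattern above: with API Gateway's Lambda proxy integration, the HTTP request arrives as the `event` dict and the returned dict is mapped back to an HTTP response. Function and field names here are illustrative, not CallEvo's actual code:

```python
import json


def handler(event, context):
    """Minimal API Gateway (Lambda proxy integration) handler.

    API Gateway passes the HTTP request in `event`; the dict returned
    here is translated back into an HTTP response.
    """
    path = event.get("path", "/")
    method = event.get("httpMethod", "GET")
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        # Echo the routing details; a real handler would dispatch on them.
        "body": json.dumps({"path": path, "method": method}),
    }
```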
Outcomes of Project & Success Metrics | During the initial discovery phase, we did a deep dive with the client to understand their business requirements so Triumph could build out the right infrastructure and DevOps process.

We discovered that the development team had no process in place to rapidly deploy changes to their application, nor an efficient way of running it. Project metrics were determined post-buildout by executing load tests and measuring the speed at which application changes could be deployed. We needed to know that the environment could handle at least 100 requests per second in under 2,000 ms per request with a low error rate. Additionally, we needed to measure the time required to deploy code changes to the environment.

Locust was used to load test by simulating common user behaviors that trigger read/write operations on the data layer. We simulated 1,000 users at a request rate of 100 RPS and found that the application responded in under 500 ms with a 0% error rate. This was a success.

To test the ability to rapidly push code changes, we measured the time required for CodePipeline to execute all tasks to completion. The total time was under 10 minutes and was therefore deemed a success. We compared this against the time required to manually push changes to the environment, which was around 1 hour and 20 minutes until changes were live. |
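The acceptance criteria above (under 2,000 ms per request with a low error rate at 100 RPS) can be checked mechanically against the latency samples a tool such as Locust collects. A minimal sketch, with thresholds taken from the text and an assumed 1% error-rate ceiling:

```python
def evaluate_load_test(latencies_ms, error_count,
                       max_latency_ms=2000.0, max_error_rate=0.01):
    """Check load-test samples against the acceptance criteria.

    latencies_ms: per-request response times for successful requests.
    error_count:  number of failed requests.
    Returns (passed, p95_ms, error_rate).
    """
    total = len(latencies_ms) + error_count
    error_rate = error_count / total if total else 0.0
    ordered = sorted(latencies_ms)
    # 95th-percentile latency via the nearest-rank method.
    p95 = ordered[max(0, int(0.95 * len(ordered)) - 1)] if ordered else 0.0
    passed = p95 <= max_latency_ms and error_rate <= max_error_rate
    return passed, p95, error_rate
```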
Describe TCO Analysis Performed | TCO was calculated based on the amount of time to manually push code to the environment and the resources required to do so. |
Lessons Learned | Serverless functions are scalable and a cost-effective way of running production workloads.
Automated deployment solutions greatly reduce the amount of time required to deploy code to production. |
Summary Of Customer Environment | Cloud environment is native cloud. The entire stack is running on Amazon Web Services. Stack is being deployed in the US-West-2 region. |
AWS Account Configuration
- Root User is secured and MFA is required. IAM password policy is enforced.
- Operations, Billing, and Security contact email addresses are set and all account contact information, including the root user email address, is set to a corporate email address or phone number.
- AWS CloudTrail is enabled in all regions and logs are stored in S3.
Operational Excellence
Metric Definitions
CodePipeline Health Metrics
If any step within the pipeline fails, notifications are sent to the DevOps channel in Slack. This is achieved via SNS topics and the AWS Chatbot integration with Slack.
Lambda Health Metrics
Lambda health is determined by the success or failure of the Lambda function. The most important metrics are error count and success rate (%).
Lambda health is further defined by using X-Ray application tracing to effectively trace and debug requests made to the Lambda function. Application exceptions (errors) are viewed within X-Ray and help enhance the performance of the application.
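The error-count/success-rate metric described above reduces to simple arithmetic over CloudWatch's `Invocations` and `Errors` counts. A sketch:

```python
def lambda_success_rate(invocations, errors):
    """Success rate (%) derived from CloudWatch's Invocations and
    Errors metrics for a Lambda function.
    """
    if invocations == 0:
        # No traffic: report healthy rather than divide by zero.
        return 100.0
    return round(100.0 * (invocations - errors) / invocations, 2)
```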
Metric Collection and Analytics
We consult clients on best practices for log and metric collection. For application-related logs we prefer an ELK stack: Amazon Elasticsearch Service, Logstash running on EC2, and Kibana. This allows for complete security and granular control over log collection and visualization.
To automate alerting on unhealthy targets of an Application, Network, or Classic Load Balancer, we consult our clients on the use of CloudWatch alarms, SNS notifications, and AWS Lambda. The Lambda function makes a DescribeLoadBalancers or DescribeTargetHealth API call to identify the failed target and the cause of the failure, then triggers an email notification via SNS with the discovered unhealthy host details.
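The parsing step of such a Lambda function can be sketched without any AWS calls. Given a response shaped like the ELBv2 DescribeTargetHealth output, it collects unhealthy targets and their failure reasons for the SNS notification body (a sketch; the surrounding boto3 and SNS wiring is omitted):

```python
def unhealthy_targets(target_health_response):
    """Extract unhealthy targets and failure reasons from an ELBv2
    DescribeTargetHealth-shaped response dict.
    """
    findings = []
    for desc in target_health_response.get("TargetHealthDescriptions", []):
        health = desc.get("TargetHealth", {})
        if health.get("State") != "healthy":
            findings.append({
                "target": desc.get("Target", {}).get("Id"),
                "port": desc.get("Target", {}).get("Port"),
                "reason": health.get("Reason"),
                "description": health.get("Description"),
            })
    return findings
```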
We recommend the use of Grafana running on EC2 and Prometheus for the monitoring of individual workloads running within a stack. EC2, RDS, Container, EKS, and ECS metrics are collected by Prometheus and data visualized via dashboards within Grafana.
Operational Enablement
Enabling the client to manage and maintain the DevOps pipeline after handover is of the utmost importance. Our goal is to minimize required maintenance through automation, so that all members of the development team can simply push code, follow a defined development process, and know that their applications are being tested and rapidly deployed.
Training and handover are always included in scope. This process includes the development of documentation specific to the customer workload. It outlines the development lifecycle from source control and branching all the way through deployment.
We document how to version the IaC modules and templates that were developed and how to push out updates to their infrastructure.
We provide architecture diagrams, which outline the branching strategy and Git workflow.
Lastly, we schedule a video conference and do a "hands-on" session with the client where we go over how to push application updates through the development, staging, and production environments. We cover the development workflow and branching strategy.
- We show the client how to troubleshoot a failed pipeline build within CodeBuild, and where to find all relevant logs for their build and test stages within CodePipeline should failures occur. Once a CI/CD pipeline is properly set up, the majority of DevOps-related troubleshooting is found within the CodeBuild logs and fixed at the application layer. We teach the client how to leverage X-Ray to better troubleshoot and enhance application performance.
- During this video conference we outline common troubleshooting scenarios that the client will run into and show them how to effectively troubleshoot the workload.
- We go over each component of the infrastructure and CI/CD pipeline that was developed with the client and allow them time to ask any questions.
Deployment Testing and Validation
Deployments are tested and validated through a promotion strategy. The only branch that deploys automatically without approval is the development branch, which is deployed to an isolated development environment. There, the team QAs and validates application functionality and approves promotion to the staging environment. A pull request is submitted to source control and merged into staging, and workloads are then deployed to the staging environment. After testing and validation in staging, a pull request is submitted from staging into master and merged. The master branch triggers a build and deployment to production via CodeBuild and CodePipeline.
Version Control
All code assets are version controlled within GitHub.
Application Workload and Telemetry
CloudWatch application logging is integrated by default into all of our container and serverless workloads. We include this as an "in scope" item for all DevOps projects. This provides a centralized system where error logs are captured, aiding operational troubleshooting.
X-Ray is implemented for application request tracing to help the client debug and improve the performance of their workload.
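Because Lambda forwards anything written to stdout into CloudWatch Logs, emitting one JSON object per log line makes those logs searchable with CloudWatch Logs Insights. A minimal structured-logging helper (field names are illustrative, not a prescribed schema):

```python
import json
import time


def log_event(level, message, **fields):
    """Emit one structured (JSON) log line. In Lambda, stdout lands in
    CloudWatch Logs, where JSON lines can be queried via Logs Insights.
    """
    record = {
        "timestamp": int(time.time() * 1000),  # epoch milliseconds
        "level": level,
        "message": message,
        **fields,
    }
    print(json.dumps(record))
    return record
```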
Security: Identity and Access Management
Access Requirements Defined
To discover access requirements, we look at the organizational units within the client's business that require access to DevOps infrastructure. We identified developers, systems engineers, security engineers, and stakeholders, and we follow previously defined best practices for each of these groups.
IAM groups are created for each of these organizational units and least-privilege access is applied: each group is granted access only to what it actually requires.
Developer Policy
Our developer policy looks like this:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "ec2:AuthorizeSecurityGroupIngress",
        "ec2:AuthorizeSecurityGroupEgress",
        "ec2:RevokeSecurityGroupEgress"
      ],
      "Resource": "arn:aws:ec2:*:*:*",
      "Effect": "Allow"
    },
    {
      "Action": [
        "ec2:Describe*",
        "iam:ListInstanceProfiles",
        "mgh:CreateProgressUpdateStream",
        "mgh:ImportMigrationTask",
        "mgh:NotifyMigrationTaskState",
        "mgh:PutResourceAttributes",
        "mgh:AssociateDiscoveredResource",
        "mgh:ListDiscoveredResources",
        "mgh:AssociateCreatedArtifact",
        "discovery:ListConfigurations"
      ],
      "Resource": "*",
      "Effect": "Allow"
    },
    {
      "Action": [
        "ec2:CreateSecurityGroup",
        "ec2:ModifyInstanceAttribute",
        "ec2:CreateTags",
        "ec2:CreateVolume",
        "ec2:AttachVolume",
        "ec2:DetachVolume",
        "ec2:DeleteVolume",
        "ec2:CreateImage"
      ],
      "Resource": "*",
      "Effect": "Allow"
    },
    {
      "Condition": {
        "ForAllValues:StringLike": {
          "ec2:ResourceTag/appenv": [
            "rmmigrate-dta"
          ]
        }
      },
      "Action": [
        "ec2:TerminateInstances",
        "ec2:StartInstances",
        "ec2:StopInstances",
        "ec2:RunInstances"
      ],
      "Resource": "*",
      "Effect": "Allow"
    },
    {
      "Action": "iam:PassRole",
      "Resource": "*",
      "Effect": "Allow"
    }
  ]
}
No processes deployed to AWS infrastructure make use of static AWS credentials. All instances that call other AWS services use IAM roles. The only case where static AWS credentials are used to call AWS services is when third-party integrations cannot assume roles.
Each APN Partner and user of the platform logs into AWS with a unique IAM user or federated login. No root access is permitted. We have a CloudWatch alarm set up that triggers an SNS email notification any time the root user logs in.
Security: Networking
All security groups within the environment meet the following requirements:
- Traffic between the internet and the VPC is restricted
- Traffic within the VPC is restricted
- Access is allowed only from designated security groups
Security IT / Operations
Components which require encryption:
- Lambda environment variables: these are encrypted at rest using KMS.
AWS API Integration
The AWS CLI is used for all programmatic access.
Reliability
Deployment Automation
The deployment process is fully automated. When a change is merged into the master branch from development within GitHub, CodePipeline is triggered. CodePipeline first runs CodeBuild, which compiles application dependencies via pip and requirements.txt, then creates an artifact and a CloudFormation template that drives the deployment of the serverless function via CloudFormation. We use change sets and automatically execute them via CodePipeline.
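The change-set flow described above can be sketched with the equivalent boto3 calls (CodePipeline's CloudFormation actions perform these steps natively; here the client object is injected so the function can be exercised with a stub, and the names are illustrative):

```python
def deploy_via_change_set(cfn, stack_name, template_body, change_set_name):
    """Create and execute a CloudFormation change set: the same
    two-step flow CodePipeline's CloudFormation deploy actions perform.

    `cfn` is a boto3 CloudFormation client (or a stub in tests).
    """
    # Stage the proposed changes without applying them yet.
    cfn.create_change_set(
        StackName=stack_name,
        TemplateBody=template_body,
        ChangeSetName=change_set_name,
        Capabilities=["CAPABILITY_IAM"],
    )
    # Wait until CloudFormation has finished computing the change set.
    waiter = cfn.get_waiter("change_set_create_complete")
    waiter.wait(StackName=stack_name, ChangeSetName=change_set_name)
    # Apply the staged changes to the stack.
    cfn.execute_change_set(StackName=stack_name,
                           ChangeSetName=change_set_name)
    return change_set_name
```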
Availability Requirements
- RTO: Application can be down for a maximum of 3 hours without any significant harm to the business.
- RPO: 24 Hours
Data is backed up every 24 hours, so the worst-case scenario is losing a day of data.
Adapts to Changes in Demand
This application uses Lambda, API Gateway, and Aurora Serverless (PostgreSQL). We use provisioned concurrency for Lambda and autoscaling on Aurora, so the stack responds rapidly to changes in demand.
Cost Optimization
Cost Modelling
We deployed the workload into a development environment and load tested the application using the methods previously described. We then gathered metrics such as execution time and memory allocation to estimate costs. Our Aurora PostgreSQL costs were determined based on the capacity units required by the application over a 24-hour period. Elasticsearch and ElastiCache costs were fixed and included in the TCO analysis for the AWS environment. We found our initial estimate was 95% accurate, as it was only $100 away from the average costs since the workload was deployed to production.
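The Lambda portion of such an estimate is straightforward arithmetic over the measured duration and memory allocation. A sketch, where the default rates are only illustrative approximations of public on-demand pricing and should be replaced with current regional rates:

```python
def estimate_lambda_cost(invocations, avg_duration_ms, memory_mb,
                         price_per_gb_second=0.0000166667,
                         price_per_million_requests=0.20):
    """Rough monthly Lambda cost from measured duration and memory.

    Default rates are illustrative approximations of on-demand
    pricing, not authoritative figures.
    """
    # Compute cost is billed in GB-seconds of execution.
    gb_seconds = (invocations * (avg_duration_ms / 1000.0)
                  * (memory_mb / 1024.0))
    compute = gb_seconds * price_per_gb_second
    # Plus a flat per-request charge.
    requests = invocations / 1_000_000 * price_per_million_requests
    return round(compute + requests, 2)
```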
Looking to implement an Automated Serverless Architecture? Contact one of our Serverless Specialists today.