In the article "17 reasons to build serverless data lake on AWS", we saw the importance of building a data lake on the AWS cloud using serverless components instead of an on-premise solution.
However, there are certain pitfalls associated with this approach, and in this article we will delve into these challenges along with approaches to mitigate them.
Before we jump into them, let's take a look at the main serverless components:
| Ingestion & Storage | Processing & Analytics | Consumption       | Governance |
|---------------------|------------------------|-------------------|------------|
| DataSync            | Glue                   | Athena            | SNS        |
| S3                  | Lambda                 | QuickSight        | SQS        |
| Glacier             | Fargate                | API Gateway       | CloudWatch |
| EFS                 | Kinesis Firehose       | Redshift Spectrum | CloudTrail |
| DynamoDB            | CodePipeline           |                   |            |
| Aurora Serverless   | IAM                    |                   |            |
Let's discuss the disadvantages and the mitigation/avoidance approaches one by one.
The Cost
It might appear surprising, but some serverless components are actually costlier than their regular counterparts for the same configuration and execution time. For example, Glue costs $0.44 per DPU-hour, while EMR for the same configuration costs only $0.26. The reason is not difficult to understand: the maintenance burden of EMR is on the customer, whereas for Glue the responsibility lies with AWS and is therefore included in the price.
EMR needs considerable effort to configure and run. However, automation can reduce this burden to a large extent, and spending some time and effort to create a framework for provisioning a just-in-time transient EMR cluster can give a significant cost advantage in the long run.
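As a rough illustration of what such a framework automates, the sketch below launches a transient EMR cluster with boto3 that runs one Spark step and terminates itself when the step finishes. The bucket paths, instance types, and step script are illustrative placeholders, not a prescription.

```python
# Minimal sketch of provisioning a just-in-time transient EMR cluster with boto3.
# S3 paths, instance types and the step script are hypothetical placeholders.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="transient-etl-cluster",
    ReleaseLabel="emr-6.3.0",
    Applications=[{"Name": "Spark"}],
    LogUri="s3://my-bucket/emr-logs/",  # placeholder bucket
    Instances={
        "InstanceGroups": [
            {"Name": "Master", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        # Transient behaviour: terminate the cluster as soon as all steps finish.
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    Steps=[{
        "Name": "daily-etl",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-bucket/jobs/etl_job.py"],  # placeholder script
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Started cluster:", response["JobFlowId"])
```

Wrapping this call in a reusable module (with parameterized cluster size, bootstrap actions, and tagging) is what turns a one-off script into the provisioning framework described above.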
Recommendation: Go for a judicious mix depending on the workload. If the workload is small, infrequent, and ad hoc in nature, continue with Glue. However, for a steady, regular execution pattern, start with Glue in small-volume PoC mode but migrate to EMR, with a reusable framework for transient clusters, once the decision is made to productionize and scale.
Similar considerations exist for Lambda vs. ECS/EKS/EC2 as well. Athena runs a PrestoDB cluster under the hood and costs $5 per TB scanned. If the organization has only a few queries to run per day, Athena is a great choice. But if the data scan volume is higher, more frequent, and regular in nature, running your own Presto cluster on AWS turns out to be the cheaper option.
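A quick back-of-the-envelope comparison makes the break-even point concrete. The Athena price is the published $5 per TB scanned; the monthly cost of a self-managed Presto cluster below is an assumed figure and will vary widely with instance types, cluster size, and pricing options.

```python
# Rough break-even estimate: Athena ($5 per TB scanned) vs. a self-managed
# Presto cluster on EC2. The cluster cost is a hypothetical monthly figure.
ATHENA_PRICE_PER_TB = 5.00          # published Athena price per TB scanned
presto_cluster_monthly_cost = 3000  # assumed EC2 + EBS + ops cost, always-on cluster

def athena_monthly_cost(tb_scanned_per_day: float, days: int = 30) -> float:
    return tb_scanned_per_day * days * ATHENA_PRICE_PER_TB

for tb_per_day in (1, 5, 20, 50):
    cost = athena_monthly_cost(tb_per_day)
    cheaper = "Athena" if cost < presto_cluster_monthly_cost else "own Presto cluster"
    print(f"{tb_per_day:>3} TB/day scanned -> Athena ~${cost:,.0f}/month; cheaper: {cheaper}")
```

Under these assumed numbers, Athena wins at a few TB scanned per day, while a steady scan volume of tens of TB per day tips the balance towards a dedicated cluster.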
The Restrictions
Many serverless components come with significant limitations and restrictions.
The maximum execution time for a Lambda function is 900 seconds, the maximum memory size is 3 GB, and the maximum size of the deployable code package is 50 MB.
Athena is based on version 0.172 of the Presto engine. Since then, around 70 more releases of Presto have broadened its functionality and improved its performance to a much larger extent, but no such upgrade has taken place for Athena. As a result, a large number of Presto features do not exist in Athena.
Similarly, Glue offers only one standard node configuration option, while EMR offers multiple configurations with a much larger variety of tool sets. Glue also has less functionality and fewer built-in components and libraries compared to EMR. Moreover, EMR can run native Spark code as is, while small modifications are needed to run it on Glue, requiring around 5 to 10% extra effort to convert code back and forth.
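The sketch below shows the kind of modification involved: a native Spark job needs only a SparkSession, whereas the Glue version gains the GlueContext/Job wrapper and DynamicFrame conversions from the standard awsglue library. The database and table names are illustrative placeholders.

```python
# Sketch of the boilerplate a native Spark job typically gains when ported to Glue.
# A plain spark-submit job on EMR would only need a SparkSession; the
# GlueContext/Job wrapper and DynamicFrame conversion are Glue-specific additions.
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Glue reads via DynamicFrames from the Data Catalog; names are placeholders.
dyf = glue_context.create_dynamic_frame.from_catalog(database="sales_db", table_name="orders")
df = dyf.toDF()  # convert to a regular Spark DataFrame for native transformations

# ... native Spark transformations stay largely unchanged ...

job.commit()
```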
Finally, purpose-built enterprise ETL and visualization tools such as Informatica, DataStage, Denodo, and Tableau provide many more features, richer functionality, ready-made connectors for diverse data source systems, and overall greater ease of use for the DevOps team in their narrow areas of usage compared to AWS serverless components. Many of these tools are available directly via AWS Marketplace and integrate quite well with the AWS ecosystem.
All of the above need to be kept in mind while making architectural decisions. This list is in no way exhaustive, and a huge number of such limitations exist.
Recommendation: Again, a judicious mix of serverless, non-serverless, and non-AWS components gives the best result for a large organization with strong IT capabilities. While architecting, always think serverless-first, but be ready to switch to other options if the cost-benefit analysis points that way.
The Performance
The infrastructure provisioning of serverless components happens on demand and hence almost always ends up with a higher initiation time, because unlike the dev team, AWS, as the infrastructure provider, does not know much about the application's usage pattern in advance. While in some cases this time is minuscule, e.g. a few milliseconds for a Lambda function, a Glue job sits at the other end of the spectrum, typically taking 10 to 15 minutes to start execution, and that time must be budgeted for in the overall completion time.
The same issue applies to scaling up and down: there is always a lag because the response is reactive in nature. While the reaction time is far shorter than any human could manage, it is still no substitute for the application developers' knowledge of the intended usage times and expected volumes.
Recommendation: Multiple solutions exist for these issues; one common approach is a pre-warm-up routine that provisions infrastructure in advance when a faster response is needed. Architects and developers must be aware of these intricate details and take the appropriate mitigation or avoidance approach.
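For Lambda, a common pre-warm pattern is a scheduled CloudWatch Events / EventBridge rule that pings the function every few minutes so a warm execution environment stays available. The sketch below assumes such a rule sends a payload like {"warmup": true}; that key is simply a convention chosen for this example.

```python
# Sketch of a Lambda pre-warm handler. A scheduled rule invokes the function
# periodically with {"warmup": true}; the handler returns immediately for those
# pings so a warm execution environment is kept around for real requests.
def handler(event, context):
    if isinstance(event, dict) and event.get("warmup"):
        return {"status": "warmed"}  # no business logic on warm-up pings

    # ... normal request handling goes here ...
    return {"status": "ok"}
```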
The DevOps team
The common orchestration, monitoring, and debugging tools in AWS cater to a far larger, mixed set of IT use cases. This means traditional data analytics developers need to adapt to a different set of tools than they are typically used to. Also, in the case of serverless, they do not have direct access to the underlying infrastructure components, making testing and debugging far more complex. Hence, there is a significant learning curve for developers coming from the traditional analytics world.
On top of this, AWS puts significant low-level, powerful capabilities in the hands of developers, with cost charged per usage. An inexperienced developer writing non-optimal code will incur significantly higher cost and get lower performance from the system. For example, adding columns that are not required for the end result to an Athena query increases the amount of data scanned and hence the cost many-fold. Glue execution is charged in 10-minute increments, and inefficient Glue code that causes unnecessary data shuffles results in both longer execution times and much higher cost.
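The Athena point is worth seeing concretely. On columnar formats such as Parquet or ORC, projecting only the columns you need means Athena reads, and bills for, far fewer bytes than a SELECT *. The table, columns, and output bucket below are hypothetical; the boto3 start_query_execution call is the standard way to submit Athena queries programmatically.

```python
# Illustration of column pruning on Athena with columnar data (Parquet/ORC).
# Table, column and bucket names are hypothetical placeholders.
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Scans every column -> bills for the full width of the table.
wasteful = "SELECT * FROM sales_db.orders WHERE order_date = DATE '2021-06-01'"

# Scans only two columns -> Athena reads (and bills for) far fewer bytes.
frugal = ("SELECT order_id, amount FROM sales_db.orders "
          "WHERE order_date = DATE '2021-06-01'")

athena.start_query_execution(
    QueryString=frugal,
    QueryExecutionContext={"Database": "sales_db"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},  # placeholder
)
```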
The fine-grained attribution of cost across various parameters requires deep knowledge of AWS as well as a high level of past experience with data lake platforms to estimate the cost upfront. For example, the cost of S3 usage depends not only on the volume of data but also on the number of read/write requests and the amount of data ingress and egress. DynamoDB cost depends on the volume of data, indexes, caching, on-demand and scheduled backups, as well as the number of read, write, and restore operations. This kind of fine-grained cost attribution is good for accountability but makes the task of estimation that much more complicated and specialized.
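A small worked estimate shows how quickly the non-storage drivers add up for S3. The unit prices below are indicative us-east-1 list prices for the S3 Standard tier and internet egress; treat them as assumptions that vary by region, tier, and over time.

```python
# Back-of-the-envelope S3 cost estimate showing that storage volume is only one
# driver. Unit prices are indicative us-east-1 list prices, used as assumptions.
storage_gb   = 5_000        # data held in S3 Standard
put_requests = 2_000_000    # writes per month
get_requests = 10_000_000   # reads per month
egress_gb    = 500          # data transferred out to the internet per month

cost = (
    storage_gb   * 0.023 +              # ~$0.023 per GB-month (S3 Standard)
    put_requests / 1_000 * 0.005 +      # ~$0.005 per 1,000 PUT/COPY/POST requests
    get_requests / 1_000 * 0.0004 +     # ~$0.0004 per 1,000 GET requests
    egress_gb    * 0.09                 # ~$0.09 per GB of internet egress
)
print(f"Estimated monthly S3 cost: ${cost:,.2f}")
```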
Recommendation: There is no magic bullet here other than deep knowledge and experience in this area. Mature IT organizations and service providers have built reusable accelerators and frameworks embodying best practices based on experience to reduce or eliminate the chance of mistakes. Partnering with them increases the chances of success manifold.
Conclusion
Serverless created huge excitement when it first appeared on the scene via S3 and Lambda. However, not all the pitfalls were apparent, and many early adopters were badly burnt. Over the years, the service offerings and the knowledge around them have matured, and now, with experienced partners, the true benefit of a serverless data lake on AWS can be realized while deftly avoiding the pitfalls and traps. Once deployed, it is one of the key ingredients for the continued success of an organization, evolving over time and remaining relevant by bringing in new insights time and again.
Swagata De Khan
Senior Architect - Data, Analytics & AI, Wipro Limited.
Swagata has over 20 years of data warehousing experience and has successfully executed large engagements for global MNCs. He is an AWS Certified Solutions Architect and currently focuses on solutions involving Cloud, Integration Technologies, Artificial Intelligence, and Machine Learning.