AWS Data Lake Day

Ebru Cucen
6 min read · Dec 1, 2019

I wanted to share the sessions and what I learnt from the AWS Data Lake Day (27/11/2019) in London. Even though re:Invent is next week, and we are expecting tons of updates and new products there, AWS was not shy about announcing new features today. As they have expressed multiple times, they want to push the boundaries and improve their products through the constant feedback they receive from their customers, and the half-day event definitely proved they are on the right track.

Data is the new oil, and the amount of data we need to handle is increasing tremendously. AWS has a single customer dealing with exabytes of data.

The key takeaway for me was that, with the demands of millions of online users, a data lake helps store structured and unstructured data ready to be consumed by real-time analytics, machine learning algorithms, or dashboards, and we need to make sure it is implemented as a methodical process to avoid data swamps.

I enjoyed all the AWS presentations on the Data Flywheel, Storage, Analytics, and ML/AI. There were also engaging client presentations by Adthera, EDF, and Expedia. Here are my notes from the sessions.

“Data Flywheel”

Marc Trimuschat gave the opening session, explaining how Amazon itself was built on a flywheel and reflecting on what the equivalent Data Flywheel looks like:

  • Move to managed: You can break free from legacy databases by migrating your databases, data warehouses, or streaming data directly into AWS.
  • Run Fully Managed Databases: You can not only save time and cost but also remove the undifferentiated heavy lifting of common database administration tasks. You can improve performance, lower costs, and take advantage of the tools to be introduced in the coming months.
  • Build data-driven apps: You get agility and globally distributed data that can scale with performance. You can choose from purpose-built databases such as Amazon DynamoDB for key-value workloads or Amazon Timestream for time series, depending on your application’s requirements (see the sketch after this list).
  • Analyze your data in the data lake: You get better, faster insights and broader access to them. AWS Lake Formation builds a secure data lake, helping you move, store, catalogue, and clean your data faster, and lets you enforce security policies across multiple services. Amazon QuickSight helps visualise the data with rich dashboards.
  • Innovate: With momentum in the flywheel, you can enable better experiences, deeper engagement, and more efficient processes. Amazon SageMaker helps you build, train, and deploy models quickly at scale.
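
To make the “purpose-built” point concrete, here is a minimal sketch of a key-value access pattern on Amazon DynamoDB using boto3; the table name, key schema, and item are all hypothetical.

```python
import boto3

# Hypothetical table: assumes a partition key named "order_id".
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("orders")

# Key-value writes and reads -- the access pattern DynamoDB is purpose-built for.
table.put_item(Item={"order_id": "o-1001", "status": "shipped", "total": 42})
response = table.get_item(Key={"order_id": "o-1001"})
print(response.get("Item"))
```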

“Storage”

Maribel Rodriguez presented the talk on storage: S3 is the backbone of the data lake, with over 10,000 data lakes already running on S3. On top of AWS Lake Formation, Rodriguez talked about AWS Glue, a fully managed ETL tool whose crawlers scan the data on S3 and populate the Glue Data Catalog, after which your data is ready to be analyzed (a minimal crawler sketch follows).
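
A minimal sketch of wiring up a Glue crawler with boto3; the crawler name, IAM role, catalog database, and S3 path are all assumptions.

```python
import boto3

glue = boto3.client("glue")

# Hypothetical names throughout: crawler, role ARN, database, and S3 path.
glue.create_crawler(
    Name="sales-data-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="sales_lake",  # target database in the Glue Data Catalog
    Targets={"S3Targets": [{"Path": "s3://my-data-lake/sales/"}]},
)

# The crawler scans the S3 objects, infers schemas, and writes table
# definitions into the Glue Data Catalog, ready for Athena or Redshift Spectrum.
glue.start_crawler(Name="sales-data-crawler")
```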

Rodriguez also shared the new announcements made ahead of re:Invent:

  • S3 Intelligent-Tiering: You no longer have to think about the storage access tier. For a small monitoring and automation fee, S3 works it out for you: if an object is not accessed for 30 days, it moves to the Infrequent Access tier, and if it is accessed later, it moves back to the Frequent Access tier (see the sketch after this list).
  • S3 Glacier Deep Archive: If you need long-term storage and don’t want to deal with tape libraries, AWS now offers storage with 12-hour retrieval at pricing competitive with off-premises tape archival services.
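
Opting in to Intelligent-Tiering is a single parameter at upload time; a minimal sketch with boto3, with a hypothetical bucket and key.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket/key. With the INTELLIGENT_TIERING storage class, S3 moves
# the object between the Frequent and Infrequent Access tiers automatically,
# based on its access pattern.
s3.put_object(
    Bucket="my-data-lake",
    Key="raw/events/2019-11-27.json",
    Body=b'{"event": "data-lake-day"}',
    StorageClass="INTELLIGENT_TIERING",
)
```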

The other new storage products in the AWS catalogue this year were Amazon FSx for Lustre, Amazon FSx for Windows File Server (now with Multi-AZ support and integration with self-managed Active Directory), AWS Backup, and EFS Infrequent Access.

“Analyze the Data”

Kaz Janiskowski presented the session on how to analyze data. The highlighted products were:

  • AWS Data Exchange: A platform to easily provide and subscribe to data products and reach millions of AWS customers. Data providers get an easy way to package and publish data products, with built-in security and compliance controls. Data subscribers can quickly find diverse data in one place, efficiently access third-party data, and analyze it easily.
  • Data Movement: AWS provides different options to move your data. To move petabytes of data from on-premises, you can use AWS Snowball or Snowmobile. To ingest streaming data, you can use Kinesis Data Firehose, Kinesis Data/Video Streams, or Managed Streaming for Apache Kafka (see the Firehose sketch after this list).
  • AWS SCT (Schema Conversion Tool): It helps with schema conversion. Not only does it create an assessment report on the compatibility of source databases with open-source engines, but it also recommends the best target engine and estimates the effort required. The conversion attempt covers stored procedures and functions as well as the schema, and it scans and converts embedded in-application code.
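
For the streaming-ingest option, a minimal sketch of pushing a record into Kinesis Data Firehose with boto3; the delivery stream name is hypothetical and assumed to be pre-configured to deliver into an S3 data lake bucket.

```python
import json
import boto3

firehose = boto3.client("firehose")

# Hypothetical stream, set up with an S3 destination.
record = {"user_id": "u-42", "action": "click", "ts": "2019-11-27T10:00:00Z"}
firehose.put_record(
    DeliveryStreamName="clickstream-to-s3",
    Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
)
```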

As a new feature, Kaz shared Athena’s new federated query capability, announced this week, and how it solves the problem of analysing multiple data sources, whether on-premises or in the cloud, from a single query.
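
A minimal sketch of what such a query looks like through boto3; every name here is hypothetical, and "postgres_catalog" stands for a data catalog backed by a Lambda-based federated connector.

```python
import boto3

athena = boto3.client("athena")

# Joins a federated PostgreSQL source with a table in the Glue Data Catalog.
response = athena.start_query_execution(
    QueryString="""
        SELECT c.customer_id, o.total
        FROM postgres_catalog.sales.customers AS c
        JOIN awsdatacatalog.lake.orders AS o
          ON c.customer_id = o.customer_id
        LIMIT 10
    """,
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print(response["QueryExecutionId"])
```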

On Amazon Redshift, the popular cloud data warehouse, Kaz shared this year’s key innovations (a sketch of Elastic Resize follows the list); you can find out more on the AWS site:

  • AWS Lake Formation Integration
  • Spectrum Request Accelerator
  • Concurrency Scaling
  • New Management Console
  • Auto-Vacuum, Auto-Analyze and Auto-Table Sort
  • Snapshot Scheduler
  • Auto Data Distribution
  • Dynamic WLM Concurrency
  • Stored Procedures
  • Improving Short Query Acceleration
  • Faster Cross-Region Copy and Change
  • Query Priorities
  • Deferred Maintenance
  • Elastic Resize
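
For the Elastic Resize item above, a minimal sketch with boto3; the cluster identifier is hypothetical, and Classic=False requests an elastic rather than classic resize, so the cluster changes size in minutes instead of being rebuilt.

```python
import boto3

redshift = boto3.client("redshift")

# Hypothetical cluster. Elastic resize redistributes data slices across the
# new node count rather than re-provisioning the cluster from a snapshot.
redshift.resize_cluster(
    ClusterIdentifier="analytics-cluster",
    NumberOfNodes=4,
    Classic=False,
)
```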

“AI/ML Becoming Commodity”

Dimitri French had a session on ML/AI to explain the landscape on AWS, dividing the services into three categories: ML Frameworks + Infrastructure, ML Services, and AI Services. The product range proved that most ML/AI algorithms are ready to be consumed as commodity products, so I am very excited to watch this space and see the impact of the applications on our daily lives.

ML Frameworks + Infrastructure: If you want to run your own TensorFlow, MXNet, or PyTorch workloads, AWS supports them along with the Gluon and Keras interfaces. You can host your models on your own EC2 instances or on on-premises/portable Kubernetes clusters.

Amazon SageMaker: It is the managed service whose API provides multiple features to enable ML, such as Ground Truth, Notebooks, Algorithms + Marketplace, Reinforcement Learning, Training, Optimization, Deployment, and Hosting. Depending on your workloads and ML workflows, you can integrate any of these offerings (a minimal training-job sketch follows).
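
A minimal sketch of the Training feature through the low-level boto3 API; the job name, role, container image, and S3 paths are all hypothetical placeholders.

```python
import boto3

sm = boto3.client("sagemaker")

# Hypothetical names throughout; the image URI stands in for a built-in
# or custom algorithm container.
sm.create_training_job(
    TrainingJobName="churn-xgboost-2019-11-27",
    AlgorithmSpecification={
        "TrainingImage": "123456789012.dkr.ecr.eu-west-1.amazonaws.com/my-algo:latest",
        "TrainingInputMode": "File",
    },
    RoleArn="arn:aws:iam::123456789012:role/SageMakerRole",
    InputDataConfig=[{
        "ChannelName": "train",
        "DataSource": {"S3DataSource": {
            "S3DataType": "S3Prefix",
            "S3Uri": "s3://my-data-lake/train/",
        }},
    }],
    OutputDataConfig={"S3OutputPath": "s3://my-data-lake/models/"},
    ResourceConfig={
        "InstanceType": "ml.m5.xlarge",
        "InstanceCount": 1,
        "VolumeSizeInGB": 10,
    },
    StoppingCondition={"MaxRuntimeInSeconds": 3600},
)
```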

AI services: AWS has ready-to-use products covering cognitive features such as vision, speech, language, and forecasting, consumable through a single API call (see the Comprehend sketch below).
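
For instance, a minimal sketch of calling Amazon Comprehend for sentiment analysis with boto3; there is no model to train or host.

```python
import boto3

comprehend = boto3.client("comprehend")

# A single API call returns the sentiment label and per-class confidence scores.
result = comprehend.detect_sentiment(
    Text="The Data Lake Day sessions were excellent.",
    LanguageCode="en",
)
print(result["Sentiment"], result["SentimentScore"])
```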

Among the announcements:

Aurora now supports ML by calling SageMaker and Comprehend natively from simple SQL queries. This gives you the ability to run the queries interactively from your codebase, or to store the output of the ML model executions and review it later.
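
A minimal sketch of what this looks like from a codebase, assuming an Aurora MySQL cluster with the Comprehend integration enabled; the endpoint, credentials, and table are hypothetical, and aws_comprehend_detect_sentiment is the built-in function Aurora MySQL exposes for this.

```python
import pymysql

# Hypothetical Aurora MySQL cluster endpoint and schema.
conn = pymysql.connect(
    host="my-cluster.cluster-xyz.eu-west-1.rds.amazonaws.com",
    user="admin",
    password="...",  # placeholder
    database="reviews",
)
with conn.cursor() as cur:
    # Plain SQL, but each row's text is scored by Comprehend behind the scenes.
    cur.execute(
        """
        SELECT review_id,
               aws_comprehend_detect_sentiment(review_text, 'en') AS sentiment
        FROM product_reviews
        LIMIT 10
        """
    )
    for review_id, sentiment in cur.fetchall():
        print(review_id, sentiment)
```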

This was a summary of the products I could gather. I hope it is useful for you too, both to digest what AWS offers today and to gain insight into how we should look at the data lake space.

Note: If you are interested in the re:Invent AI/ML sessions, this is the list you won’t want to miss!

Happy data days!
