October 06, 2024

Current status of a datalake

So what have I been doing for the past couple of months? I have been working on, and learning a lot about, datalake technology on AWS. This datalake stuff is very interesting because I needed to convert data from a PostgreSQL database to Parquet files. A container is started using AWS Batch; it connects to the database and queries the data belonging to a specific customer. The data is then converted to Parquet, laid out as CUSTOMER/YEAR/MONTH/file.parquet, and uploaded to S3. When the conversion process is done, an AWS Glue Crawler runs through the data and makes it searchable through AWS Athena. So what we have here is a combination of technologies: AWS Batch for data processing, S3 for data storage, AWS Glue to organize the data, and AWS Athena to make the data searchable through SQL.
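To make the CUSTOMER/YEAR/MONTH/file.parquet layout a bit more concrete, here is a minimal sketch of how rows could be grouped into those S3 keys before conversion. It is not the actual job code: the function names, the `created_at` field, the `acme` customer, and the zero-padded month are all assumptions made for illustration, and the PostgreSQL query, the Parquet writing, and the S3 upload are left out.

```python
from collections import defaultdict
from datetime import date


def partition_key(customer: str, day: date, filename: str = "file.parquet") -> str:
    """Build an object key in the CUSTOMER/YEAR/MONTH/file.parquet layout.

    Zero-padding the month is an assumption; the real layout may differ.
    """
    return f"{customer}/{day.year:04d}/{day.month:02d}/{filename}"


def group_rows_by_partition(customer: str, rows: list, date_field: str = "created_at") -> dict:
    """Group rows (dicts) by the key of the file they would end up in.

    `date_field` is a hypothetical column name holding a date or datetime.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[partition_key(customer, row[date_field])].append(row)
    return dict(groups)


# Hypothetical rows, standing in for the result of the customer query.
rows = [
    {"id": 1, "created_at": date(2024, 8, 3)},
    {"id": 2, "created_at": date(2024, 8, 19)},
    {"id": 3, "created_at": date(2024, 9, 1)},
]

groups = group_rows_by_partition("acme", rows)
for key, members in sorted(groups.items()):
    print(key, len(members))
```

Each group would then be written out as one Parquet file and uploaded under its key, so that the Glue Crawler finds a predictable folder structure per customer.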

Is it working? Yes, it is! It is working pretty well. One thing that I have not been able to do is make the process faster when querying the database and converting the data to Parquet. It is a slow process, but that is just how it is.