Optimize Delta Lake Storage with VACUUM Command

As a data engineer managing batch file processing with Databricks, I recently ran into a storage problem that many teams face: rapidly growing storage volume. In this post, I'll share the challenge I faced with my Delta Lake storage, how I resolved it, and the benefits I gained by using Databricks' VACUUM command to manage storage...

Optimizing Parallel Data Loads to Delta Lake: A Concurrency Issue Solution

The data lake architecture uses SFTP for data uploads from multiple customers, which requires loading files into Delta Lake in parallel. Concurrency conflicts arose during merge operations, primarily due to simultaneous updates. To mitigate them, the team partitioned the table by Customer ID and added retry logic, and it plans a future upgrade to Databricks Runtime 15.4.
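
As a rough illustration of the retry logic mentioned above, here is a minimal Python sketch, not the team's actual implementation. It assumes the merge is wrapped in a callable and retried with exponential backoff and jitter. The names `run_with_retry` and `MergeConflictError` are hypothetical; in a real Databricks job you would pass the Delta Lake concurrency exception types you actually observe via `retry_on`.

```python
import random
import time


class MergeConflictError(Exception):
    """Hypothetical stand-in for a Delta Lake concurrent-write conflict."""


def run_with_retry(operation, max_attempts=5, base_delay=1.0,
                   retry_on=(MergeConflictError,), sleep=time.sleep):
    """Call operation(), retrying on conflict with exponential backoff.

    operation: zero-argument callable that performs one merge (assumed).
    retry_on:  exception types treated as retryable conflicts.
    sleep:     injectable so the backoff can be observed in tests.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                # Out of attempts: surface the conflict to the caller.
                raise
            # Backoff doubles each attempt; jitter spreads out parallel
            # loaders so they do not all retry at the same moment.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 2))
```

In practice, `operation` would run the merge for one customer's file. Partitioning by Customer ID is meant to keep parallel merges from touching the same data, so the retry acts as a fallback for the conflicts that remain.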
