Twenty years ago, I had zero programming experience. Today, I'm a data architect who has watched countless technologies rise and fall while SQL remained unshakeable. From converting COBOL programs to solving critical performance crises, SQL became my career passport through developer, analyst, engineer, and architect roles.
From Documents to Rows: Our Journey Migrating MongoDB to SQL Server in AWS
When I received a MongoDB to SQL Server migration requirement in 2019, I had to pause. This wasn't a typical relational-to-relational migration: it meant transforming flexible documents into rigid, normalized tables. Here's how I bridged two completely different data paradigms using Talend and what I learned about heterogeneous database migrations.
Loading 65 Million Records into Cosmos DB: A Weekend Data Migration Journey
Migrating 65 million records into Azure Cosmos DB seemed impossible with our 1,000 RU/s limit. Through strategic planning, temporary scaling to 10,000 RU/s, and processing the data in 15 batches with Azure Databricks, we completed the migration in 30 hours over a weekend, achieving 100% data integrity while maintaining security and cost efficiency.
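To make the batching idea concrete, here is a minimal sketch of how a 65-million-record load could be divided into 15 contiguous batches. The excerpt does not say how the batches were actually sized, so the near-equal split and the helper name `batch_ranges` are assumptions for illustration only.

```python
def batch_ranges(total_records, batch_count):
    """Split [0, total_records) into batch_count contiguous, near-equal ranges.

    Hypothetical helper: the original migration may have sized its
    batches differently (for example by partition key or by file).
    """
    base, extra = divmod(total_records, batch_count)
    ranges = []
    start = 0
    for i in range(batch_count):
        # The first `extra` batches each take one additional record
        # so that every record lands in exactly one batch.
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


if __name__ == "__main__":
    for number, (lo, hi) in enumerate(batch_ranges(65_000_000, 15), start=1):
        print(f"batch {number:2d}: records {lo:>10,} to {hi - 1:>10,}")
```

Each range is half-open (start inclusive, end exclusive), so consecutive batches never overlap and a failed batch can be rerun on its own.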
Automate Data Security: Azure Logic Apps for SFTP Uploads
In the digital age, protecting data at every stage is essential, particularly for organizations handling sensitive or regulated information. One crucial aspect of data security is ensuring that files entering an organization's system are safe from malware and other threats. Automated file scanning at the point of entry is a robust strategy that can secure...
Optimize Delta Lake Storage with VACUUM Command
As a data engineer managing batch file processing with Databricks, I recently encountered a storage issue that many teams face: rapidly increasing storage volume. In this blog, I'll share the challenge I faced with my Delta Lake storage, how I resolved it, and the benefits I gained by implementing Databricks' VACUUM command to manage storage...
Understanding COUNT(*) vs COUNT(1) in SQL
COUNT(*) and COUNT(1) serve to count rows in SQL, yielding the same results but with nuanced internal processing. Modern SQL engines treat both functions similarly, resulting in negligible performance differences. COUNT(*) is preferred for clarity, while COUNT(1) is used out of habit. COUNT(column_name) counts non-NULL values in a specific column.
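A quick way to see the difference is to run all three forms against a table that contains NULLs. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the `orders` table and `coupon_code` column are made up for this illustration, and other engines should return the same counts for the same data.

```python
import sqlite3


def count_variants(coupon_codes):
    """Return (COUNT(*), COUNT(1), COUNT(coupon_code)) for the given values.

    The table and column names are hypothetical and exist only for this demo.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE orders (id INTEGER PRIMARY KEY, coupon_code TEXT)"
        )
        conn.executemany(
            "INSERT INTO orders (coupon_code) VALUES (?)",
            [(code,) for code in coupon_codes],
        )
        # COUNT(*) and COUNT(1) count every row; COUNT(column) skips NULLs.
        return conn.execute(
            "SELECT COUNT(*), COUNT(1), COUNT(coupon_code) FROM orders"
        ).fetchone()
    finally:
        conn.close()


if __name__ == "__main__":
    # Two of the four rows have no coupon code (NULL).
    print(count_variants(["SAVE10", None, "WELCOME", None]))
```

With two NULL coupon codes out of four rows, the first two counts are 4 and the third is 2.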
Optimizing Parallel Data Loads to Delta Lake: A Concurrency Issue Solution
The data lake architecture utilizes SFTP for data uploads from multiple customers, requiring parallel file loading into Delta Lake. Concurrency issues arose during merging operations, primarily due to simultaneous updates. The team implemented table partitioning by Customer ID and added retry logic to mitigate conflicts, planning a future upgrade to Databricks Runtime 15.4.
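The retry idea can be sketched in plain Python. This is an illustration under stated assumptions, not the team's actual code: `ConcurrentWriteConflict` is a hypothetical stand-in for whatever conflict error the merge raises, and the attempt count and delays are placeholder values.

```python
import random
import time


class ConcurrentWriteConflict(Exception):
    """Hypothetical stand-in for a concurrent-write conflict raised by a merge."""


def run_with_retry(operation, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call operation(), retrying on conflicts with exponential backoff and jitter.

    `operation` is any zero-argument callable, for example a function that
    merges one customer's file into the target table.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConcurrentWriteConflict:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the conflict to the caller.
            # Double the wait on each attempt and add random jitter so that
            # parallel loaders do not all retry at the same moment.
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, base_delay)
            sleep(delay)


if __name__ == "__main__":
    attempts = {"count": 0}

    def flaky_merge():
        # Simulated merge that conflicts twice before succeeding.
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise ConcurrentWriteConflict("simulated conflict")
        return "merged"

    print(run_with_retry(flaky_merge, base_delay=0.1))
```

Retries complement partitioning rather than replace it: partitioning by Customer ID reduces how often writers collide, and the backoff handles the collisions that still occur.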