About

A Data Engineer with over 4 years of experience in banking, designing and automating data pipelines using Python, PySpark, Airflow, dbt, Docker, SQL, CI/CD with GitHub Actions, and Google Cloud Platform (BigQuery, Cloud Storage, Cloud Composer, Dataproc), ensuring data flows seamlessly from source to insight.

I believe in empowering organizations to achieve more in a competitive data-driven world by building reliable, scalable, and future-ready data systems.

Experience

2025 — Present

Big Data Engineer · Bank Negara Indonesia
  • Engineered end-to-end big data pipelines using IBM DataStage to ingest multi-source data into the Bronze layer, ensuring consistency and reliability across 10M+ records daily.
  • Designed and implemented Silver and Gold layer transformations with PySpark and Hive/Impala QL, boosting data processing performance by 40% and enabling faster downstream analytics.
  • Built scalable datamarts and database models to empower data analysts and scientists in applying business logic seamlessly, cutting query time by 60%.
  • Optimized Python scripts and modularized reusable components to eliminate repetitive logic, reducing maintenance workload by 30%.
  • Deployed, monitored, and tuned Spark workflows on Cloudera Machine Learning (CML), achieving 99.8% job reliability across scheduled data pipelines.
  • Explored and implemented modern data tools (Airflow, dbt, Docker, Google Cloud Storage, Google Cloud Composer, Google BigQuery, GitHub Actions) to prototype modular ELT/ETL workflows and CI/CD-style data testing for analytics automation.
Python SQL (Hive & Impala) PySpark ETL ELT Automation Cloudera Machine Learning IBM DataStage Bash WinSCP MobaXterm Airflow dbt Docker Postgres API Google Cloud Storage Google Cloud Composer Google BigQuery CI/CD GitHub Actions

2023 — 2025

Business Analytics · Bank Negara Indonesia
  • Transformed lead generation and monitoring from a manual, run-the-script-each-time process into a fully automated ETL pipeline with auto-password-protected Excel files, FTP delivery, and email notifications, cutting processing time by 60% and significantly boosting productivity.
  • Designed and implemented ETL workflows for lead delivery to the in-house Digisales channel, including handling massive sales assignments. This automation reduced manual intervention by 70%, enhancing operational efficiency and accuracy.
  • Developed and deployed automated data pipelines using Python and PySpark, efficiently processing over 1 million rows daily to ensure seamless data flow and availability for business-critical operations.
  • Built high-speed fuzzy string matching for big data using RapidFuzz with multiprocessing for parallel execution, improving matching accuracy by 30% and cutting processing time by 50% on datasets of over 2 million records.
  • Developed an automated SQL query retry mechanism using Python to handle query execution failures. This implementation reduced query failure resolution time by 80% and increased workflow reliability by ensuring 100% query execution success without manual intervention.
Python SQL (Hive & Impala) PySpark ETL Automation Selenium Cloudera Machine Learning Tableau Pandas Numpy Fuzzy Matching Seaborn Data Analysis

2021 — 2023

Data Analytics · Bank Negara Indonesia
  • Developed automation bash scripts for seamless data transfer between big data ecosystems and FTP servers, resulting in a 50% reduction in manual data handling time, improving accessibility and efficiency.
  • Leveraged big data tools like Hadoop, Impala, Hive, Spark, and Python to efficiently process data, reducing processing time by 30% and enabling faster insights and decision-making.
  • Orchestrated the preparation and management of datamarts for data scientists, using complex SQL queries and Cloudera Data Science Workbench.
  • Maintained documentation of scripts and workflows to support ongoing optimization and knowledge transfer.
Python SQL (Hive & Impala) PySpark ETL Bash Automation Cloudera Data Science Workbench Pandas Numpy Fuzzy Matching RapidFuzz Multiprocess Scikit-Learn Tableau

2020

Data Science & Machine Learning · Purwadhika Digital Technology School

Transitioned into the data and programming world with excitement, and discovered my strength in embracing challenges and learning. I found my passion and uniqueness there, and it still drives me today.

Python Data Analysis SQL Machine Learning

2017 — 2020

Inspector · Jaya Construction Management

Before transitioning to tech, I leveraged my civil engineering background in a construction management role. I gained experience in project management, leadership, and coordinating contractors while overseeing timeline, cost, and quality control. This role emphasized the value of organization, teamwork, and delivering results efficiently.

Project Management Leadership Timeline Control Cost Control Quality Control Project Coordination

Projects

Sales Summary ELT Pipeline using Composer, dbt, Cloud Storage, BigQuery, and CI/CD with GitHub Actions

This project is an end-to-end E-Commerce ELT pipeline (including CI/CD) built on Google Cloud Composer and powered by Airflow 2.x. It is designed to mimic a production-ready workflow while remaining fully runnable locally, for anyone who wants to understand how a modern ELT pipeline works end-to-end. It extracts from three different data sources: an API, Postgres, and Google Cloud Storage.

Google Cloud Composer dbt SQL Google Cloud Platform Google Cloud Storage Google BigQuery CI/CD GitHub Actions Postgres API
Scale-up Banking Loan Risk ELT Pipeline using Airflow, PySpark, dbt, BigQuery, and Cloud Storage

This project is my take on building an end-to-end big data pipeline for a banking loan system from raw data to analytics-ready tables. The story starts with a simple question: “How do banks process millions of loan records daily and detect risky borrowers?” To explore that, I built this pipeline that mimics a real-world financial data flow.

Airflow Docker PySpark dbt SQL Google Cloud Platform Google Cloud Storage Google BigQuery Big Data
Massive Lead Assignment to Sales ETL using Pandas, PySpark, Hive, and Cloudera Machine Learning

Manually managing lead distribution is time-consuming and inefficient, especially when dealing with unevenly assigned leads across regions and branches. This labor-intensive process is prone to errors and requires constant supervision. This project automates lead assignment using a round-robin process, dramatically reducing the need for manual intervention.
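As a minimal stdlib sketch of the round-robin idea (illustrative names only, not the production PySpark/Hive code, which also partitions by region and branch):

```python
from itertools import cycle

def assign_leads_round_robin(leads, sales_reps):
    """Distribute leads evenly across sales reps in round-robin order.

    A simplified stand-in for the real pipeline: cycle through the
    reps so each new lead goes to the next rep in turn.
    """
    rep_cycle = cycle(sales_reps)
    return [(lead, next(rep_cycle)) for lead in leads]

# Five leads spread across two reps: each rep gets leads alternately.
assignments = assign_leads_round_robin(
    ["L1", "L2", "L3", "L4", "L5"], ["alice", "bob"]
)
```

The same cycle logic scales to millions of rows when expressed as a row-number-modulo-reps assignment in Spark SQL.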

Python ETL PySpark Pandas Impala Hive Cloudera Machine Learning
Automated Data Processing with Encryption, Send to FTP, and Notify via Email

This script automates the process of encrypting an Excel file, uploading it to an FTP server, and notifying recipients via email. It is designed to simplify repetitive tasks, improve data security, and ensure efficient communication.
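A hedged sketch of the upload-and-notify steps using only the standard library (hostnames and addresses below are hypothetical; the Excel password-protection step relies on PyWin32's Excel COM API on Windows and is omitted here):

```python
from email.message import EmailMessage
from ftplib import FTP

def build_notification(sender, recipients, filename):
    """Compose the delivery notification email for the recipients."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    msg["Subject"] = f"Report delivered: {filename}"
    msg.set_content(
        f"{filename} has been encrypted and uploaded to the FTP server."
    )
    return msg

def upload_to_ftp(host, user, password, local_path, remote_name):
    """Upload a local file to the FTP server (needs a reachable host)."""
    with FTP(host) as ftp:
        ftp.login(user, password)
        with open(local_path, "rb") as f:
            ftp.storbinary(f"STOR {remote_name}", f)

# Build the notification; sending it would use smtplib.SMTP.send_message.
msg = build_notification("etl@bank.example", ["ops@bank.example"], "leads.xlsx")
```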

Python Pandas FTPlib PyWin32 Automation
High-Speed Fuzzy String Matching for Big Data

String matching in massive datasets can be painfully slow and inefficient when using traditional methods. This project addresses the problem by leveraging RapidFuzz for fuzzy matching, Multiprocess for parallelized processing, and PySpark for handling big data seamlessly. The result is a high-speed solution designed to efficiently process and match millions of records.
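The core matching step can be sketched with the standard library alone, `difflib.SequenceMatcher` standing in for RapidFuzz's much faster C-backed `fuzz.ratio`, and a thread pool standing in for the process pool used in production for CPU-bound work:

```python
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher

def best_match(name, candidates, threshold=0.8):
    """Return the closest candidate above the threshold, else None.

    Swap SequenceMatcher for rapidfuzz.fuzz.ratio (and the thread pool
    for a process pool) to reach production speed on millions of rows.
    """
    best = max(candidates, key=lambda c: SequenceMatcher(None, name, c).ratio())
    score = SequenceMatcher(None, name, best).ratio()
    return best if score >= threshold else None

names = ["Jon Smith", "Acme Corpp", "Unknown Zz"]
reference = ["John Smith", "Acme Corp", "Beta LLC"]

# Score each dirty name against the reference list in parallel.
with ThreadPoolExecutor() as pool:
    matches = list(pool.map(lambda n: best_match(n, reference), names))
```

Names with no sufficiently close candidate come back as `None`, which keeps false matches out of downstream joins.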

Python PySpark SQL RapidFuzz Multiprocess Pandas Fuzzy Matching
Automatic CDSW File Downloader Using Selenium

Imagine manually downloading over 10 monitoring reports weekly from Cloudera Data Science Workbench (CDSW/CML)—a time-consuming and repetitive task. Navigating through project directories and manually clicking download is inefficient and error-prone. Automating this process would save significant time and reduce errors.

Python Selenium ETL HTML Cloudera Machine Learning Cloudera Data Science Workbench Automation
Auto-retry SQL Query Execution

When running SQL queries through Hue for Impala or Hive, encountering errors is frustrating. Each time a query failed, I had to manually hit the 'run' button or press Ctrl + Enter again, which is not only time-consuming but also mentally exhausting. Automating this process is far more efficient, ensuring queries retry automatically and giving me peace of mind that execution will succeed without constant supervision.
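The retry wrapper boils down to a loop with exponential backoff; a minimal sketch (the `execute` callable is hypothetical, standing in for an impyla/pyhive cursor wrapper):

```python
import time

def run_with_retry(execute, query, max_retries=5, backoff_s=2.0):
    """Retry a query callable until it succeeds or retries run out.

    Transient failures are retried with exponential backoff; the last
    failure is re-raised so permanent errors still surface.
    """
    for attempt in range(1, max_retries + 1):
        try:
            return execute(query)
        except Exception:
            if attempt == max_retries:
                raise
            time.sleep(backoff_s * 2 ** (attempt - 1))

# Demo with a flaky executor that fails twice, then succeeds.
calls = {"n": 0}
def flaky_execute(query):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient Impala error")
    return f"ok: {query}"

result = run_with_retry(flaky_execute, "SELECT 1", backoff_s=0.01)
```

Backoff matters here: hammering a busy Impala coordinator with instant retries tends to make transient errors worse.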

Python SQL Automation
Optimize LendingClub's Profit with Analysis & Machine Learning

Investors earn money from loan interest, while LendingClub (LC) earns revenue by charging borrowers an origination fee and investors a service fee. However, investors still face risks such as credit and liquidity risk: if investors don't receive their interest returns, LC earns no service fees. The project aims to identify the characteristics of borrowers who stop repaying and to optimize LendingClub's profit.

Python Flask Pandas Numpy Matplotlib Seaborn Scikit-Learn Data Analysis Machine Learning HTML CSS
End-to-End Customer Segmentation and Multi-Product Leads SQL Pipeline

The bank's marketing team needs an efficient way to target high-value customers with personalized financial products. The manual process of gathering customer data is time-consuming and prone to errors. This project built a SQL-based pipeline for customer segmentation and generating multi-product offerings, streamlining the identification of high-value customers and providing them with personalized product recommendations.
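The segmentation step can be illustrated with an in-memory SQLite stand-in for the Hive/Impala warehouse (schema, thresholds, and column names here are hypothetical, not the bank's actual logic):

```python
import sqlite3

# In-memory stand-in for the warehouse; schema is illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, balance REAL, products INTEGER)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, 250000.0, 1), (2, 5000.0, 3), (3, 900000.0, 2)],
)

# Segment customers and flag cross-sell leads: high-balance customers
# holding fewer products are candidates for additional offerings.
rows = conn.execute(
    """
    SELECT id,
           CASE WHEN balance >= 100000 THEN 'high_value' ELSE 'mass' END AS segment,
           CASE WHEN balance >= 100000 AND products < 3 THEN 1 ELSE 0 END AS lead_flag
    FROM customers
    ORDER BY id
    """
).fetchall()
```

In the real pipeline the same CASE-based segmentation runs as Hive/Impala SQL over the full customer base, feeding the lead tables directly.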

SQL (Hive & Impala)