Data Streams & Ingestion Pipeline

In a high-stakes project for a GOV.UK public-sector organisation, my team and I were tasked with migrating critical data from legacy source systems, such as Oracle and Microsoft Access, to modern, scalable targets such as PostgreSQL/Amazon Aurora and Amazon Neptune.

The objective was to ensure a seamless transition with minimal downtime while maintaining data integrity and data quality, and meeting GOV.UK compliance standards.

An additional business requirement was to process and transform data in real time to support the organisation's analytical and operational needs at the UK border.

Key Challenges Solved

  • Volume of data (in-country and external)
  • Switching to Skyscape or AWS cloud
  • Data loss prevention mechanism
  • Data replay mechanism
  • Data quality checks & data exception handling (a sketch of both follows this list)
  • Environment management and provisioning
  • Real-time data processing and transformation
  • Test coverage of end-to-end data pipeline
  • PII data compliance
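
To give a flavour of how the data quality, data loss and replay challenges were approached: records that fail validation are diverted to a dead-letter topic so they can be corrected and re-published later, rather than being dropped or blocking the stream. The snippet below is a minimal sketch of that pattern, assuming the kafka-python client; the broker address, topic names and validation rule are illustrative placeholders, not the production values.

    import json

    from kafka import KafkaConsumer, KafkaProducer

    BOOTSTRAP = "localhost:9092"             # assumption: illustrative broker address
    SOURCE_TOPIC = "ingest.records"          # assumption: illustrative topic names
    DEAD_LETTER_TOPIC = "ingest.records.dlq"

    consumer = KafkaConsumer(
        SOURCE_TOPIC,
        bootstrap_servers=BOOTSTRAP,
        group_id="quality-check",
        value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    )
    producer = KafkaProducer(
        bootstrap_servers=BOOTSTRAP,
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    def is_valid(record: dict) -> bool:
        # Illustrative quality rule: required fields must be present and non-empty.
        return all(record.get(field) for field in ("id", "source_system", "payload"))

    def process(record: dict) -> None:
        # Placeholder for the downstream transformation/load step.
        print(f"processed {record['id']}")

    for message in consumer:
        record = message.value
        if is_valid(record):
            process(record)
        else:
            # Park the bad record on a dead-letter topic so it can be corrected
            # and replayed without blocking the main stream.
            producer.send(DEAD_LETTER_TOPIC, {"offset": message.offset, "original": record})
            producer.flush()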

My Key Contributions

To address these critical business requirements, I collaborated with a multidisciplinary team of architects, DevOps engineers, BAs and SDETs to implement an initial MVP data pipeline.

As we refined the pipeline, we tackled additional challenges and requirements, ultimately deploying a reliable product into production.

  • Leveraged Kafka for real-time data streaming, orchestrating a continuous data flow from Oracle and MS Access to PostgreSQL and on to Neptune (a streaming sketch follows this list).
  • Employed comprehensive unit and end-to-end integration testing strategies to anticipate and mitigate potential data and functional issues (a test sketch also follows).
  • Employed comprehensive NFR testing strategies to anticipate and mitigate potential load and reliability issues, ensuring smooth operation.
  • Created Cookiecutter developer and tester templates for Python multi-module projects to minimize human error and save expensive build hours (sketched below).
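
The streaming bullet above can be pictured with a small loader: legacy records already published to Kafka are consumed and upserted into a PostgreSQL staging table, from which the Neptune graph is populated downstream. This is a simplified sketch assuming kafka-python and psycopg2; the topic, DSN, schema and table names are illustrative assumptions, not the production values.

    import json

    import psycopg2
    from kafka import KafkaConsumer

    # Assumption: an extraction job reading Oracle / MS Access has already
    # published change records onto this topic.
    consumer = KafkaConsumer(
        "legacy.customer.records",             # illustrative topic name
        bootstrap_servers="localhost:9092",    # illustrative broker
        group_id="postgres-loader",
        value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    )

    conn = psycopg2.connect("dbname=ingest user=loader")   # illustrative DSN
    cur = conn.cursor()

    for message in consumer:
        record = message.value
        # Upsert into a PostgreSQL staging table; loading into Neptune
        # happens further downstream from this table.
        cur.execute(
            """
            INSERT INTO staging.customers (id, payload)
            VALUES (%s, %s)
            ON CONFLICT (id) DO UPDATE SET payload = EXCLUDED.payload
            """,
            (record["id"], json.dumps(record)),
        )
        conn.commit()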
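
For the end-to-end testing bullet, a representative check publishes a record at the front of the pipeline and asserts that it arrives in the target store. This sketch assumes hypothetical pytest fixtures (kafka_bootstrap, pg_connection) wired to disposable Kafka and PostgreSQL instances; the topic and table names match the loader sketch above and are equally illustrative.

    import json
    import time
    import uuid

    from kafka import KafkaProducer

    def test_record_reaches_postgres(kafka_bootstrap, pg_connection):
        # kafka_bootstrap and pg_connection are hypothetical pytest fixtures
        # pointing at throwaway Kafka and PostgreSQL containers.
        record_id = str(uuid.uuid4())
        producer = KafkaProducer(
            bootstrap_servers=kafka_bootstrap,
            value_serializer=lambda v: json.dumps(v).encode("utf-8"),
        )
        producer.send("legacy.customer.records", {"id": record_id, "payload": {"name": "test"}})
        producer.flush()

        # Poll the staging table until the loader has picked the record up.
        cur = pg_connection.cursor()
        row = None
        for _ in range(30):                    # wait up to ~30 seconds
            cur.execute("SELECT payload FROM staging.customers WHERE id = %s", (record_id,))
            row = cur.fetchone()
            if row:
                break
            time.sleep(1)
        assert row is not None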
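
And for the template bullet, new multi-module services were stamped out from a shared Cookiecutter template rather than assembled by hand. Below is a minimal sketch of driving that from Python; the template URL and context keys are hypothetical.

    from cookiecutter.main import cookiecutter

    # Hypothetical template location and context; the real templates carried
    # pre-wired packaging, linting, test and CI configuration.
    cookiecutter(
        "https://bitbucket.org/example/python-multi-module-template.git",
        no_input=True,
        extra_context={
            "project_name": "border-data-loader",
            "modules": "ingest,transform,load",
        },
    )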

Tech & Operational Documentation

  • Created Confluence runbook and work-instruction pages for the Operations team
  • Created Confluence live-support instruction pages for the Operations and Tech-support teams
  • Created API and service contracts for external dependencies, aimed at low-tech teams
  • Created Business Spec versioning guidelines for external dependencies, aimed at low-tech teams
  • Created Swagger/OpenAPI documentation versioning guidelines for developers and testers

Tech Stack Used

+ DATA STREAMING:
  Kafka
 
+ CONTAINERISATION:
  Docker, K8s

+ DATABASES:
  Postgres, Oracle, Neptune, Aurora

+ LANGUAGES:
  Python, Java

+ INFRA:
  AWS

+ CI/CD:
  Jenkins

+ SCM:
  Git, Bitbucket
 
+ SCRIPTS:
  Terraform, CloudFormation, Bash
 
+ VISUALIZATION:
  Grafana, Prometheus