Career Overview
Senior Data Engineer with 11+ years of hands-on experience building end-to-end ETL and lakehouse pipelines across AWS, Azure, and GCP for healthcare, telecom, and banking clients. Specializes in PySpark, Kafka, and Delta Lake for petabyte-scale workloads under strict compliance requirements, with a track record of cutting query runtimes by over 60%. Experienced in AI/ML integration, real-time streaming, and fraud detection, combining Microsoft Fabric and GenAI-powered workflows with cloud-native tooling to build scalable data platforms that turn raw data into reliable, decision-ready systems.
Technical Expertise
Big Data Ecosystem
Hive, Apache Spark, PySpark, Spark SQL, Spark Streaming, Structured Streaming, Kafka, Kafka Streams, Confluent Kafka, Kafka Connect, NiFi, Sqoop, Flume, MapReduce, HDFS, YARN, Zookeeper, Apache Beam, Apache Druid, Apache Flink, Impala, HBase, Ambari, Airflow, Airflow DAGs, Oozie, Cloud Composer.
ETL and Data Integration
AWS Glue, Azure Data Factory, Informatica PowerCenter, IICS, IDMC, Talend, SSIS, Oracle Data Integrator, Semarchy xDM, Reltio, MDM, CDC, ELT, Incremental Loading, Job Scheduling, SLAs, Airbyte, Soda, dbt.
Lakehouse and Query Engines
Delta Lake, Apache Iceberg, Trino, Presto, Athena.
Programming Languages
Python, SQL, PySpark, Scala, Unix, T-SQL, PL/SQL, Java, Spring Boot.
Cloud Environment - AWS
EMR, S3, Glue, Redshift, Redshift Spectrum, Lambda, Athena, EC2, RDS, DynamoDB, Kinesis Data Streams, Kinesis Firehose, EventBridge, SQS, VPC, IAM, CloudWatch, Step Functions.
Cloud Environment - Azure
Azure Databricks, ADLS Gen2, Data Lake, Blob Storage, Azure SQL, Cosmos DB, HDInsight, Azure Synapse Analytics, Azure Functions, Event Hubs, Azure Monitor, Log Analytics, RBAC, Managed Identities, Microsoft Fabric, Microsoft Purview.
Cloud Environment - GCP
BigQuery, Dataproc, Cloud Storage, GKE, Cloud Functions, Spanner, Pub/Sub, Dataflow, Bigtable, Cloud Composer, Cloud Monitoring, Vertex AI, Cloud AI Platform, KMS, IAM.
Databases and Tools
SQL Server, MySQL, PostgreSQL, Oracle, Teradata, MongoDB, DynamoDB, Cassandra, Cosmos DB, Erwin, Palantir Foundry.
Reporting and BI Tools
Power BI, Tableau, Looker Studio, Google Data Studio, OBIEE, Microsoft Fabric.
Python Libraries and ML
NumPy, Pandas, Scikit-Learn, Matplotlib, TensorFlow, PyTorch, PySpark ML, Spark MLlib, BigQuery ML.
GenAI and AI/ML
RAG, LangChain, LlamaIndex, LangGraph, Agentic AI, Copilot, LLMs, Prompt Engineering, Microsoft Copilot Studio, FAISS, Pinecone, Vector DB, Azure OpenAI, OpenAI, Amazon Bedrock, Azure AI Search, Amazon OpenSearch, Amazon SageMaker, Azure AI, Microsoft Foundry, AI Agents, Bedrock Agents, Multi Agent Orchestration.
Automation
Microsoft Power Platform, Power Automate, Power Apps, Power Pages, Dataverse, Custom Connectors.
Containerization and DevOps
Kubernetes, Docker, Jenkins, GitHub Actions, Azure DevOps, IaC, Terraform, Pulumi, AWS CDK.
Data Observability and Monitoring
Prometheus, Grafana, Datadog, Splunk, ELK Stack (Elasticsearch, Logstash, Kibana), OpenTelemetry, CloudWatch, Azure Monitor, Log Analytics, GCP Cloud Monitoring, PagerDuty.
Data Quality, Catalog, and Governance
Great Expectations, Monte Carlo, Bigeye, Soda, Apache Atlas, Apache Ranger, Microsoft Purview, Collibra, Alation, DataHub, Amundsen, Data Lineage, Data Contracts.
Software Life Cycle/Methodologies
Agile Models, Waterfall, SDLC, CI/CD, Infrastructure-as-Code, AWS CDK, GitHub Actions, Azure DevOps, Audit Trails, Data Lineage, Data Governance, Row-level Security, Column-level Encryption.
Professional Background
JOHNSON & JOHNSON | SR. BIG DATA ENGINEER
New Brunswick, NJ, US → Remote
Summary
Built and scaled enterprise data platforms across Johnson & Johnson's healthcare, pharmaceutical, and retail operations, delivering lakehouse architectures, real-time pipelines, and GenAI-powered solutions while maintaining strict regulatory compliance standards.
Highlights
Built and managed bronze-silver-gold lakehouse platforms using Delta Lake and Microsoft Fabric, enforcing ACID guarantees, schema governance, and zone-based data quality across 15M daily patient records.
Designed and optimized multi-source ETL and ELT pipelines handling batch, incremental, and CDC-based data movement with watermarking, schema drift handling, and dependency scheduling at scale.
Integrated EHR systems using FHIR and HL7 standards, building ingestion pipelines with Apache Kafka and Apache NiFi to consolidate patient records across hospital endpoints and reduce physician data retrieval time by 40%.
Implemented HIPAA-compliant data security including patient de-identification, column-level encryption, and role-based access control across clinical data platforms.
Improved analytical query performance by over 45% through Delta Lake optimization techniques including Z-ordering, partition pruning, and automated table maintenance orchestrated via Airflow.
Built GenAI-powered data assistants using LLMs and RAG architectures, reducing report generation time from days to hours for clinical and business stakeholders.
Deployed Agentic AI workflows using LangGraph and Bedrock Agents to automate complex multi-step clinical data enrichment tasks, substantially reducing manual curation effort in each workflow cycle.
Delivered end-to-end retail analytics consolidating Salesforce, Shopify, and SAP data using PySpark across cloud environments, driving 20% better inventory distribution and 30% improvement in demand forecasting accuracy.
Built clinical and operational BI dashboards in Power BI with row-level security, connecting gold-layer lakehouse tables to support high-concurrency reporting across regulated cloud environments.
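The HIPAA de-identification work above relies on a standard pseudonymization pattern: replace identifiers with keyed hashes so records stay joinable but cannot be reversed without the key. A minimal pure-Python sketch (field names, the sample record, and the key handling are illustrative, not the actual J&J implementation):

```python
import hmac
import hashlib

def pseudonymize(value: str, secret_key: bytes) -> str:
    """Deterministic keyed hashing (HMAC-SHA256): the same input always
    maps to the same token, so cross-table joins still work, but the
    mapping cannot be reversed without the key."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()

def deidentify_record(record: dict, pii_fields: set, secret_key: bytes) -> dict:
    """Replace PII columns with tokens; pass other columns through unchanged."""
    return {
        k: pseudonymize(str(v), secret_key) if k in pii_fields else v
        for k, v in record.items()
    }

key = b"rotate-me-via-a-secrets-manager"  # illustrative; never hard-code in production
rec = {"patient_id": "P-1001", "ssn": "123-45-6789", "lab_value": 7.2}
clean = deidentify_record(rec, {"patient_id", "ssn"}, key)
```

In a real pipeline the same function would be applied as a Spark UDF over the bronze layer, with the key held in a secrets manager and rotated per policy.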
AT&T | SR. DATA ENGINEER
Dallas, TX, US
Summary
Designed and optimized terabyte-scale data ingestion and processing pipelines for CDRs and network usage logs, enhancing real-time analytics and reducing compute costs for telecom operations.
Highlights
Designed Azure Data Factory pipelines for CDRs and network logs into Azure Data Lake Storage, applying watermarking and dependency scheduling for data freshness.
Built PySpark jobs on Azure Databricks to aggregate CDR data into call success rates and data consumption metrics, using broadcast joins to meet strict SLAs.
Implemented event-driven ingestion with Apache Kafka, Azure Event Hubs, and Google Pub/Sub for high-frequency network signaling events, tuning partitioning and consumer groups for throughput.
Utilized Apache HBase for near real-time network metric storage and integrated Apache Druid for fast OLAP queries across 500 billion telecom events monthly.
Optimized BigQuery analytical datasets through partitioning, clustering, and scheduled refreshes, reducing dashboard load times via materialized views for multi-terabyte workloads.
Orchestrated GCP data workflows using Cloud Composer with Airflow DAGs, coordinating Pub/Sub ingestion, Dataflow streaming jobs, and BigQuery loads with retry logic and SLA sensors.
Implemented CI/CD pipelines using GitHub Actions and Terraform to automate deployment of Azure/GCP infrastructure and Spark jobs, applying Pulumi for infrastructure definition.
Built centralized monitoring using Azure Monitor, Google Cloud Monitoring, and Grafana to track pipeline latency and Spark job performance, integrating Prometheus for tuning.
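The watermarking used for data freshness in the ADF pipelines above follows a common incremental-load pattern: each run extracts only the (start, end] window since the last persisted watermark, holding the upper bound back to tolerate late-arriving events. A minimal sketch (the 15-minute lateness allowance is an illustrative assumption):

```python
from datetime import datetime, timedelta

def next_extraction_window(last_watermark: datetime,
                           now: datetime,
                           lateness: timedelta = timedelta(minutes=15)):
    """Compute the (start, end] window for an incremental CDR load.

    `lateness` holds the upper bound back so late-arriving events are not
    skipped; the returned `end` becomes the watermark persisted for the
    next run.
    """
    end = now - lateness
    if end <= last_watermark:
        return None  # nothing safely extractable yet
    return last_watermark, end

wm = datetime(2024, 1, 1, 0, 0)
window = next_extraction_window(wm, datetime(2024, 1, 1, 6, 0))
```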
BANK OF AMERICA | DATA ENGINEER
Charlotte, NC, US
Summary
Engineered real-time fraud detection and batch ingestion pipelines for credit card and mobile banking transactions, significantly reducing manual investigation workload and improving accuracy.
Highlights
Built real-time transaction ingestion pipelines using Apache Kafka and Confluent Kafka, configuring partitioning for fault-tolerant delivery under peak transaction volumes.
Designed Apache Flink streaming jobs with Flink SQL and CEP patterns to detect complex fraud scenarios, enabling sub-second suspicious transaction flagging.
Implemented dynamic risk scoring combining rule-based fraud signals with ML model outputs managed by MLflow, achieving 94% accuracy in identifying suspicious transactions.
Built batch ingestion pipelines using Google Cloud Dataflow and Apache Beam to move 5+ years of mainframe transaction records into Google Cloud Storage, sustaining over 10,000 messages per second.
Optimized BigQuery datasets through partitioning and clustering, reducing fraud trend query runtimes from hours to minutes for risk reporting.
Developed Spring Boot RESTful fraud alert APIs with OAuth 2.0 authentication, improving API response times by 60% for external banking systems.
Implemented GCP IAM access controls with fine-grained dataset-level permissions across BigQuery, Cloud Storage, and Cloud SQL for financial regulatory compliance.
Reduced manual fraud investigation workload by over 40% by integrating Kafka, Flink, Redis, HBase, MLflow, and Tableau into a unified fraud operations platform.
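The dynamic risk scoring above blends rule-based signals with a model probability. A minimal pure-Python sketch of that blending step — the specific rules, weights, and field names are invented for illustration, not actual fraud logic:

```python
def risk_score(txn: dict, model_score: float, rules=None) -> float:
    """Blend rule-based fraud signals with an ML model probability.

    Each rule inspects the transaction and returns a weight in [0, 1];
    the combined score is capped at 1.0. Weights here are illustrative.
    """
    if rules is None:
        rules = [
            lambda t: 0.3 if t["amount"] > 10_000 else 0.0,                # large amount
            lambda t: 0.4 if t["country"] != t["home_country"] else 0.0,   # geo mismatch
            lambda t: 0.2 if t["merchant_category"] == "gambling" else 0.0,
        ]
    rule_score = sum(rule(txn) for rule in rules)
    return min(1.0, 0.6 * model_score + 0.4 * rule_score)

txn = {"amount": 12_500, "country": "RO", "home_country": "US",
       "merchant_category": "electronics"}
score = risk_score(txn, model_score=0.8)
```

In a streaming setting this function would sit inside the Flink job, with the model score fetched from a served MLflow model and the result compared against an alerting threshold.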
PAYCHEX | DATA ENGINEER
Rochester, NY, US
Summary
Led the migration and standardization of payroll data for over 650,000 clients, significantly improving processing efficiency and compliance through robust ETL pipelines and data quality frameworks.
Highlights
Built centralized payroll data migration pipelines using Talend ETL and Azure Data Factory, supporting full historical and incremental loads for over 650,000 client businesses.
Processed and standardized payroll data with Python and Azure Databricks (PySpark), applying Delta Lake merge operations to reconcile migrated calculations.
Re-architected ETL orchestration by migrating dependency logic to Apache Airflow DAGs on Azure, reducing failed payroll runs by 78% during peak cycles.
Optimized Azure SQL Database for curated payroll datasets, improving average query execution time from 3.2 seconds to under 0.4 seconds.
Implemented SOX and PCI-compliant security controls using column-level encryption and row-level security across ADLS, Azure SQL Database, and Azure Databricks with Azure Key Vault.
Developed ML models on Azure Databricks (PySpark ML, Scikit-Learn) to improve payroll cash flow forecasting, achieving 89% precision 30 days ahead.
Applied Apache Spark batch processing in Azure Databricks to migrate quarterly tax filing workloads with ACID-compliant Delta Lake operations, reducing end-to-end tax processing time from 18 hours to 4.5 hours.
Executed phased migration strategies, verifying 100% record accuracy against source systems with 0 disruptions during critical tax filing deadlines.
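The record-accuracy verification above amounts to a reconciliation pass: compare migrated amounts against source by key, flag differences beyond a rounding tolerance, and surface rows missing on either side. A minimal sketch with hypothetical field names:

```python
def reconcile(source_rows, migrated_rows, key="employee_id", tol=0.005):
    """Compare migrated payroll amounts against source records.

    Returns (mismatched, missing): keys whose net pay differs by more than
    `tol`, and keys present on only one side. Field names are illustrative.
    """
    src = {r[key]: r["net_pay"] for r in source_rows}
    dst = {r[key]: r["net_pay"] for r in migrated_rows}
    mismatched = [k for k in src.keys() & dst.keys() if abs(src[k] - dst[k]) > tol]
    missing = list(src.keys() ^ dst.keys())  # symmetric difference
    return sorted(mismatched), sorted(missing)

source = [{"employee_id": "E1", "net_pay": 1000.00},
          {"employee_id": "E2", "net_pay": 2000.00},
          {"employee_id": "E3", "net_pay": 1500.00}]
migrated = [{"employee_id": "E1", "net_pay": 1000.00},
            {"employee_id": "E2", "net_pay": 2000.50}]
mismatched, missing = reconcile(source, migrated)
```

At migration scale the same comparison would be expressed as a PySpark anti-join plus a tolerance filter rather than in-memory dicts.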
RAMCO SYSTEMS | ETL DEVELOPER
Chennai, Tamil Nadu, India
Summary
Developed and optimized ETL pipelines for aviation maintenance ERP systems, integrating diverse data sources and reducing nightly processing windows by over 65%.
Highlights
Built ETL pipelines using Oracle Data Integrator, Informatica PowerCenter, and Talend Open Studio, integrating data from Oracle, MySQL, MongoDB, and Amazon S3.
Set up Oracle Database storage architecture alongside Amazon RDS, Amazon S3, and Redshift, applying partitioning and S3 tiering to improve query performance.
Used Informatica PowerCenter to extract data from MySQL, Oracle, and MongoDB, staging output into Amazon S3 while preserving referential integrity across 12 ERP systems.
Automated ETL job scheduling using Oozie workflow definitions and cron jobs in Oracle Enterprise Manager, ensuring correct execution order for interdependent processes.
Configured Amazon CloudWatch alerts for Talend jobs processing CSV/XML files from on-premise systems and Amazon S3, enabling rapid incident response.
Used Informatica to cleanse, enrich, and validate enterprise asset management datasets before loading into Amazon Redshift, integrating AWS Glue crawlers for schema discovery.
Built Oracle Business Intelligence reports from Oracle data warehouses and Amazon Redshift data marts using star schemas, designing role-based access controls.
Monitored pipeline execution through CloudWatch for real-time manufacturing performance and supply chain KPI analysis, maintaining audit trails.
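The "correct execution order for interdependent processes" guaranteed by the Oozie and cron scheduling above reduces to topological ordering of a job dependency graph. A minimal sketch using Python's stdlib `graphlib` with hypothetical job names:

```python
from graphlib import TopologicalSorter

# Illustrative dependency graph: each job maps to the jobs it depends on.
deps = {
    "load_warehouse": {"stage_orders", "stage_inventory"},
    "stage_orders": {"extract_oracle"},
    "stage_inventory": {"extract_mysql"},
    "refresh_reports": {"load_warehouse"},
}

# static_order() yields every job only after all of its dependencies,
# and raises CycleError if the graph contains a circular dependency.
order = list(TopologicalSorter(deps).static_order())
```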
Educational Qualifications
Jawaharlal Nehru Technological University Hyderabad (JNTUH)
Bachelor of Technology – B.Tech
Computer Science
