About this role
Role Overview & Key Responsibilities
• Data Pipeline Operations & On-Call : Own on-call rotation for ingestion pipelines (Kafka, AWS Glue); triage and resolve pipeline failures, schema mismatches, and throughput degradation; author RCAs.
• Data Quality Monitoring : Implement and maintain data quality checks across Bronze->Silver->Gold lakehouse layers (S3->Kafka->Snowflake/Redshift); alert on anomalies, missing data, or drift.
• ML Model Health & MLOps : Monitor deployed models for accuracy degradation, data drift, and concept drift; manage model redeployment workflows; maintain ML experiment tracking.
• AI Platform Reliability (Bedrock + LangChain) : Monitor AWS Bedrock inference latency, token usage, error rates, and cost; operate LangChain agent pipelines; use Langfuse for AI evaluation and observability.
• DORA Metrics - Data & AI Lens : Track deployment and release health for data pipeline and model updates; measure lead time for data model changes; monitor pipeline reliability as a DORA proxy.
• Schema & Contract Management : Monitor AWS Glue Schema Registry for schema evolution events; validate Avro contract compliance for new producer payloads; coordinate schema changes with module teams.
• Snowflake / Redshift Operations : Manage query performance, warehouse sizing, cost controls, and data retention policies; monitor Gold-layer data freshness and SLA compliance.
• Incident Escalation : Serve as first-line triage for all data and AI incidents; escalate to core data/ML engineers only when root cause requires architectural changes or new feature work.
Required Skills & Experience
Data Engineering (Strong)
• 4+ years of data engineering experience with production-grade pipelines
• Proficient with Apache Kafka: consumer groups, topic management, lag monitoring, DLQ handling
• Experience with AWS Glue, AWS Glue Schema Registry, and Avro/Parquet data formats
• Hands-on with Snowflake or Redshift: query optimization, cost management, RBAC
• Familiarity with lakehouse patterns: Bronze/Silver/Gold (S3-based) data architecture
ML/AI Operations (Core Competency)
• Experience with MLOps practices: model versioning, drift detection, retraining pipelines
• Familiarity with AWS Bedrock, SageMaker, or equivalent managed ML inference platforms
• Working knowledge of LangChain or LlamaIndex for LLM application pipelines
• Experience with AI/LLM observability tools (Langfuse, LangSmith, or equivalent)
• Understanding of RAG (Retrieval-Augmented Generation) architectures and vector stores
Operational Excellence (Core Competency)
• Experience applying DORA metrics to data and ML delivery pipelines
• On-call experience for data infrastructure; structured incident management and RCA
• Data quality framework implementation: Great Expectations, dbt tests, or custom checks
• Experience with monitoring and alerting for streaming pipelines (Kafka lag, throughput)
Backend & AWS Exposure
• Python proficiency: scripting, pipeline development, data transformation
• AWS services: S3, Lambda, Glue, Bedrock, CloudWatch, SQS/SNS, IAM
• Familiarity with containerized workloads on Kubernetes (EKS)
• Experience with dbt or similar data transformation frameworks is a plus
Nice to Have
• Exposure to ontology or knowledge graph systems (RDF, OWL, or property graphs)
• Familiarity with Temporal for workflow orchestration of ML pipelines
• Experience with multi-tenant data platforms and row-level security patterns
• Understanding of GDPR-compliant data handling and encryption key management