
How Data Engineering Leader Sumit Tewari Is Advancing Data Quality Frameworks for the AI-Powered Enterprise

Sumit Tewari is a data engineering and governance leader at the world’s largest retailer, based in Frisco, Texas. Drawing on more than a decade of specialized experience designing scalable data platforms and frameworks for global enterprises, his work spans data architecture, quality automation, and business strategy, helping organizations move from static data systems to dynamic, analytics-ready ecosystems.

Tewari leads cross-functional, global engineering teams responsible for developing resilient, enterprise-grade data flows and optimizing ETL architecture, analytics solutions, and disaster recovery frameworks. His role extends across technical, financial, and leadership domains, overseeing global HR data infrastructure budgets, driving proofs of concept for emerging technologies, and mentoring the next generation of data engineers. Holding a Master’s in Computer Applications from Jawaharlal Nehru National College of Engineering (JNNCE) in India, Tewari previously served as Vice President of Software Engineering for one of the world’s top five financial institutions. Today, he is helping advance his company’s Sustainability and Human Capital Management initiatives by leading Environmental, Social, and Governance (ESG) and Human Capital Data Lake programs that strengthen both data-driven decision-making and organizational performance.

In this interview with TechBullion, Tewari discusses emerging trends in data quality, how AI and automation are transforming enterprise pipelines, and the strategies organizations can adopt to ensure reliable, actionable data at scale. 

Ellen Warren: Sumit, in 2024, you wrote an article about data quality implementation across the phases of a data pipeline. What has changed in the realms of data quality, pipelines, and governance over the past 12 months?

Sumit Tewari: The shift has been dramatic. We’ve moved from static quality gates to continuous observability. In 2024, most organizations were still treating data quality as checkpoint validation: you’d check data at ingestion, maybe again before consumption, and hope for the best in between. Now in 2025, the paradigm is fundamentally different.

What I’m seeing across enterprises is the adoption of real-time monitoring that tracks data health constantly across every pipeline stage. Problems get caught as they happen, not days later when someone notices bad reports. The major cloud platforms have stepped up significantly. Google Dataplex introduced Auto Data Quality with ML-based anomaly detection that requires minimal setup. Microsoft Purview now provides unified governance with embedded AI capabilities and data health scoring. AWS Glue has advanced substantially with dynamic rule generation and support for modern table formats like Apache Iceberg and Delta Lake.

The other major change is around data contracts. Organizations are finally formalizing agreements between data producers and consumers with versioned specifications. These contracts clarify who’s responsible for what and enable automated validation that catches problems before bad data enters the pipeline. It’s no longer about informal handshakes between teams.
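To make the idea concrete, here is a minimal sketch of what a versioned data contract and its automated validation might look like in Python; the dataset name, fields, and freshness target are illustrative assumptions rather than any specific standard.

```python
# Minimal sketch of a versioned data contract between a producer and its consumers.
# The dataset name, schema, and freshness target are illustrative assumptions.
import pandas as pd

CONTRACT = {
    "name": "hr.employee_headcount",
    "version": "1.2.0",
    "owner": "hr-data-platform",
    "fields": {
        "employee_id": {"dtype": "int64", "nullable": False},
        "department":  {"dtype": "object", "nullable": False},
        "hire_date":   {"dtype": "datetime64[ns]", "nullable": False},
        "salary":      {"dtype": "float64", "nullable": True},
    },
    "freshness_hours": 24,
}

def validate_against_contract(df: pd.DataFrame, contract: dict) -> list[str]:
    """Return a list of contract violations; an empty list means the batch passes."""
    violations = []
    for name, spec in contract["fields"].items():
        if name not in df.columns:
            violations.append(f"missing column: {name}")
            continue
        if str(df[name].dtype) != spec["dtype"]:
            violations.append(f"{name}: expected {spec['dtype']}, got {df[name].dtype}")
        if not spec["nullable"] and df[name].isna().any():
            violations.append(f"{name}: nulls found in non-nullable field")
    return violations
```

In this sort of setup, the producer runs the check at publish time, and a version bump in the contract signals consumers that the schema has changed before bad data ever enters the pipeline.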

EW: AI models and generative tools are now being trained directly on enterprise data. How is this trend reshaping the definition—and urgency—of data quality?

ST: The stakes have never been higher. With traditional reporting, bad data might lead to a single poor business judgment. When AI models are trained on faulty input, that input corrupts the entire model and every subsequent output, not just one decision. The old saying has effectively become “garbage in, garbage stays, and garbage spreads.”

What’s changed is that AI systems amplify quality issues exponentially. A dataset with 95% accuracy might seem acceptable for traditional analytics, but that same 5% error rate can cause an AI model to learn incorrect patterns and make systematically wrong predictions. We’re also dealing with quality dimensions that didn’t matter as much before: semantic consistency, contextual accuracy, and bias detection.

The urgency comes from the fact that organizations are deploying these AI systems at scale without always upgrading their data quality infrastructure first. I’m working with teams who are excited about generative AI capabilities but haven’t established the foundational quality controls needed to ensure those systems are working with reliable data. You can’t bolt on quality as an afterthought when AI is involved.

EW: Many organizations are now moving from rule-based data checks to AI-assisted or anomaly-driven validation. What are the key considerations for teams trying to adopt these next-generation approaches?

ST: The transition requires a shift in perspective. Conventional rule-based validation is straightforward: you define what constitutes “good” and reject anything else. AI-assisted validation works differently. It learns what normal patterns look like and flags deviations, which means you’re dealing with probabilities rather than binary pass/fail outcomes.

The first consideration is establishing baseline patterns. Your AI models need enough historical data to understand what’s normal for your specific environment. You can’t just deploy these tools and expect immediate results. Second, you need to calibrate sensitivity. Set thresholds too high and you’ll miss real issues; set them too low and you’ll overwhelm teams with false positives.

What I recommend is a hybrid approach. Keep your critical business rules for the non-negotiables like referential integrity and required fields, but augment them with AI-driven anomaly detection for subtler issues. For example, we use rule-based checks to ensure employee records have required fields, but ML models to detect unusual patterns in compensation data that might indicate errors or fraud.
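A rough sketch of that hybrid pattern, assuming pandas and scikit-learn, might look like the following; the column names and contamination rate are hypothetical, and anomalies are routed for human review rather than rejected outright.

```python
# Hybrid validation sketch: hard business rules plus ML-based anomaly detection.
# Column names and the contamination rate are illustrative assumptions.
import pandas as pd
from sklearn.ensemble import IsolationForest

def rule_checks(df: pd.DataFrame) -> pd.Series:
    """Binary pass/fail rules for the non-negotiables (required fields)."""
    required = ["employee_id", "department", "base_salary"]
    return df[required].notna().all(axis=1)   # True = passes the hard rules

def anomaly_flags(df: pd.DataFrame) -> pd.Series:
    """Probabilistic layer: flag unusual compensation patterns for review."""
    features = df[["base_salary", "bonus", "tenure_years"]].fillna(0)
    model = IsolationForest(contamination=0.01, random_state=42)
    preds = model.fit_predict(features)       # -1 = anomalous, 1 = normal
    return pd.Series(preds == -1, index=df.index)

def validate(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    out["rule_pass"] = rule_checks(df)
    out["needs_review"] = anomaly_flags(df)   # routed to analysts, not auto-rejected
    return out
```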

The other key consideration is explainability. When an AI system flags something as anomalous, your team needs to understand why. Tools that provide clear explanations of what triggered an alert are essential for building trust and enabling quick remediation.

EW: Data observability platforms have grown rapidly. How do you see the balance between traditional data quality metrics and modern observability methods evolving?

ST: They’re converging rather than competing. Timeliness, correctness, consistency, and completeness remain essential traditional quality metrics; you still need to know whether values fall outside permissible ranges or whether records are missing required fields. But observability adds important context that those metrics alone miss.

Observability platforms track violations, schema evolution, volumetric anomalies, and pipeline performance degradation. They answer questions that traditional quality checks can’t. For example: Why did data volume drop 30% today? Which upstream change caused this downstream failure? How long will it take to process this batch?
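As a toy illustration of the kind of volumetric check these platforms automate, a pipeline could compare today’s row count against a trailing baseline; the seven-day window and 30% threshold below are assumptions.

```python
# Toy volumetric check: compare today's row count against a trailing baseline.
# The 30% drop threshold and seven-day window are illustrative assumptions.
from statistics import mean

def volume_alert(daily_row_counts: list[int], drop_threshold: float = 0.30) -> bool:
    """Return True if today's volume dropped more than `drop_threshold`
    versus the average of the preceding days."""
    *history, today = daily_row_counts
    baseline = mean(history)
    return today < baseline * (1 - drop_threshold)

# Example: a six-day history followed by today's count
if volume_alert([102_000, 98_500, 101_200, 99_800, 100_400, 97_900, 65_000]):
    print("ALERT: ingestion volume dropped more than 30% vs trailing average")
```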

Modern platforms such as Microsoft Purview tie catalog, quality, and governance together in unified workflows. If a quality check fails, the system automatically flags affected datasets, notifies responsible teams, and restricts access until someone fixes the problem. The focus has shifted from running separate tools for quality and observability to maintaining a single, comprehensive view of data health.

The balance I advocate for is using traditional metrics to define what constitutes quality, and observability platforms to monitor whether you’re achieving it continuously across your entire data ecosystem. Track Service Level Indicators and Service Level Objectives specifically for data quality, treating it with the same importance as application uptime.
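A minimal sketch of treating completeness as a Service Level Indicator measured against an SLO might look like this; the 99% target and the notion of “critical columns” are assumed for illustration.

```python
# Sketch: completeness as a Service Level Indicator measured against an SLO.
# The 99% target and the choice of critical columns are assumptions.
import pandas as pd

COMPLETENESS_SLO = 0.99   # agreed target for critical fields

def completeness_sli(df: pd.DataFrame, critical_columns: list[str]) -> float:
    """Fraction of rows where every critical column is populated."""
    return float(df[critical_columns].notna().all(axis=1).mean())

def slo_breached(df: pd.DataFrame, critical_columns: list[str]) -> bool:
    sli = completeness_sli(df, critical_columns)
    # In practice the SLI would be emitted to a monitoring system and tracked
    # over time, just like application uptime.
    return sli < COMPLETENESS_SLO
```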

EW: Governance remains central to data quality. How can companies ensure that governance frameworks keep pace with the speed of continuous data ingestion and real-time analytics?

ST: Governance that is policy-driven, automated, and federated is key to keeping frameworks in step with continuous ingestion and real-time analytics. The traditional approach of a centralized governance board reviewing every change doesn’t scale when data arrives through always-on, real-time pipelines.

The trend is toward data mesh concepts, in which domain teams own their data products yet operate within organizational governance guardrails. By establishing automated procedures that enforce standards continuously, you eliminate the need for a central team to manually examine everything. For instance, regardless of which team controls a dataset, personally identifiable information is automatically classified and secured.
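Managed platforms handle this classification natively, but a stripped-down sketch of pattern-based tagging shows the idea; the regular expressions, tag names, and 80% match threshold below are deliberately simplified assumptions.

```python
# Stripped-down sketch of automated PII classification by column content.
# Real platforms use far richer detectors; these patterns and the 80% match
# threshold are simplified assumptions.
import pandas as pd

PII_PATTERNS = {
    "email": r"^[^@\s]+@[^@\s]+\.[^@\s]+$",
    "us_ssn": r"^\d{3}-\d{2}-\d{4}$",
    "phone": r"^\+?[\d\s\-()]{7,15}$",
}

def classify_columns(df: pd.DataFrame, sample_size: int = 100) -> dict[str, str]:
    """Tag columns whose sampled values look like PII, regardless of owning team."""
    tags = {}
    for col in df.select_dtypes(include="object").columns:
        sample = df[col].dropna().astype(str).head(sample_size)
        if sample.empty:
            continue
        for tag, pattern in PII_PATTERNS.items():
            if sample.str.match(pattern).mean() > 0.8:
                tags[col] = tag   # downstream: apply masking / restricted-access policy
                break
    return tags
```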

Data contracts become crucial here. When source system owners and data platform teams formalize their agreements, governance becomes embedded in the architecture, and compliance is enforced by the system rather than through manual oversight.

For real-time analytics, you need governance controls that operate at the same speed as your data. This means automated data classification, real-time access controls based on data sensitivity, and continuous compliance monitoring rather than periodic audits. Modern platforms like AWS Lake Formation and Google Dataplex provide these capabilities natively.

The key is shifting from “governance as gatekeeping” to “governance as enablement.” Make it easy for teams to do the right thing by providing self-service tools with built-in guardrails, rather than making them wait for approval for every action.

EW: You have led data architecture initiatives in complex enterprise environments. What are the biggest implementation challenges teams face when trying to operationalize data quality across multiple business units or cloud platforms?

ST: The technical challenges are actually easier to solve than the organizational ones. From a technical standpoint, inconsistent tooling and fragmented architectures create headaches. When different business units use different cloud platforms or have their own quality tools, establishing consistent standards becomes difficult. You end up with quality rules that work in one environment but need to be rewritten for another.

The larger issue, though, is aligning organizational priorities. Priorities and definitions of quality vary across business units. What finance deems “acceptable” data might not be sufficient for HR analytics or ESG reporting. Getting everyone to agree on common quality standards and measurements takes a substantial investment in stakeholder engagement.

The talent gap is another significant obstacle. Implementing modern data quality frameworks requires expertise in cloud platforms, data engineering, governance, and increasingly AI and machine learning. Many organizations struggle to find people with that combination of skills, or their expertise is siloed in teams that don’t collaborate effectively.

What’s worked for me is establishing centers of excellence that provide shared quality frameworks, tools, and expertise that business units can leverage. Rather than each unit building everything from scratch, they adopt common platforms with customization for their specific needs. This balances standardization with flexibility.

The cost-benefit conversation is also challenging. Executives often see data quality as overhead rather than value creation. Demonstrating ROI requires tracking metrics like reduced data incidents, faster time to insights, improved decision accuracy, and decreased remediation costs. You need to speak the language of business value, not just technical excellence.

EW: How are organizations aligning data quality initiatives with broader business goals such as AI readiness, customer experience, or ESG reporting?

ST: The alignment is becoming more natural as executives recognize that data quality directly enables strategic objectives. For AI readiness, organizations are establishing “AI-grade” quality standards that go beyond traditional metrics. This includes bias detection, semantic consistency validation, and comprehensive lineage tracking so you can explain how AI models make decisions.

In my work with ESG and Human Capital data lakes, quality alignment is explicit. Our data supports 10-K reporting and regulatory compliance, where accuracy isn’t optional. We’ve implemented automated quality checks that ensure ESG metrics meet reporting standards, and we track data lineage so auditors can verify how numbers were calculated.

For customer experience, quality enables personalization at scale. When you have clean, comprehensive customer data, you can deliver relevant experiences across touchpoints. Poor quality leads to things like sending promotions for products customers already own or addressing communications to the wrong person; these are direct customer experience failures.

What I’ve seen work best is establishing quality KPIs that directly tie to business outcomes. For instance, we measure how data quality improvements reduce customer service call volumes or enable faster regulatory reporting. When quality initiatives have clear business metrics attached, they get appropriate investment and priority.

The other critical element is involving business stakeholders in defining quality requirements. Rather than technical teams deciding what quality means, business units specify what they need to achieve their objectives, and engineering teams build the frameworks to deliver it.

EW: Data engineers often struggle to justify data quality investments. What metrics or success indicators best demonstrate business impact to leadership?

ST: You have to translate technical improvements into business language. Metrics like “data completeness improved from 92% to 98%” don’t resonate with executives. Instead, focus on outcomes: “We reduced time to generate quarterly reports from 10 days to 3 days” or “We prevented $2M in potential compliance fines by catching data issues before regulatory submission.”

The metric I’ve found most effective is incident reduction: tracking how many data-related problems reach production and impact business operations. Every prevented incident has a quantifiable cost in terms of staff time to investigate, business decisions delayed, and potential revenue impact. When you can show that quality investments reduced incidents by 60%, that’s compelling.

Time to insight is another powerful metric. When analysts spend less time questioning and cleaning data and more time generating insights, that’s measurable productivity gain. We track how much time business users spend on data preparation versus actual analysis, and quality improvements directly impact that ratio.

For AI and ML initiatives, model accuracy and reliability metrics are pivotal. If better data quality improves model performance by even a few percentage points, that can translate to significant business value depending on the use case. For customer churn prediction, better data might mean preventing hundreds of thousands in lost revenue.

Finally, risk avoidance is a key metric, especially for regulated industries. Quality failures that could lead to regulatory fines, compliance violations, or reputational damage represent quantifiable risks that quality investments mitigate.

EW: Looking ahead, where do you see automation and AI making the biggest difference in ensuring trustworthy, high-quality data pipelines?

ST: The biggest impact will be in moving from reactive to proactive quality management. Today, even with automated checks, we’re mostly detecting problems after they occur. AI will make predictive quality possible, spotting potential issues before they arise by analyzing data characteristics, pipeline activity patterns, and past incidents.

I’m particularly excited about AI-generated validation rules. Modern platforms using large language models can analyze transformation logic and automatically suggest appropriate quality checks. Some systems can even generate test cases and propose fixes when checks fail. This shifts quality work from manual rule writing to reviewing and refining AI suggestions, dramatically increasing coverage and speed.
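The general pattern can be sketched as follows, assuming the openai Python client as the model interface; the prompt, the model name, and the profiling statistics are illustrative assumptions rather than a description of any particular platform, and the suggestions go to a human reviewer before anything is deployed.

```python
# Rough sketch of LLM-assisted rule suggestion: profile a column, ask a model
# to propose checks, and keep a human in the loop before deployment.
# Assumes the openai client library; the model name and prompt are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def suggest_quality_rules(column_name: str, profile: dict) -> str:
    """Ask the model to propose validation rules from simple profiling stats."""
    prompt = (
        f"Column `{column_name}` has this profile: {profile}. "
        "Suggest 3 data quality validation rules as short bullet points."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",   # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Suggestions are reviewed and refined by engineers, not applied automatically.
print(suggest_quality_rules("base_salary", {"min": 0, "max": 1_200_000, "null_pct": 0.3}))
```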

Another significant area is self-healing pipelines. Beyond identifying quality problems, future systems will use AI to apply fixes automatically, such as resolving inconsistent formatting, filling in missing values based on learned patterns, or reconciling conflicting records from multiple sources. Human oversight remains critical, but routine corrections happen without manual intervention.
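A simplified sketch of that kind of routine auto-remediation, using pandas, could look like the following; the specific fixes and column names are assumptions, and every change is logged so humans can audit or roll it back.

```python
# Simplified auto-remediation sketch: normalize formatting and impute missing
# values from observed patterns, logging every change for human review.
# The fixes and column names shown are illustrative assumptions.
import pandas as pd

def auto_remediate(df: pd.DataFrame) -> tuple[pd.DataFrame, list[str]]:
    fixed = df.copy()
    log = []

    # 1. Resolve inconsistent formatting (e.g. department codes cased or spaced differently).
    mask = fixed["department"].notna()
    normalized = fixed.loc[mask, "department"].str.strip().str.upper()
    changed = int((normalized != fixed.loc[mask, "department"]).sum())
    fixed.loc[mask, "department"] = normalized
    log.append(f"normalized department formatting on {changed} rows")

    # 2. Fill missing numeric values from a learned per-group pattern (department median).
    missing = int(fixed["base_salary"].isna().sum())
    fixed["base_salary"] = fixed.groupby("department")["base_salary"].transform(
        lambda s: s.fillna(s.median())
    )
    log.append(f"{missing} missing base_salary values imputed where a department median was available")

    return fixed, log   # the change log feeds a review queue so humans retain oversight
```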

Semantic validation using embeddings and contextual AI will detect consistency issues that traditional rules miss. These systems understand meaning, not just format, which matters especially for AI applications that need semantically correct data. For example, detecting when a product description doesn’t match its category, even though both fields individually contain valid data.
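A compact sketch of such a check, assuming the sentence-transformers library and cosine similarity, might look like this; the model choice, similarity threshold, and product example are assumptions.

```python
# Compact sketch of a semantic consistency check: flag rows where a product
# description doesn't "mean" the same thing as its category, even though both
# fields are individually valid. Model choice and threshold are assumptions.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")   # small general-purpose embedder

def semantic_mismatches(descriptions: list[str], categories: list[str],
                        threshold: float = 0.25) -> list[int]:
    """Return row indices where description/category similarity falls below threshold."""
    desc_emb = model.encode(descriptions)
    cat_emb = model.encode(categories)
    sims = cosine_similarity(desc_emb, cat_emb).diagonal()
    return [i for i, s in enumerate(sims) if s < threshold]

rows = semantic_mismatches(
    ["Stainless steel 12-cup coffee maker with timer"],
    ["Automotive tires"],
)
print(rows)   # [0] -> likely a miscategorized product worth reviewing
```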

The integration of quality with data catalogs and lineage tools will also advance significantly. When quality issues are detected, systems will automatically identify all affected downstream assets and consumers, enabling targeted remediation based on actual business impact rather than guessing.

EW: You are involved in mentoring and training young engineers. What advice would you give to data professionals aiming to future-proof their skills in this fast-changing ecosystem of data quality, AI, and governance?

ST: First, build a strong foundation in data fundamentals. Technologies change, but principles around data modeling, pipeline architecture, and quality dimensions remain constant. Have a thorough grasp of how data moves through systems and what can go wrong at each stage.

Second, embrace continuous learning, particularly around cloud platforms and AI technologies. The tools I used five years ago are largely obsolete now. Stay current with major platforms like Google Cloud, AWS, and Azure, and understand how their native quality and governance tools work. Experiment with AI-assisted development tools; they’re becoming essential for productivity.

Third, develop business acumen alongside technical skills. The most valuable data engineers understand not just how to build pipelines, but why certain data matters to the business and how quality issues impact outcomes. Learn to speak the language of business value and ROI, not just technical metrics.

Fourth, focus on governance and compliance. As regulations around data privacy, AI transparency, and ESG reporting expand, professionals who understand both technical implementation and regulatory requirements will be in high demand. These skills differentiate you from engineers who focus purely on data movement.

Finally, develop collaboration skills. Modern data quality requires coordination across business units, technical teams, and leadership. Engineers who can build consensus, translate between technical and business audiences, and foster trust across organizational boundaries will advance faster than those with a purely technical focus.

I also encourage young engineers to contribute to open-source projects and engage with professional communities. The data engineering field evolves rapidly, and learning from others’ experiences accelerates your development significantly.

