Toward Consistent Data Quality Assessment in Heterogeneous Smart Living Systems
TL;DR: A prototype that uses LLMs to automate data quality checks for heterogeneous smart living datasets. Smart living environments, powered by a diverse array of IoT devices and software, generate vast amounts of data. For these data to be useful in AI-based services, they must be of high quality. However, the extreme variety in data structures, a result of many different providers and use cases, makes consistent quality assessment a significant challenge.
How can we reliably assess data quality when every dataset has a different, often unknown, structure?
To address this, we developed a prototype that automates the generation of data quality reports. The system uses a Large Language Model (LLM) to analyze a dataset, infer its structure, and suggest applicable quality checks. The structured metadata produced by the LLM then serves as input to a codebase that performs standardized quality assessments.
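To illustrate the idea of metadata-driven checks, here is a minimal sketch in plain Python. The metadata schema (field names such as `columns` and `checks`, and the check kinds `not_null` and `range`) is a hypothetical example of what an LLM might emit, not the prototype's actual format:

```python
import json

# Hypothetical metadata an LLM might produce after inspecting a dataset.
# The schema shown here is illustrative, not the prototype's real output.
METADATA = json.loads("""
{
  "columns": [
    {"name": "temperature_c", "checks": [
        {"kind": "range", "min": -40, "max": 60},
        {"kind": "not_null"}
    ]},
    {"name": "device_id", "checks": [{"kind": "not_null"}]}
  ]
}
""")

def run_checks(rows, metadata):
    """Apply the metadata-declared checks to each row; return a simple report."""
    report = {}
    for col in metadata["columns"]:
        name = col["name"]
        failures = 0
        for row in rows:
            value = row.get(name)
            for check in col["checks"]:
                if check["kind"] == "not_null" and value is None:
                    failures += 1
                elif check["kind"] == "range" and value is not None and not (
                    check["min"] <= value <= check["max"]
                ):
                    failures += 1
        report[name] = {"rows": len(rows), "failed_checks": failures}
    return report

rows = [
    {"device_id": "sensor-1", "temperature_c": 21.5},
    {"device_id": "sensor-2", "temperature_c": 999.0},  # out of range
    {"device_id": None, "temperature_c": None},         # missing values
]
print(run_checks(rows, METADATA))
```

Because the checks are declared as data rather than hard-coded, the same assessment code can run unchanged against datasets from different providers; only the LLM-generated metadata varies.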
Our approach provides a consistent method for evaluating data quality, not only within a single system but also across different organizations and providers. We evaluated the prototype with both synthetic and real-world datasets, demonstrating the feasibility of using LLMs to manage the complexity of heterogeneous data.