


Correctness, Data Completeness, Decision Trees.

1 Introduction

There are different types of errors or faults which may occur in data sets, such as missing values or rows, invalid values or formats, or duplicated values or rows. Low-quality data will result in low-quality machine learning models if the models are trained on that data. Before using often faulty real-world data and trying to find a remedy for observed machine learning model deficiencies, we need to better understand the effects of low input data quality on the created models.

Our ultimate goal is to automate quality control of machine learning models. To reach it, understanding the impact that a sensor producing faulty data or no data has on a model trained on this data is a general requirement. The wider objective is explainable model construction. Black-box explainable AI aims at a better understanding of how ML model output depends on the model input [11]. Of particular importance here is a root cause analysis for model deficiencies. Our aim is, based on observed model quality problems, to identify a root cause at input data level. The concrete practical benefit of this, for example in an IoT setting, is that certain ML quality patterns might already point to specific problems with the data, such as outages of faulty sensors.



Therefore, we investigated different experimental scenarios with artificial and real faulty input data sets. We specifically considered 1) input data completeness and 2) input data correctness, since these are of direct relevance to IoT settings. In the experiments, we created situations with different faulty data sets and compared the results to find a connection between the type of faulty data and the ML quality assessment factors (accuracy, precision, recall). We focus here on numeric data that would, for example, be collected in technical or economic applications, and leave text and image data aside. The novelty lies in the integrated investigation of the quality of information that is derived from data through a machine learning approach. We proposed a quality framework in [2], [1], but report on an in-depth experimental study here.


2 Related Work

Machine learning (ML) techniques have generated huge impacts in a wide range of applications such as computer vision, speech processing, health or IoT.

Input data quality is important. The issue of missing data is unavoidable in data collection [4], [13], [7], [18]. Various imputation approaches, i.e., approaches that substitute missing values, have been proposed to address the issue of missing values in data mining and machine learning applications. [13] addresses missing data imputation. The authors propose a method called DIFC that integrates decision trees and fuzzy clustering into an iterative learning approach in order to improve the accuracy of missing data imputation. They demonstrated the robustness of DIFC against different types of missing data.
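To make the notion of imputation concrete, the following is a minimal sketch of the simplest baseline, column-mean imputation. It is not the DIFC method of [13]; the function name `impute_column_means`, the use of `None` as the missing-value marker, and the sample readings are assumptions made only for illustration.

```python
def impute_column_means(rows):
    """Replace None entries with the mean of the observed values in that column."""
    n_cols = len(rows[0])
    means = []
    for c in range(n_cols):
        observed = [r[c] for r in rows if r[c] is not None]
        # A column with no observed values cannot be imputed and stays None.
        means.append(sum(observed) / len(observed) if observed else None)
    # Build new rows so that the input data is left untouched.
    return [[means[c] if v is None else v for c, v in enumerate(r)] for r in rows]


# Hypothetical sensor readings: (temperature, humidity), None marks a missing value.
readings = [[20.0, 0.5], [None, 0.25], [24.0, None], [22.0, 0.75]]
filled = impute_column_means(readings)
```

Mean imputation ignores relationships between features, which is one reason why more elaborate approaches such as those cited above have been proposed.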

Missing data has a negative impact on the performance of machine learning models. Regarding concrete ML techniques, handling missing data in decision trees is a well-studied problem [5]. [19] also proposed a method for dealing with missing data in decision trees. In [7], the authors tackle this problem by taking a probabilistic approach. They used tractable density estimators to compute the “expected prediction” of their models. Missing data, or uncertain data in general, has always been a central issue in machine learning and especially for classifiers. [18] focused on the accuracy of decision trees with uncertain data. The authors discovered that the accuracy of a decision tree classifier can be improved if the complete information of a data item is utilized. They extended classical decision tree algorithms to handle data tuples with uncertain data. Paper [15] describes a solution pattern that analyzes IoT sensor data and failures from multiple assets for data-driven failure analysis. The paper used univariate and multivariate change point detection models for performing the analysis and adapted the definitions of precision, recall and accuracy to incorporate the temporal window constraint. In [17], a toolkit for structured data quality learning is presented. The authors defined four core data quality constructs and their interactions to cover the majority of data quality analysis tasks.

Focusing on decision trees and missing data, we investigate the link between source data and machine learning model as a so far unexplored AI explainability concern.


3 Method

Before presenting the results of the experiments in the following section, we introduce here our methods, including the description of objectives, data and implementation. In many applications, ML models are reconstructed continuously based on changing input data. We use experiments to determine the extent to which different input changes regarding data quality impact on model construction quality. In more concrete terms, the question is whether changes in the data quality or the model construction have a similar impact on output quality. We consider here the following ML quality attributes. Precision, also known as Positive Predictive Value (PPV), answers the question of how many selected



For input data quality, we selected two attributes that are IoT-relevant [3]: completeness is the degree to which the number of data points required to reach a defined accuracy threshold has been provided, and correctness is the degree to which data correctly reflects an object or an event described, i.e., how close a label is to the real world. In the context of these definitions, a sample question is whether minor changes in the completeness of data (as a data quality problem) or the tree depth of decision trees (as a model construction concern) have a similar impact on model accuracy. Experiments shall help to determine the scale of the impact of input variations of a given size. We use experiments to determine if certain input change patterns correlate to observable output change patterns [6]. In concrete terms, the question is whether minor or major changes in input and input quality result in identifiable change patterns across different output qualities (e.g., accuracy, precision, recall). The question is if observed change patterns in the ML model output can be attributed to the root cause of that change at input data level.
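The three output qualities named above can be computed from the counts of a binary confusion matrix. The sketch below uses the standard textbook definitions; the function names and the sample labels are hypothetical and not taken from the study.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary task."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def quality_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision (PPV) and recall; 0.0 if a ratio is undefined."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical labels: 1 = "rain tomorrow", 0 = "no rain".
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = quality_metrics(y_true, y_pred)
```

Computing all three values from the same counts makes it easy to compare how a given input fault shifts each metric relative to the others.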

predicted rainfall. The first data set was traffic data taken from an application; it consisted of daily averages of traffic and number of vehicles at 72 stations around our province over a month. The total number of rows in this data set is thus 72. The second data set was weather data consisting of the minimum and maximum temperature, rainfall, wind speed, humidity, pressure, cloud and rain today as features; the target is the possibility of rainfall the next day, for 49 stations. The data from both data sets, consisting of only numerical values, has been processed and labeled manually.

The experimental strategy was to find the effect on accuracy, precision and recall for these quality attributes. In order to check the impact of incomplete and incorrect input data on accuracy, we created two different situations for each data set. For incompleteness, we checked the impact of Missing Features and Missing Rows on accuracy, precision and recall. For incorrectness, we checked the impact of Invalid Features and Invalid Rows for different invalid values on accuracy, precision and recall.
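The four fault scenarios above could be produced from a clean data set along the following lines. This is only a sketch under stated assumptions: the text does not say how the faults were injected, so the random selection, the fraction parameter, the `None` marker for missing values, the invalid value `-999.0`, and all function names are hypothetical.

```python
import random


def _pick(n_rows, fraction, seed):
    """Choose a reproducible random subset of row indices."""
    rng = random.Random(seed)
    n_pick = int(round(fraction * n_rows))
    return set(rng.sample(range(n_rows), n_pick))


def missing_rows(rows, fraction, seed=0):
    """Missing Rows: drop a random fraction of the rows."""
    gone = _pick(len(rows), fraction, seed)
    return [list(r) for i, r in enumerate(rows) if i not in gone]


def missing_feature(rows, column, fraction, seed=0):
    """Missing Features: blank one feature (None) in a fraction of the rows."""
    hit = _pick(len(rows), fraction, seed)
    return [[None if (i in hit and c == column) else v
             for c, v in enumerate(r)] for i, r in enumerate(rows)]


def invalid_feature(rows, column, fraction, invalid_value, seed=0):
    """Invalid Features: overwrite one feature with an invalid value."""
    hit = _pick(len(rows), fraction, seed)
    return [[invalid_value if (i in hit and c == column) else v
             for c, v in enumerate(r)] for i, r in enumerate(rows)]


def invalid_rows(rows, fraction, invalid_value, label_column=-1, seed=0):
    """Invalid Rows: overwrite every feature, but not the label, in chosen rows."""
    hit = _pick(len(rows), fraction, seed)
    out = []
    for i, r in enumerate(rows):
        label = label_column % len(r)  # supports negative column indices
        out.append([invalid_value if (i in hit and c != label) else v
                    for c, v in enumerate(r)])
    return out


# Hypothetical toy data: two numeric features and a binary label per row.
clean = [[float(i), float(2 * i), i % 2] for i in range(8)]
fewer = missing_rows(clean, 0.25)
blanked = missing_feature(clean, column=0, fraction=0.5)
bad_col = invalid_feature(clean, column=1, fraction=0.25, invalid_value=-999.0)
bad_rows = invalid_rows(clean, fraction=0.25, invalid_value=-999.0)
```

Each function returns a new list and leaves the clean data unchanged, so a model trained on the clean data can serve as the baseline against which accuracy, precision and recall on each faulty variant are compared.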

