## Data mining glossary

This glossary is designed to help readers understand the data mining terminology used in this tutorial.

**accuracy** Accuracy is an important factor in assessing the success of data mining. When applied to data, accuracy refers to the rate of correct values in the data. When applied to models, accuracy refers to the degree of fit between the model and the data, which measures how error-free the model's predictions are. Since accuracy does not include cost information, it is possible for a less accurate model to be more cost-effective. Also see precision.

**categorical data** Categorical data fits into a small number of discrete categories (as opposed to continuous). Categorical data is either non-ordered (nominal), such as gender or city, or ordered (ordinal), such as high, medium, or low temperatures.

**classification** Given a set of predefined categorical classes for the target variable, determine to which of these classes a specific example (sample) belongs. For example, given classes of patients that correspond to certain degrees of a disease, identify from a patient's set of variables to which class she/he belongs.

**clustering (segmentation)** Clustering algorithms find groups of items that are similar. For example, clustering could be used by an insurance company to group customers according to income, age, types of policies purchased and prior claims experience. Clustering divides a data set so that records with similar content are in the same group, and groups are as different as possible from each other. Since the categories are unspecified, this is sometimes referred to as unsupervised learning.

**confidence (data mining)** The confidence of the rule "B given A" is a measure of how likely it is that B occurs when A has occurred. It is expressed as a percentage, with 100% meaning B always occurs if A has occurred. Statisticians refer to this as the conditional probability of B given A. When used with association rules, the term confidence is observational rather than predictive.
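
As a minimal sketch of the confidence (data mining) entry above, the conditional probability of B given A can be estimated by counting transactions. The function name `confidence` and the basket data are made up for illustration; they are not part of the tutorial.

```python
def confidence(transactions, antecedent, consequent):
    """Share of transactions containing `antecedent` that also contain `consequent`."""
    antecedent, consequent = set(antecedent), set(consequent)
    # Transactions in which A occurred
    with_a = [t for t in transactions if antecedent <= set(t)]
    if not with_a:
        raise ValueError("antecedent never occurs, confidence is undefined")
    # Of those, the transactions in which B occurred as well
    with_both = [t for t in with_a if consequent <= set(t)]
    return len(with_both) / len(with_a)


# Hypothetical shopping baskets (illustrative data only)
baskets = [
    {"beer", "chips"},
    {"beer"},
    {"beer", "diapers"},
    {"beer", "bread"},
    {"chips"},
]
rule_confidence = confidence(baskets, {"beer"}, {"chips"})
print(f"confidence(beer -> chips) = {rule_confidence:.0%}")
```

Here beer appears in four baskets and chips accompanies it in one of them, so the rule "chips given beer" has a confidence of 25%.
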

**confidence (statistics)** Usually refers to the probability that some interval contains the true value of a parameter (also called interval confidence). A 95% confidence interval for the mean has a probability of 0.95 of covering the true value of the mean.

**confusion matrix** A confusion matrix shows the counts of the actual versus predicted class values. It shows not only how well the model predicts, but also presents the details needed to see exactly where things may have gone wrong.

**Conjunctive Normal Form (CNF)** A conjunction of clauses, where clauses are either attribute-value conditions or disjunctions of attribute-value conditions. For example, (color=red or color=green) and (shape=rectangular) is a formula in Conjunctive Normal Form (CNF).

**consequent (right-hand side of the rule)** When an association between two variables is defined, the second item (or right-hand side) is called the consequent. For example, in the relationship "When a customer buys a beer, he also buys chips 25% of the time", "buys chips" is the consequent.

**continuous data** Continuous data can have any value in an interval of real numbers. That is, the value does not have to be an integer. Continuous is the opposite of discrete or categorical.

**cross validation** A method of estimating the accuracy of a classification or regression model. The data set is divided into several parts, with each part in turn used to test a model fitted to the remaining parts.

**data** Values collected through record keeping or by polling, observing, or measuring, typically organized for analysis or decision making. More simply, data is facts, transactions and figures.

**data description** A concise description of the characteristics of the data in elementary and aggregated form that gives an overview of the structure of the data. This is usually a subprocess of the data mining process, based on simple descriptive statistical techniques and visualization (attribute value distributions, means and medians, frequency tables).
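
The simple descriptive statistics mentioned above (means, medians, frequency tables) can be sketched with the Python standard library. The attribute names and values below are invented for illustration only.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical attribute values for five examples (illustrative data only)
ages = [34, 45, 29, 45, 61]
cities = ["Zagreb", "Split", "Zagreb", "Rijeka", "Zagreb"]

summary = {
    "mean_age": mean(ages),          # arithmetic average of a numeric attribute
    "median_age": median(ages),      # middle value of the ordered ages
    "city_counts": Counter(cities),  # frequency table of a categorical attribute
}

for name, value in summary.items():
    print(name, value)
```

A summary like this gives a first overview of the data before any model is built.
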

**data format** Data items can exist in many formats such as text, integer and floating-point decimal. Data format refers to the form of the data in the database.

**data mining** An information extraction activity whose goal is to discover hidden facts contained in databases. Using a combination of machine learning, statistical analysis, modeling techniques and database technology, data mining finds patterns and subtle relationships in data and infers rules that allow the prediction of future results. Typical applications include market segmentation, customer profiling, fraud detection, evaluation of retail promotions, and credit risk analysis.

**data set** A data set is a set of examples.

**data warehouse** A data warehouse is a copy of (in most cases) transaction data, specifically structured for querying and reporting.

**DBMS** Database management system.

**decision tree** A tree-like way of representing a collection of hierarchical rules that lead to a class or value.

**deduction** Deduction infers information that is a logical consequence of the data.

**dependency analysis** Dependency analysis aims to find models that describe significant dependencies (or associations) between data items or events. Dependencies can be used to predict the value of a data item given information on other data items. Dependencies can be strict or probabilistic. Examples: association rules, Bayesian networks.

**deployment** After the model is trained and validated, it is used to analyze new data and make predictions. This use of the model is called deployment.

**dimension** Usually refers to an attribute of an example in the data being mined. It is stored as a field in a flat-file record or as a column of a relational database table.

**discrete data** A data item that has a finite set of values. Discrete is the opposite of continuous.

**discriminant analysis** A statistical method based on maximum likelihood for determining boundaries that separate the data into categories.

**Disjunctive Normal Form (DNF)** A disjunction of clauses, where clauses are conjunctions of attribute-value conditions. For example, (color=red and shape=rectangular) or (color=green and shape=rectangular) is a formula in Disjunctive Normal Form (DNF).

**entropy** A way to measure variability other than the variance statistic. Some decision trees split the data into groups based on minimum entropy.

**example** An example, sometimes referred to as an instance, sample, or data item, is an ordered set of variables. An example of an example is the set of results of diagnostic tests of a particular patient in a clinical database, or the set of characteristics of some building in a database of buildings.

**exploratory analysis** Looking at data to discover relationships not previously detected. Exploratory analysis tools typically assist the user in creating tables and graphical displays.

**feed-forward network** A neural net in which the signals only flow in one direction, from the inputs to the outputs.

**genetic algorithms** A computer-based method of generating and testing combinations of possible input parameters to find the optimal output. It uses processes based on natural evolution concepts such as genetic combination, mutation and natural selection.

**GUI** Graphical User Interface.

**hidden nodes** The nodes in the hidden layers in a neural net. Unlike input and output nodes, the number of hidden nodes is not predetermined.
The accuracy of the resulting model is affected by the number of hidden nodes. Since the number of hidden nodes directly affects the number of parameters in the model, a neural net needs a sufficient number of hidden nodes to enable it to properly model the underlying behavior. On the other hand, a net with too many hidden nodes will overfit the data. Some neural net products include algorithms that search over a number of alternative neural nets by varying the number of hidden nodes, in the end choosing the model that gets the best results without overfitting.

**independent variable** The independent variables (inputs or predictors) of a model are the variables used in the equation or rules of the model to predict the output (dependent) variable.

**induction** A technique that infers generalizations from the information in the data.

**k-nearest neighbor** A classification method that classifies a point by calculating the distances between the point and the points in the training data set. It then assigns the point to the class that is most common among its k nearest neighbors (where k is an integer).

**Kohonen feature map** A type of neural network that uses unsupervised learning to find patterns in data. In data mining it is employed for cluster analysis.

**labeled example** A labeled example is one for which the value of the target variable is known.

**layer** Nodes in a neural net are usually grouped into layers, with each layer described as input, output or hidden. There are as many input nodes as there are input (independent) variables and as many output nodes as there are output (dependent) variables. Typically, there are one or two hidden layers.

**leaf node** A node that is not split further (the terminal grouping) in a classification or decision tree.

**learning** Training models (estimating their parameters) based on existing data.

**left-hand side** When an association between two variables is defined, the first item is called the left-hand side (or antecedent).
For example, in the relationship "If a customer buys a beer, he buys chips 25% of the time", "buys a beer" is the left-hand side.

**lift (chart)** Lift is a measure of the effectiveness of a predictive model, calculated as the ratio between the results obtained with and without the predictive model. A lift curve shows lift as a function of the examples covered by the model.

**maximum likelihood** A training or estimation method. The maximum likelihood estimate of a parameter is the value of the parameter that maximizes the probability that the data came from the population defined by the parameter.

**mean** The arithmetic average value of a collection of numeric data.

**median** The value in the middle of a collection of ordered data. In other words, the value with the same number of items above and below it.

**missing data** Data values can be missing because they were not measured, not answered, were unknown or were lost. Data mining methods vary in the way they treat missing values. Typically, they ignore the missing values, omit any records containing missing values, replace missing values with the mode or mean, or infer missing values from existing values.

**mode** The most common value in a data set. If more than one value shares the highest frequency, the data is multi-modal.

**model** An important function of data mining is the production of a model. A model can be descriptive or predictive. A descriptive model helps in understanding underlying processes or behavior. For example, an association model describes consumer behavior. A predictive model is an equation or set of rules that makes it possible to predict an unseen or unmeasured value (the dependent variable or output) from other, known values (independent variables or input). The form of the equation or rules is suggested by mining data collected from the process under study. Some training or estimation technique is used to estimate the parameters of the equation or rules.
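
One of the strategies listed in the missing data entry above, replacing missing values with the mean, might look like the following sketch. It assumes missing values are represented as `None`; the function name `fill_missing_with_mean` and the readings are hypothetical.

```python
from statistics import mean


def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed (non-missing) values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to compute a mean from")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical measurements with two missing entries (illustrative data only)
readings = [4.0, None, 6.0, 8.0, None]
filled = fill_missing_with_mean(readings)
print(filled)
```

Mean replacement keeps every record but reduces the spread of the attribute, which is one reason methods differ in how they treat missing values.
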

**neural network** A complex nonlinear modeling technique based on a model of a human neuron. A neural net is used to predict outputs (dependent variables) from a set of inputs (independent variables) by taking linear combinations of the inputs and then making nonlinear transformations of the linear combinations using an activation function. It can be shown theoretically that such combinations and transformations can approximate virtually any type of response function. Thus, neural nets use large numbers of parameters to approximate any model. Neural nets are often applied to predict future outcomes based on prior experience. For example, a neural net application could be used to predict who will respond to a direct mailing.

**node** A decision point in a classification (i.e., decision) tree. Also, a point in a neural net that combines input from other nodes and produces an output through application of an activation function.

**noise** In general, data is referred to as noisy when it contains errors, such as many missing or incorrect values, or when there are extraneous columns.

**nominal domains** In nominal domains one can enumerate all possible variable values and there is no order relation between them. For example, the set of colors {red, green, blue} or the set of sexes {male, female} represents a nominal domain.

**normalization** A collection of numeric data is normalized by subtracting the minimum value from all values and dividing by the range of the data. This yields data with a similarly shaped histogram but with all values between 0 and 1. It is useful to do this for all inputs into neural nets and also for inputs into other regression models. (Also see standardize.)

**OLAP** On-Line Analytical Processing tools give the user the capability to perform multi-dimensional analysis of the data.
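
The normalization entry above (subtract the minimum, divide by the range) can be written out as a short sketch. The function name `normalize` and the income values are illustrative assumptions.

```python
def normalize(values):
    """Rescale values to [0, 1] by subtracting the minimum and dividing by the range."""
    lowest, highest = min(values), max(values)
    value_range = highest - lowest
    if value_range == 0:
        # All values identical: the range is zero and the ratio is undefined
        raise ValueError("cannot normalize data with zero range")
    return [(v - lowest) / value_range for v in values]


# Hypothetical incomes in thousands (illustrative data only)
incomes = [20.0, 35.0, 50.0, 80.0]
scaled = normalize(incomes)
print(scaled)
```

The smallest value maps to 0 and the largest to 1, while the relative spacing of the values is preserved.
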

**optimization criterion (in data mining)** A function of the difference between a model's predictions and the data. The model's parameter estimates are chosen so as to optimize this function or criterion. Least squares and maximum likelihood are examples.

**ordered domains** These are numerical domains. Sometimes it is possible to enumerate all possible values that the variable can take (integers: for example, ages of patients in a database). In most cases this is not possible, especially with continuous (real number) domains.

**outliers** Technically, outliers are examples for which one (or more) attribute value is significantly different from the values of that attribute in other (similar) examples. The value lies outside the range expected for the example, so it represents an outlier. Outliers might indicate erroneous data collection, or might come from a different part of the population of examples, indicating a new phenomenon.

**overfitting** A tendency of some modeling techniques to assign importance to random variations in the data by declaring them important patterns.

**pattern** Analysts and statisticians spend much of their time looking for patterns in data. A pattern can be a relationship between two variables. Data mining techniques include automatic pattern discovery that makes it possible to detect complicated non-linear relationships in data. Patterns are not the same as causality.

**prediction** Prediction (sometimes also called regression) is similar to classification. The only difference is that in prediction the target attribute is not discrete but continuous. The aim of prediction is to find the numerical value of the target attribute for unlabeled (unseen) examples.

**precision** The precision of an estimate of a parameter in a model is a measure of the variability of the estimate over other similar data sets. A very precise estimate is one that does not vary much over different data sets. Precision does not measure accuracy.
Accuracy is a measure of how close the estimate is to the real value of the parameter. Accuracy is measured by the average distance over different data sets of the estimate from the real value. Estimates can be accurate but not precise, or precise but not accurate. A precise but inaccurate estimate is usually biased, with the bias equal to the average distance from the real value of the parameter.

**predictability** Some data mining vendors use predictability of associations or sequences to mean the same as confidence.

**propositional-like representations** Propositional-like representations use logic formulae consisting of attribute-value conditions. Two alternative representations falling into this category are Conjunctive Normal Form (CNF) and Disjunctive Normal Form (DNF).

**pruning** Eliminating lower-level splits or entire sub-trees in a decision tree. This term is also used to describe algorithms that adjust the topology of a neural net by removing (i.e., pruning) hidden nodes.

**range** The range of the data is the difference between the maximum value and the minimum value.

**RDBMS** Relational Database Management System.

**regression tree** A decision tree that predicts values of continuous variables.

**resubstitution error** The estimate of error based on the differences between the predicted values of a trained model and the observed values in the training set.

**right-hand side** When an association between two variables is defined, the second item is called the right-hand side (or consequent). For example, in the relationship "When a prospector buys a pick, he buys a shovel 14% of the time", "buys a shovel" is the right-hand side.

**ROC curve** The Receiver Operating Characteristic curve (also referred to as Relative Operating Characteristic) graphs the false-positive rate on the x-axis and the true-positive rate on the y-axis for a selected category.

**r-squared** A number between 0 and 1 that measures how well a model fits its training data. One is a perfect fit, whereas zero implies the model has no predictive ability.
It is computed as the square of the correlation between the predicted and observed values, that is, the square of their covariance divided by the product of their standard deviations.

**sampling** Creating a subset of data from the whole. Random sampling attempts to represent the whole by choosing the sample through a random mechanism.

**sensitivity analysis** Varying the parameters of a model to assess the change in its output.

**significance** A probability measure of how strongly the data support a certain result (usually of a statistical test). If the significance of a result is said to be .05, it means that there is only a .05 probability that the result could have happened by chance alone. Very low significance (less than .05) is usually taken as evidence that the data mining model should be accepted, since events with very low probability seldom occur. So if the estimate of a parameter in a model showed a significance of .01, that would be evidence that the parameter belongs in the model.

**supervised learning** The collection of techniques where analysis uses a well-defined (known) dependent variable. All regression and classification techniques are supervised.

**support** The measure of how often the collection of items in an association occur together, as a percentage of all the transactions. For example, "In 2% of the purchases at the hardware store, both a pick and a shovel were bought."

**target variable** The target variable, or target attribute, is the variable of an example that describes the phenomenon of interest, that is, the phenomenon we would like to make predictions about using the independent variables or attributes.

**test set** A test set is a data set of examples, withheld from training, used for testing the performance of the model learned on the training data set. The examples are presented to the model unlabeled, and its predictions are then compared with the observed values.

**test error** The estimate of error based on the difference between the predictions of a model on a test data set and the observed values in the test data set, when the test data set was not used to train the model.
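
To make the test error entry above concrete, the sketch below compares it with the resubstitution error for a deliberately trivial model that always predicts the training mean. Mean squared error is used as the error measure, which is an assumption: the glossary does not prescribe one. All names and numbers are illustrative.

```python
from statistics import mean


def mean_squared_error(predictions, observed):
    """Average squared difference between predicted and observed values."""
    return mean([(p - o) ** 2 for p, o in zip(predictions, observed)])


# Hypothetical target values (illustrative data only)
train_targets = [2.0, 4.0, 6.0]
test_targets = [3.0, 7.0]  # never used to fit the model

# "Training": the only parameter of this toy model is the training mean
constant_prediction = mean(train_targets)

# Error measured on the data the model was trained on
resubstitution_error = mean_squared_error(
    [constant_prediction] * len(train_targets), train_targets
)
# Error measured on data held out from training
test_error = mean_squared_error(
    [constant_prediction] * len(test_targets), test_targets
)
print(resubstitution_error, test_error)
```

In this toy case the test error is larger than the resubstitution error, which is the usual reason for estimating error on data not used in training.
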

**time series** A series of measurements taken at consecutive points in time. Data mining products that handle time series incorporate time-related operators such as moving average. (Also see windowing.)

**time series model** A model that forecasts future values of a time series based on past values. The model form and training of the model usually take into consideration the correlation between values as a function of their separation in time.

**topology** For a neural net, topology refers to the number of layers and the number of nodes in each layer.

**training** Another term (also learning) for estimating a model's parameters based on the data set at hand.

**training set** A training set is a data set of labeled examples used for learning the model with some data mining tool.

**transformation** A re-expression of the data, such as aggregating it, normalizing it, changing its unit of measure, or taking the logarithm of each data item.

**unlabeled example** An unlabeled example is one for which the value of the target variable is not known.

**unsupervised learning** This term refers to the collection of techniques where groupings of the data are defined without the use of a dependent variable. Cluster analysis is an example.

**validation** The process of testing the models with a data set different from the training data set.

**variable** A variable, also referred to as an attribute or feature, takes values from a predefined, problem-dependent set of values, which is called the domain of the variable. Typical real-world data mining problems are of the heterogeneous kind, i.e. the same problem can have variables with very different domains. We can differentiate between two types of domains: nominal and ordered domains.

**variance** The most commonly used statistical measure of dispersion. The first step is to square the deviation of each data item from the average value. Then the average of the squared deviations is calculated to obtain an overall measure of variability.
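
The two steps in the variance entry above translate directly into code. This sketch divides by the number of items, as the entry describes; note that the sample variance used in many statistics packages divides by one less than that. The function name and data are illustrative.

```python
def variance(values):
    """Average of the squared deviations from the mean (divides by n)."""
    average = sum(values) / len(values)
    # Step 1: square the deviation of each data item from the average
    squared_deviations = [(v - average) ** 2 for v in values]
    # Step 2: average the squared deviations
    return sum(squared_deviations) / len(values)


# Hypothetical measurements (illustrative data only)
data = [2, 4, 4, 4, 5, 5, 7, 9]
data_variance = variance(data)
print(data_variance)
```
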

**visualization** Visualization tools graphically display data to facilitate better understanding of its meaning. Graphical capabilities range from simple scatter plots to complex multi-dimensional representations.

**windowing** Used when training a model with time series data. A window is the period of time used for each training case. For example, if we have weekly stock price data that covers fifty weeks, and we set the window to five weeks, then the first training case uses weeks one through five and compares its prediction to week six. The second case uses weeks two through six to predict week seven, and so on.
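
The windowing example above (fifty weeks of data, a five-week window) can be sketched as follows. The function name `make_windows` is hypothetical, and the week numbers stand in for actual prices so the slicing is easy to follow.

```python
def make_windows(series, window):
    """Pair each run of `window` consecutive values with the value that follows it."""
    return [
        (series[start:start + window], series[start + window])
        for start in range(len(series) - window)
    ]


# Stand-in for fifty weekly prices: the value for week n is simply n
weekly_values = list(range(1, 51))
training_cases = make_windows(weekly_values, window=5)

first_inputs, first_target = training_cases[0]
print(first_inputs, "->", first_target)   # weeks one through five predict week six
print(len(training_cases), "training cases")
```

Each training case overlaps the previous one by four weeks, so fifty weeks of data yield forty-five cases with a five-week window.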

© 2001 LIS - Rudjer Boskovic Institute