Saturday, October 16, 2010

Classification in Data Mining

Classifier Tutorial…
·       Classification is the task of assigning objects to one of several predefined categories.
·       A classification technique is a method or model used to perform this classification.
·       The input to such a model is an attribute set and the output is a class label (category).
·       The input data for a classification task is a collection of records. Each record is characterized by a tuple (x, y), where x is the attribute set and y is a special attribute designated as the class label (also known as the category attribute or target attribute).
·       The attributes in x can be either continuous or discrete, but the target attribute must be discrete. This is a key characteristic that distinguishes classification from regression, a predictive modeling task in which y is a continuous attribute.
·       Definition of Classification: Classification is the task of learning a target function f that maps each attribute set x to one of the predefined class labels y (a small code sketch of this setup follows this list).
·       The target function is also known as a classification model.
·       A classification model can be used in descriptive modeling, as an explanatory tool that summarizes the attribute values in x that characterize a specific value of y, and in predictive modeling, the task of predicting the value of the target attribute for unknown records (records whose target attribute is missing).
·       Classification techniques are most suited for predicting or describing data sets with binary or nominal categories. They are less well suited for ordinal categories, because they do not consider the implicit order among the categories, or other relationships (such as subclass–superclass relationships) among the categories.
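
To make the (x, y) notation concrete, here is a minimal Python sketch (written for illustration only, not part of the original material); the attribute names and the hand-coded target function f are assumptions chosen just for this example.

# Each record is a tuple (x, y): x is the attribute set (a dict of
# attribute name -> value) and y is the discrete class label.
records = [
    ({"body_temp": "warm", "gives_birth": "yes"}, "mammal"),
    ({"body_temp": "cold", "gives_birth": "no"},  "non-mammal"),
    ({"body_temp": "warm", "gives_birth": "no"},  "non-mammal"),
]

def f(x):
    # A toy target function f mapping an attribute set x to one of the
    # predefined class labels y; a learned classification model plays this role.
    if x["body_temp"] == "warm" and x["gives_birth"] == "yes":
        return "mammal"
    return "non-mammal"

for x, y in records:
    print(x, "->", f(x), "(actual:", y + ")")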


Approach to Solving a Classification Problem…
·       A classification technique, or a classifier, is a systematic approach to building classification models from an input data set.
·       Examples: decision trees, rule-based classifiers, neural networks, support vector machines and naïve Bayes classifiers.
·       Each of the above techniques employs a learning algorithm to identify the model that best fits the relationship between the attribute set and the class label of the input data. The key objective of the learning algorithm is to build models with good generalization capability; i.e., models that accurately predict the class labels of previously unseen records.
·       The fundamental requirement of a learning algorithm is a training data set, i.e., a data set in which the target attribute is known. This data set is used to build the classification model. Once a classification model has been found, it is applied to a test data set, i.e., a data set containing records with unknown class labels.
·       Evaluation of a classification model is based on the counts of test records correctly and incorrectly predicted by the model. These counts are tabulated in a matrix called the confusion matrix. The table below depicts the confusion matrix for a binary classification problem.

                              Predicted Class
                          Class=1       Class=0
  Actual      Class=1       f11           f10
  Class       Class=0       f01           f00

·       fij in this table denotes the number of records from class i predicted to be of class j; for instance, f01 indicates the number of records from class 0 incorrectly predicted as class 1. The total number of correct predictions is therefore (f11 + f00) and the total number of wrong predictions is (f10 + f01).
·       Accuracy is the ratio of correct predictions to the total number of predictions, i.e., (f11 + f00) / (f11 + f10 + f01 + f00). In contrast, the error rate is the ratio of incorrect predictions to the total number of predictions, i.e., (f10 + f01) / (f11 + f10 + f01 + f00).
·       A good classification model is one with high accuracy and, equivalently, a low error rate. A minimal code sketch of these calculations follows below.
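
The following is a minimal Python sketch of these calculations (the actual and predicted labels are made-up toy values); fij is represented here as a nested dictionary f[i][j].

# Hypothetical actual and predicted class labels for a binary (0/1) test set.
actual    = [1, 1, 0, 0, 1, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]

# f[i][j] = number of records from class i predicted to be of class j.
f = {1: {1: 0, 0: 0}, 0: {1: 0, 0: 0}}
for y, y_hat in zip(actual, predicted):
    f[y][y_hat] += 1

total = len(actual)
accuracy   = (f[1][1] + f[0][0]) / total   # correct predictions / all predictions
error_rate = (f[1][0] + f[0][1]) / total   # wrong predictions / all predictions

print("confusion matrix:", f)
print("accuracy   =", accuracy)    # 0.75 for the toy data above
print("error rate =", error_rate)  # accuracy + error rate == 1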

Decision Trees: Hunt’s Algorithm:

·       A decision tree contains three types of nodes, viz. the root node, internal nodes and leaf nodes.
·       A root node that has no incoming edges and zero or more outgoing edges.
·       Internal nodes, each of which has exactly one incoming edge and two or more outgoing edges.
·       Leaf or terminal nodes, each of which has exactly one incoming edge and no outgoing edges; each leaf node is assigned one of the class labels.
·       The non-terminal nodes, which include the root and internal nodes, contain attribute test conditions to separate records that have different characteristics.
·       One of the well-known methods for building such a model is Hunt’s algorithm.
·       Hunt’s algorithm is a learning algorithm in the sense discussed earlier.
·       In Hunt’s algorithm, a decision tree is grown in a recursive fashion by partitioning the training records into successively purer subsets. Let Dt be the set of training records associated with node t and y = { y1, y2, y3, … , yn } be the set of class labels. The following is a recursive definition of Hunt’s algorithm (a code sketch follows after the remarks below).
1.    If all the records in Dt belong to the same class yt, then t is a leaf node labeled as yt.
2.    If Dt contains records that belong to more than one class, an attribute test condition is selected to partition the records into smaller subsets. A child node is created for each outcome of the test condition and the records in Dt are distributed to the children based on the outcomes. The algorithm is then recursively applied to each child node.
·       It is possible for some of the child nodes created in step 2 to be empty; i.e., there are no records associated with these nodes. This can happen if none of the training records have the combination of attribute values associated with such a node. In this case the node is declared a leaf node with the same class label as the majority class of the training records associated with its parent node.
·       In step 2, if all the records associated with Dt have identical attribute values (except for the class label), then it is not possible to split these records any further. In this case, the node is declared a leaf node with the same label as the majority class of the training records associated with it.
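
Below is a minimal Python sketch of this recursive procedure (written for this post, not taken from any library). For simplicity it tests the next unused attribute at each non-terminal node, and the example records and attribute names are illustrative assumptions.

from collections import Counter

def majority_class(records):
    # Most frequent class label among (x, y) records.
    return Counter(y for _, y in records).most_common(1)[0][0]

def hunt(records, attributes, parent_records=None):
    # records: list of (x, y) with x a dict of attribute values; returns a tree of nested dicts.
    if not records:
        # Empty child node: use the majority class of the parent's records.
        return {"label": majority_class(parent_records)}
    labels = {y for _, y in records}
    if len(labels) == 1:
        # Step 1: all records in Dt belong to the same class -> leaf node.
        return {"label": labels.pop()}
    if not attributes:
        # No attribute left to test (records identical except class) -> majority-class leaf.
        return {"label": majority_class(records)}
    # Step 2: select an attribute test condition, create a child per outcome and recurse.
    attr, rest = attributes[0], list(attributes[1:])
    children = {}
    for value in {x[attr] for x, _ in records}:
        subset = [(x, y) for x, y in records if x[attr] == value]
        children[value] = hunt(subset, rest, records)
    return {"test": attr, "children": children}

records = [
    ({"outlook": "sunny", "windy": "no"},  "play"),
    ({"outlook": "sunny", "windy": "yes"}, "no play"),
    ({"outlook": "rainy", "windy": "yes"}, "no play"),
    ({"outlook": "rainy", "windy": "no"},  "play"),
]
print(hunt(records, ["outlook", "windy"]))

In practice the attribute tested at each node is chosen so that the resulting subsets are as pure as possible, rather than in a fixed order; that choice is the main place where real implementations differ from this sketch.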
Design issues of Decision trees…
·       A learning algorithm must address the following issues…
·       How should the training records be split?
·       How should the splitting procedure stop?
  
Below are some of the common ways of expressing attribute test conditions (a small code sketch follows this list):
Binary attributes - trivial; the test condition generates only two possible outcomes.
Nominal attributes - multi-way split (one outcome per distinct value) or binary split (by grouping the values into two subsets).
Ordinal attributes - multi-way or binary split, as long as the grouping does not violate the order among the attribute values.
Continuous attributes - binary split (by selecting the best split point) or multi-way split (by discretizing the continuous values into ordered intervals).
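
As a rough illustration of these test conditions, here is a short Python sketch (the attribute names marital_status and income and the record values are made up) showing a multi-way split on a nominal attribute and the candidate binary splits on a continuous attribute.

records = [
    {"marital_status": "single",   "income": 60, "class": "no"},
    {"marital_status": "married",  "income": 90, "class": "no"},
    {"marital_status": "divorced", "income": 30, "class": "yes"},
    {"marital_status": "single",   "income": 40, "class": "yes"},
]

# Nominal attribute, multi-way split: one branch per distinct value.
branches = {}
for r in records:
    branches.setdefault(r["marital_status"], []).append(r)
print({value: len(rows) for value, rows in branches.items()})

# Continuous attribute, binary split: candidate split points are midpoints between
# consecutive sorted values; each defines the test income <= v versus income > v.
values = sorted(r["income"] for r in records)
for v in [(a + b) / 2 for a, b in zip(values, values[1:])]:
    left  = [r for r in records if r["income"] <= v]
    right = [r for r in records if r["income"] > v]
    print("income <=", v, ":", len(left), "records;", "income >", v, ":", len(right), "records")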
            
