Classification plug-in
Pattern classification and other machine learning algorithms.
The classification plug-in contains implementations of many learning and classification techniques. This chapter describes the key concepts common to many of them.
Samples and features
In Into, features are represented with random-access iterators. That is, the feature vector type must have an iterator that is indexable using operator[](int). Index 0 is the first feature and index N-1 the last one. Valid feature vector types include double*, std::vector<int>::iterator, and QVariantList::iterator.
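As a minimal sketch of the iterator requirement, the template below accepts any type that can be indexed with operator[](int). The function name sumFeatures is hypothetical and not part of Into; it only illustrates that double* and std::vector<int>::iterator can be used interchangeably.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not part of Into): sums the first `length` features.
// FeatureIterator may be any random-access iterator, for example double* or
// std::vector<int>::iterator. Only operator[](int) is used.
template <class FeatureIterator>
double sumFeatures(FeatureIterator features, int length)
{
  assert(length >= 0);
  double sum = 0;
  for (int i = 0; i < length; ++i)
    sum += features[i]; // index 0 is the first feature, length-1 the last
  return sum;
}
```

Because the template relies on indexing alone, the same code works for raw arrays and for container iterators without any conversion.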
Sample sets
A sample set is a collection of feature vectors $\mathbf{x}_1, \ldots, \mathbf{x}_M$, where M is the number of samples in the set.
To be able to use a data structure as a sample set, Into needs to be able to query and modify it in various ways. The required operations are defined in PiiSampleSet, which wraps the actual data type used to store the samples. The default implementation works with Qt container types (QList, QVector), and a specialization is provided for PiiMatrix. If other types are used, the structure must be specialized correspondingly.
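The idea of wrapping a storage type behind a specializable structure can be sketched with a traits template. Everything below is hypothetical: SampleSetTraits, its member names, and featureMean are illustrative assumptions and do not reproduce the real PiiSampleSet interface. Standard containers are used instead of Qt types to keep the sketch self-contained.

```cpp
#include <vector>

// Hypothetical traits class. It only mirrors the idea behind PiiSampleSet
// and is NOT the real Into interface.
template <class SampleSet> struct SampleSetTraits;

// Specialization for a vector of feature vectors.
template <class T>
struct SampleSetTraits<std::vector<std::vector<T> > >
{
  typedef std::vector<std::vector<T> > SetType;
  typedef typename std::vector<T>::const_iterator ConstFeatureIterator;

  // Query operations.
  static int sampleCount(const SetType& set)
  { return static_cast<int>(set.size()); }
  static int featureCount(const SetType& set)
  { return set.empty() ? 0 : static_cast<int>(set[0].size()); }
  static ConstFeatureIterator sampleAt(const SetType& set, int index)
  { return set[index].begin(); }

  // Modifying operation.
  static void append(SetType& set, const std::vector<T>& sample)
  { set.push_back(sample); }
};

// Generic code written against the traits only: the mean of one feature
// over all samples. Returns 0 for an empty set.
template <class SampleSet>
double featureMean(const SampleSet& set, int feature)
{
  typedef SampleSetTraits<SampleSet> Traits;
  const int count = Traits::sampleCount(set);
  if (count == 0)
    return 0.0;
  double sum = 0;
  for (int i = 0; i < count; ++i)
    sum += Traits::sampleAt(set, i)[feature];
  return sum / count;
}
```

Supporting another storage type would then only require another specialization of the traits; the generic algorithm stays unchanged.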
Labels
A label is stored as a double; when a class index is needed, the double is simply cast to an int. An unknown value is denoted by NaN.
With most algorithms, the size of the label set must match that of the corresponding sample set. That is, each sample must have an associated label. In documentation, labels are usually denoted by c (for class). Sometimes, a sample set is defined as a set of (feature vector, label) pairs. For example, a set of samples with binary classifications can be formally defined as $\{(\mathbf{x}_i, c_i)\}_{i=1}^{M}$, where each $c_i$ takes one of two class values. In code, however, sample and label sets are treated as distinct entities.
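The label conventions described here (double values, a cast to int for class indices, NaN for unknown) can be sketched as follows. The helper names are hypothetical and not part of Into, and returning -1 for an unknown label is a convention of this sketch only.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helpers, not part of Into.

// An unknown label is represented by NaN.
inline double unknownLabel()
{ return std::numeric_limits<double>::quiet_NaN(); }

inline bool isKnown(double label)
{ return !std::isnan(label); }

// Casts a label to a class index. Returns -1 for an unknown label;
// that sentinel is this sketch's own choice.
inline int classIndex(double label)
{ return isKnown(label) ? static_cast<int>(label) : -1; }

// Counts the samples that have a known label. The label set is kept
// separate from the sample set, one label per sample.
inline int countKnown(const std::vector<double>& labels)
{
  int count = 0;
  for (std::size_t i = 0; i < labels.size(); ++i)
    if (isKnown(labels[i]))
      ++count;
  return count;
}
```

Note that NaN never compares equal to anything, including itself, so an explicit std::isnan check is needed rather than an equality test.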
Distance measures
A distance measure is a function that maps a pair of feature vectors to a real number. The definition of a distance is quite relaxed: it is sufficient that the function returns a larger value as the diversity between feature vectors grows. The distance can be negative.
In code, distance measures are functions or function objects that take three arguments: the feature vector of a sample, that of a model, and the number of features to consider. The following two declarations are valid distance measures:
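    // A plain function that works on raw pointers.
    double myDistance(double* sample, double* model, int len);

    // A function object that works on QVector iterators.
    struct MyDistance
    {
      double operator() (QVector<double>::const_iterator sample,
                         QVector<double>::const_iterator model,
                         int len) const;
    };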
Distance measures are used by algorithms such as NN, k-NN and SOM to measure the dissimilarity between code vectors. PiiDistanceMeasure is a polymorphic implementation of the concept and is used when the distance measure needs to be changed at run time.
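As an illustration of the three-argument form, the sketch below defines a squared Euclidean distance and uses it in a simple nearest-model search. Both SquaredDistance and nearestModel are hypothetical names chosen for this example, not Into classes, and standard containers stand in for Qt ones.

```cpp
#include <cassert>
#include <vector>

// Hypothetical distance measure: squared Euclidean distance between a
// sample and a model, considering the first `len` features.
struct SquaredDistance
{
  template <class SampleIterator, class ModelIterator>
  double operator() (SampleIterator sample, ModelIterator model, int len) const
  {
    double sum = 0;
    for (int i = 0; i < len; ++i)
    {
      const double diff = double(sample[i]) - double(model[i]);
      sum += diff * diff;
    }
    return sum;
  }
};

// Returns the index of the model with the smallest distance to `sample`,
// or -1 if there are no models. No assumption is made about the sign of
// the distance, since a distance may be negative.
template <class Measure>
int nearestModel(const std::vector<std::vector<double> >& models,
                 const double* sample, int len, Measure measure)
{
  int best = -1;
  double bestDistance = 0;
  for (int i = 0; i < static_cast<int>(models.size()); ++i)
  {
    assert(static_cast<int>(models[i].size()) >= len);
    const double distance = measure(sample, models[i].begin(), len);
    if (best == -1 || distance < bestDistance)
    {
      best = i;
      bestDistance = distance;
    }
  }
  return best;
}
```

Passing the measure as a template parameter resolves it at compile time; a polymorphic measure would be the alternative when the measure must be swapped at run time.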
Kernels
The kernel trick is a method of converting a hyperplane (linear) classifier into a non-linear one. A kernel function is used to map a linear input space non-linearly into a high-dimensional feature space, in which a linear classifier can find a solution. This is done using Mercer's theorem, which states (approximately) that any continuous, symmetric, positive semi-definite function $k(\mathbf{x}, \mathbf{y})$ can be expressed as a dot product in a high-dimensional space. It follows that $k(\mathbf{x}, \mathbf{y}) = \Phi(\mathbf{x}) \cdot \Phi(\mathbf{y})$, where $\Phi$ is the non-linear mapping function.
An interesting thing about kernels is that one does not need to actually know the mapping function or even the dimensionality of the feature space; they are implicitly defined by the kernel. Practically, replacing dot products in a linear algorithm with a kernel function results in a non-linear variation of the algorithm. To stay linear, one can always use PiiLinearKernel.
Many linear classifiers use a bias term to move the hyperplane off the coordinate system's origin. In Into, the bias term is blatantly ignored with kernel methods. The penalty? Practically none. While the bias is required in the low-dimensional case, the practical effect of omitting it in a high-dimensional space is to reduce the degrees of freedom by one. With kernels such as the Gaussian kernel, the bias term would have no effect anyway. The upside is that neither feature vectors nor kernel functions need to take the possible existence of an extra term into account.
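Replacing the dot product of a linear decision function with a kernel, and leaving out the bias term, can be sketched as follows. The kernel structs and the decision function are hypothetical illustrations, not Into's PiiLinearKernel or Gaussian kernel classes. It is an assumption of this sketch that a kernel takes the same three arguments as a distance measure, and the Gaussian kernel is parameterized by a width sigma.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical linear kernel: the ordinary dot product.
struct LinearKernel
{
  double operator() (const double* a, const double* b, int len) const
  {
    double dot = 0;
    for (int i = 0; i < len; ++i)
      dot += a[i] * b[i];
    return dot;
  }
};

// Hypothetical Gaussian kernel: exp(-|a - b|^2 / (2 sigma^2)).
struct GaussianKernel
{
  explicit GaussianKernel(double width) : sigma(width) {}
  double operator() (const double* a, const double* b, int len) const
  {
    double squared = 0;
    for (int i = 0; i < len; ++i)
    {
      const double diff = a[i] - b[i];
      squared += diff * diff;
    }
    return std::exp(-squared / (2 * sigma * sigma));
  }
  double sigma;
};

// Decision function without a bias term: f(x) = sum_i w_i k(x_i, x).
// With LinearKernel this is a linear classifier; any other kernel gives
// a non-linear variation of the same algorithm.
template <class Kernel>
double decision(const std::vector<std::vector<double> >& vectors,
                const std::vector<double>& weights,
                const double* x, int len, Kernel kernel)
{
  double sum = 0;
  for (std::size_t i = 0; i < vectors.size(); ++i)
    sum += weights[i] * kernel(vectors[i].data(), x, len);
  return sum;
}
```

The sign of the returned value would decide the class in a binary setting. Only the kernel argument changes between the linear and the non-linear variant.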
Namespaces
| namespace | Utility functions and type definitions for common classification tasks. |
| namespace | Contains functions and definitions for accessing sample sets in an abstract way. |
Operations
| class | An operation that classifies samples using a boosted cascade of weak classifiers. |
| class | A superclass for classifier operations. |
| class | An operation that maps indices into other indices. |
| class | An operation that maps class indices to arbitrary data. |
| class | An operation that builds a confusion matrix out of classification results. |
| class | An operation that divides incoming feature vectors by their sum. |
| class | Feature combiner is an operation that combines feature vectors into a larger compound feature vector. |
| class | An operation that classifies samples according to the k nearest neighbors rule. |
| class | An Ydin-compatible Perceptron classifier operation. |
| class | An operation that balances training sets by giving more weight to rare samples. |
| class | An operation that stores names of samples belonging to N different classes. |
| class | An Ydin-compatible SOM classifier operation. |
| class | PiiTableLabelerOperation can be used for classification where the classification rules are given in a table format. |
| class | A superclass for classifier operations that use classifiers derived from PiiVectorQuantizer. |
Other classes
| class | A template that implements the PiiDistanceMeasure interface by using |
| class | A generic implementation of a boosted classifier. |
| class | PiiClassificationException is thrown when errors occur in classification. |
| class | An interface for classification and regression algorithms. |
| class | Confusion matrix is a handy tool for inspecting classification results. |
| class | A primitive learner that works by thresholding a single feature. |
| class | Default implementation of PiiBoostClassifier::Factory. |
| class | Gaussian kernel function. |
| class | K-dimensional tree. |
| class | An implementation of the Kernel Adatron algorithm. |
| class | An implementation of the Kernel Perceptron algorithm. |
| class | K nearest neighbors classifier. |
| class | An interface for learning algorithms. |
| class | A distance measure that combines many distance measures into one. |
| class | Implementation of the Perceptron algorithm. |
| class | A learning algorithm that just collects all incoming data into a sample set. |
| class | An implementation of the self-organizing map (Kohonen map). |
| class | A vector quantizer. |