Classification plug-in

Pattern classification and other machine learning algorithms.

The classification plug-in contains implementations of many learning and classification techniques. This chapter describes the key concepts common to many of them.

Samples and features

Samples are abstract entities represented by N > 0 features. Typically, features are represented as N-dimensional real-valued vectors. A feature can however be a text string, an object boundary represented as a list of coordinates, a graph, or a composition of all of these. Independent of the type of the features, each sample represents a vector in an N-dimensional input space. In documentation, a sample is typically denoted by x.

In Into, features are represented with random-access iterators. That is, the feature vector type must have an iterator that is indexable using operator[](int). Index 0 is the first feature and index N-1 the last one. Valid feature vector types include double*, std::vector<int>::iterator, and QVariantList::iterator.
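The indexing convention can be illustrated with a minimal sketch that uses only the standard library, not Into itself. The helper name featureSum is hypothetical and exists only for this example.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper, not part of Into: sums the first `length` features.
// FeatureIterator can be any type indexable with operator[](int),
// such as double* or std::vector<int>::iterator.
template <class FeatureIterator>
double featureSum(FeatureIterator features, int length)
{
  assert(length > 0); // a sample has N > 0 features
  double sum = 0.0;
  for (int i = 0; i < length; ++i)
    sum += features[i]; // index 0 is the first feature, length-1 the last
  return sum;
}
```

The same template accepts a raw array (featureSum(raw, 3)) and a container iterator (featureSum(v.begin(), 3)) without any changes.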

Sample sets

To actually use samples with learning machines, one needs more than one of them. In Into, a sample set is a randomly accessible collection of samples with functions for querying the size of the set and the number of features. Each sample in a sample set must have an equal number of features. Formally, $S = \{x_1, x_2, \ldots, x_M\}$, where M is the number of samples in the set.

To use a data structure as a sample set, Into needs to be able to query and modify it in various ways. The required operations are defined in PiiSampleSet, which wraps the actual data type used to store the samples. The default implementation works with Qt container types (QList, QVector), and a specialization is provided for PiiMatrix. If other types are used, PiiSampleSet must be specialized for them correspondingly.
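The wrapping idea can be sketched roughly as follows. This is not the actual PiiSampleSet interface: the template SampleSetTraits, its member names, and the helper featureMean are all assumptions made for illustration, and std::vector stands in for the Qt containers.

```cpp
#include <cassert>
#include <vector>

// Hypothetical traits template; NOT the actual PiiSampleSet interface.
template <class Set> struct SampleSetTraits;

// Specialization for a vector of equally sized feature vectors.
template <>
struct SampleSetTraits<std::vector<std::vector<double> > >
{
  typedef std::vector<std::vector<double> > SetType;
  typedef std::vector<double>::const_iterator ConstFeatureIterator;

  // M, the number of samples in the set.
  static int sampleCount(const SetType& set)
  {
    return static_cast<int>(set.size());
  }

  // N, the number of features per sample (0 for an empty set).
  static int featureCount(const SetType& set)
  {
    return set.empty() ? 0 : static_cast<int>(set.front().size());
  }

  // Random access to the features of sample `index`.
  static ConstFeatureIterator sampleAt(const SetType& set, int index)
  {
    assert(index >= 0 && index < sampleCount(set));
    return set[index].begin();
  }
};

// Generic code that only talks to the set through the traits.
template <class Set>
double featureMean(const Set& set, int feature)
{
  typedef SampleSetTraits<Set> Traits;
  const int count = Traits::sampleCount(set);
  assert(count > 0 && feature >= 0 && feature < Traits::featureCount(set));
  double sum = 0.0;
  for (int i = 0; i < count; ++i)
    sum += Traits::sampleAt(set, i)[feature];
  return sum / count;
}
```

Because featureMean touches the data only through the traits, supporting another storage type would require only another specialization, which mirrors the design described above.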

Labels

A label indicates the class to which a sample belongs. In the literature, a class label is typically represented by an integer denoting the index of a class within a discrete set of classes. Into uses QVector<double> as the container for class labels. This allows one to use the same label type for both classification and regression ("continuous classification") tasks. Whenever a class index is needed instead of a continuous output value, the double is simply cast to an int. An unknown value is denoted by NaN.
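A minimal sketch of this label handling, with std::vector<double> standing in for QVector<double>. The function names and the -1 sentinel for unknown labels are assumptions of this example, not Into conventions.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Returns the class index for a label, or -1 if the label is unknown (NaN).
// The -1 sentinel is an assumption of this example.
inline int classIndex(double label)
{
  if (std::isnan(label))
    return -1;
  return static_cast<int>(label); // the cast truncates toward zero
}

// Returns the class index of sample `sampleIndex`, or -1 if it is unknown.
inline int classIndexAt(const std::vector<double>& labels, int sampleIndex)
{
  assert(sampleIndex >= 0 && sampleIndex < static_cast<int>(labels.size()));
  return classIndex(labels[sampleIndex]);
}

// Counts the labels that carry a known value.
inline int knownLabelCount(const std::vector<double>& labels)
{
  int count = 0;
  for (double label : labels)
    if (!std::isnan(label))
      ++count;
  return count;
}
```

Note that NaN must be tested with std::isnan; comparing a label against NaN with == is always false.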

With most algorithms, the size of the label set must match that of the corresponding sample set. That is, each sample must have an associated label. In documentation, labels are usually denoted by c (for class). Sometimes, a sample set is defined as a set of (feature vector, label) pairs. For example, a set of samples with binary classifications can be formally defined as $S = \{(x_i, c_i) \mid i = 1, \ldots, M\}$, where each $c_i$ is one of two class labels. In code, however, sample and label sets are treated as distinct entities.

Distance measures

As the name implies, distance measures are used to measure the dissimilarity or distance between two samples. A distance measure is a function that maps two feature vectors into a real number: $d: \mathbb{R}^N \times \mathbb{R}^N \to \mathbb{R}$. The definition of a distance is quite relaxed: it is sufficient that the function returns a larger value as the dissimilarity between feature vectors grows. The distance can be negative.

In code, distance measures are function objects that take three arguments: the feature vector of a sample, that of a model, and the number of features to consider. The following two declarations are valid distance measures:

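 #include <QVector>

 // A plain function can serve as a distance measure.
 double myDistance(double* sample, double* model, int len);

 // So can a function object. Both iterators are passed by value.
 struct MyDistance
 {
   double operator() (QVector<double>::const_iterator sample,
                      QVector<double>::const_iterator model,
                      int len) const;
 };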

Distance measures are used by algorithms such as NN, k-NN and SOM to measure the dissimilarity between a sample and the code vectors. PiiDistanceMeasure is a polymorphic implementation of the concept and is used when run-time changes to distance measures are needed.
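As an illustration of how such a measure might be used, the following sketch defines a squared Euclidean distance as a function object and uses it in a nearest-neighbor search. It relies only on the standard library; SquaredDistance and closestModel are hypothetical names, not Into classes.

```cpp
#include <cassert>
#include <vector>

// Squared Euclidean distance as a function object with the
// (sample, model, length) interface described above.
struct SquaredDistance
{
  template <class FeatureIterator>
  double operator() (FeatureIterator sample, FeatureIterator model, int length) const
  {
    double sum = 0.0;
    for (int i = 0; i < length; ++i)
      {
        const double diff = sample[i] - model[i];
        sum += diff * diff;
      }
    return sum;
  }
};

// Hypothetical helper: returns the index of the code vector closest to `sample`.
template <class Measure>
int closestModel(const std::vector<double>& sample,
                 const std::vector<std::vector<double> >& models,
                 Measure measure)
{
  assert(!models.empty());
  const int length = static_cast<int>(sample.size());
  int best = 0;
  double bestDistance = measure(sample.begin(), models[0].begin(), length);
  for (int i = 1; i < static_cast<int>(models.size()); ++i)
    {
      const double distance = measure(sample.begin(), models[i].begin(), length);
      if (distance < bestDistance)
        {
          bestDistance = distance;
          best = i;
        }
    }
  return best;
}
```

Since closestModel takes the measure as a template parameter, the distance is fixed at compile time. A polymorphic measure trades that for the ability to switch distances at run time.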

Kernels

Kernels are related to distance measures in that they share the same interface. Their mathematical meaning, however, is quite different.

The kernel trick is a method of converting a hyperplane (linear) classifier into a non-linear one. A kernel function is used in converting a linear input space non-linearly into a high-dimensional feature space, in which a linear classifier can find a solution. This is done using Mercer's theorem, which states (approximately) that any continuous, symmetric, positive semi-definite function can be expressed as a dot product in a high-dimensional space. It follows that $k(x, y) = \langle \Phi(x), \Phi(y) \rangle$, where $\Phi$ is the non-linear mapping function.

An interesting thing about kernels is that one does not need to actually know the mapping function or even the dimensionality of the feature space; they are implicitly defined by the kernel. Practically, replacing dot products in a linear algorithm with a kernel function results in a non-linear variation of the algorithm. To stay linear, one can always use PiiLinearKernel.
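The implicit mapping can be checked numerically for a simple case. The sketch below uses the homogeneous second-degree polynomial kernel k(x, y) = (x . y)^2 on two-dimensional inputs, for which an explicit mapping Φ(x) = (x1^2, sqrt(2) x1 x2, x2^2) is known (Φ is the symbol chosen here for the mapping function). All function names are hypothetical; none of this is Into code.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Plain dot product; a linear kernel in the sense described above.
inline double dotProduct(const double* sample, const double* model, int length)
{
  assert(length > 0);
  double sum = 0.0;
  for (int i = 0; i < length; ++i)
    sum += sample[i] * model[i];
  return sum;
}

// Homogeneous second-degree polynomial kernel: k(x, y) = (x . y)^2.
inline double quadraticKernel(const double* sample, const double* model, int length)
{
  const double dot = dotProduct(sample, model, length);
  return dot * dot;
}

// Explicit mapping into the 3-D feature space. Valid for 2-D inputs only.
inline std::vector<double> quadraticMap(const double* x)
{
  return { x[0] * x[0], std::sqrt(2.0) * x[0] * x[1], x[1] * x[1] };
}

// Dot product computed explicitly in the 3-D feature space.
inline double mappedDotProduct(const double* x, const double* y)
{
  const std::vector<double> fx = quadraticMap(x);
  const std::vector<double> fy = quadraticMap(y);
  return dotProduct(fx.data(), fy.data(), 3);
}
```

Up to rounding error, quadraticKernel(x, y, 2) and mappedDotProduct(x, y) return the same value, but the kernel never constructs the mapped vectors. For kernels such as the Gaussian, the feature space is infinite-dimensional and only the kernel form is computable.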

Many linear classifiers use a bias term to move the hyperplane off the coordinate system's origin. In Into, the bias term is simply ignored with kernel methods. The penalty? Practically none. While the bias is required in the low-dimensional case, the practical effect of omitting it in a high-dimensional space is to reduce the degrees of freedom by one. With kernels such as the Gaussian kernel, the bias term would have no effect anyway. The upside is that neither feature vectors nor kernel functions need to take the possible existence of an extra term into account.

Namespaces

namespace

Utility functions and type definitions for common classification tasks.

namespace

Contains functions and definitions for accessing sample sets in an abstract way.

Operations

class

An operation that classifies samples using a boosted cascade of weak classifiers.

class

A superclass for classifier operations.

class

An operation that maps indices into other indices.

class

An operation that maps class indices to arbitrary data.

class

An operation that builds a confusion matrix out of classification results.

class

An operation that divides incoming feature vectors by their sum.

class

Feature combiner is an operation that combines feature vectors into a larger compound feature vector.

class

An operation that classifies samples according to the k nearest neighbors rule.

class

An Ydin-compatible Perceptron classifier operation.

class

An operation that balances training sets by giving more weight to rare samples.

class

An operation that stores names of samples belonging to N different classes.

class

An Ydin-compatible SOM classifier operation.

class

PiiTableLabelerOperation can be used for classification where the classification rules are given in a table format.

class

A superclass for classifier operations that use classifiers derived from PiiVectorQuantizer.

Other classes

class

A template that implements the PiiDistanceMeasure interface by using Measure as the distance measure implementation.

class

A generic implementation of a boosted classifier.

class

PiiClassificationException is thrown when errors occur in classification.

class

An interface for classification and regression algorithms.

class

Confusion matrix is a handy tool for inspecting classification results.

class

A primitive learner that works by thresholding a single feature.

class

Default implementation of PiiBoostClassifier::Factory.

class

Gaussian kernel function.

class

K-dimensional tree.

class

An implementation of the Kernel Adatron algorithm.

class

An implementation of the Kernel Perceptron algorithm.

class

K nearest neighbors classifier.

class

An interface for learning algorithms.

class

A distance measure that combines many distance measures into one.

class

Implementation of the Perceptron algorithm.

class

A learning algorithm that just collects all incoming data into a sample set.

class

An implementation of the self-organizing map (Kohonen map).

class

A vector quantizer.
