Tests are the only way to estimate the quality of a machine learning algorithm in practice. To show that your algorithm is usable, you need to design good tests. To design a test, you collect data and split it into a train set and a test set.
Machine learning theorists say that the train and test sets should come from a single probability distribution (unless they are talking about transfer learning or some kinds of on-line learning). In practice it is a bit more complicated. We are currently working on the problem of laser scan point classification, and designing tests for it is not a trivial task! We have scans from different domains (aerial scans, scans from moving vehicles, stationary terrestrial scans), and for each domain we would like to have a kind of universal benchmark. This means that a wide range of algorithms is supposed to be evaluated with the test, so the test must not encourage overfitting.
So, how can we split the data? To satisfy the single-distribution requirement, we could put the odd points of the cloud into the train set and the even points into the test set. This is a bad idea. Suppose your classifier uses the 3D coordinates of a point as features. For each point in the test set there is a very similar point in the train set, so even such a primitive learner reaches nearly 100% precision. Such a benchmark is not challenging enough.
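To make the leak concrete, here is a minimal sketch on synthetic data. Everything in it is an assumption made for illustration: the `make_scan` generator, the two alternating classes, the noise levels, and the choice of a 1-nearest-neighbour learner on raw coordinates. It compares an odd/even split of one scan with a test on a separate scan that has a different layout.

```python
import math
import random

def make_scan(n_points, seed, shift=0):
    """Hypothetical scan: points ordered along the scanner path, labelled in runs."""
    rng = random.Random(seed)
    points, labels = [], []
    for i in range(n_points):
        x = i * 0.1                           # scanner moves along x
        y = rng.uniform(-0.01, 0.01)          # small measurement noise
        z = rng.uniform(-0.01, 0.01)
        points.append((x, y, z))
        labels.append(((i + shift) // 25) % 2)  # classes alternate every 25 points
    return points, labels

def predict(train_pts, train_lbls, query):
    """1-nearest-neighbour on raw 3D coordinates."""
    best = min(range(len(train_pts)), key=lambda j: math.dist(train_pts[j], query))
    return train_lbls[best]

def accuracy(train_pts, train_lbls, test_pts, test_lbls):
    hits = sum(predict(train_pts, train_lbls, q) == y
               for q, y in zip(test_pts, test_lbls))
    return hits / len(test_pts)

# Odd/even split of a single scan: every test point has a twin in the train set.
points, labels = make_scan(400, seed=0)
train_pts, train_lbls = points[0::2], labels[0::2]
test_pts, test_lbls = points[1::2], labels[1::2]
interleaved_accuracy = accuracy(train_pts, train_lbls, test_pts, test_lbls)

# Same learner, tested on a different scan whose objects sit elsewhere.
other_pts, other_lbls = make_scan(400, seed=1, shift=12)
cross_scan_accuracy = accuracy(train_pts, train_lbls, other_pts, other_lbls)

print(f"interleaved split: {interleaved_accuracy:.2f}")
print(f"separate scan:     {cross_scan_accuracy:.2f}")
```

In this toy setup the interleaved split looks far better than the separate-scan test, although the learner has only memorised coordinates.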
Well, let's split every scan into a few pieces then. If we compose the test set from different subscans, does that solve the problem? Not at all. For example, we have a number of aerial scans. They can be taken from different heights, with different scanners, in different weather. So, if we add pieces of a single scan to both the test set and the train set, we again get a non-challenging test. The rule is: if we want to train the classifier once for the whole domain, pieces of a single scan must not appear in both the test set and the train set. Do the test set and the train set then come from a single distribution? No! But here we have to neglect the theory in favour of practice.
Common cross-validation schemes include:
- Repeated random sub-sampling validation
- K-fold cross-validation
- Leave-one-out cross-validation
According to the rule, only k-fold cross-validation can be used, and each fold should contain points from its own scans. But labelling scans is very laborious: it takes more than 20 hours to label a standard million-point scan. So, we cannot have many scans labelled.
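The scan-wise rule can be sketched as a small fold generator. This is only an illustration under assumptions: the function name `scan_wise_folds`, the per-point `scan_ids` list, and the round-robin assignment of scans to folds are all hypothetical choices, not a description of our actual tooling.

```python
def scan_wise_folds(scan_ids, k):
    """Yield (train_idx, test_idx) pairs such that no scan is split across the two sets.

    scan_ids: one scan identifier per point (hypothetical input format).
    """
    unique_scans = sorted(set(scan_ids))
    if k > len(unique_scans):
        raise ValueError("need at least as many labelled scans as folds")
    # Assign whole scans to folds round-robin, so each fold owns its scans.
    fold_of_scan = {scan: n % k for n, scan in enumerate(unique_scans)}
    for fold in range(k):
        test_idx = [i for i, s in enumerate(scan_ids) if fold_of_scan[s] == fold]
        train_idx = [i for i, s in enumerate(scan_ids) if fold_of_scan[s] != fold]
        yield train_idx, test_idx

# Hypothetical example: seven points coming from three labelled scans.
scan_ids = ["a", "a", "b", "b", "b", "c", "c"]
folds = list(scan_wise_folds(scan_ids, k=3))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

With only a handful of labelled scans, `k` is bounded by the number of scans, which is exactly why the labelling cost hurts. If scikit-learn is available, its `GroupKFold` splitter gives the same kind of grouping guarantee when the scan identifiers are passed as `groups`.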
This is not the only problem with testing point classification. Since a single point does not tell us anything, we should consider some neighbourhood of it and approximate that neighbourhood with a surface. So for every point in both sets, its neighbourhood must be available. This problem is also solved if you put the whole scan into one set.