## Testing machine learning algorithms

Tests are the only way to estimate the quality of a machine learning algorithm in practice. To show that your algorithm is usable, you need to design good tests. To design a test, you collect data and split it into a train set and a test set.

Machine learning theorists say that the train and test sets should come from a single probability distribution (unless they are talking about transfer learning or some kinds of online learning). In practice it is a bit more complicated. We are currently working on the problem of classifying laser scan points, and designing tests for it is not a trivial task! We have scans from different domains (aerial scans, scans from moving vehicles, stationary terrestrial scans), and for each domain we would like to have a kind of universal benchmark. That means a wide range of algorithms will be evaluated on the same test, so the test must not encourage overfitting.

So, how can we split the data? To satisfy the single-distribution requirement, we could put the odd-numbered points of the cloud into the train set and the even-numbered points into the test set. This is a bad idea. Suppose your classifier uses the 3D coordinates of a point as features. For each point in the test set there is a nearly identical point in the train set, so even such a primitive learner gets nearly 100% precision. Such a benchmark is not challenging enough.
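
To make the leak concrete, here is a minimal sketch on a synthetic "scan". Everything in it is an assumption made for illustration: the point spacing, the two class names, and the 1-nearest-neighbour rule standing in for the primitive coordinate-based learner. It reports plain accuracy rather than per-class precision.

```python
import math
import random


def make_scan(n_points=400, seed=0):
    """Synthetic 'scan': points ordered along a line, labelled in blocks of 50."""
    rng = random.Random(seed)
    points, labels = [], []
    for i in range(n_points):
        x = i * 0.05                      # consecutive points are 5 cm apart
        y = rng.uniform(-0.01, 0.01)      # small lateral noise
        z = rng.uniform(0.0, 0.02)        # small height noise
        points.append((x, y, z))
        labels.append("building" if (i // 50) % 2 else "ground")
    return points, labels


def nearest_neighbour_label(query, train_points, train_labels):
    """1-NN on raw 3D coordinates (brute force)."""
    best = min(range(len(train_points)),
               key=lambda j: math.dist(query, train_points[j]))
    return train_labels[best]


def accuracy(test_points, test_labels, train_points, train_labels):
    """Fraction of test points whose nearest train point has the same label."""
    hits = sum(
        nearest_neighbour_label(p, train_points, train_labels) == y
        for p, y in zip(test_points, test_labels)
    )
    return hits / len(test_points)


points, labels = make_scan()
# Odd-indexed points go to the train set, even-indexed points to the test set.
train_p, train_l = points[1::2], labels[1::2]
test_p, test_l = points[0::2], labels[0::2]
interleaved_accuracy = accuracy(test_p, test_l, train_p, train_l)
print(f"accuracy with interleaved split: {interleaved_accuracy:.3f}")
```

The score comes out close to 1 although the "model" has learned nothing that would transfer to another scan. Only test points sitting exactly on a class boundary can be misclassified.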

Well, let's split every scan into a few pieces then. If we compose the test set from different subscans, does that solve the problem? Not at all. For example, we have a number of aerial scans. They can be taken from different heights, with different scanners, in different weather. So if we add pieces of a single scan to both the test set and the train set, we get a non-challenging test again. The rule is: pieces of a single scan must not appear in both the test set and the train set if we want to train the classifier once for the whole domain. Do the test set and the train set come from a single distribution? No! But we have to neglect the theory in favour of practice.
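
The rule can be enforced mechanically if every piece (or point) carries the identifier of the scan it came from. The sketch below assumes exactly that; `split_by_scan`, the scan names and the 30% test fraction are hypothetical choices, not part of our actual pipeline.

```python
import random


def split_by_scan(scan_ids, test_fraction=0.3, seed=0):
    """Assign whole scans to either the train set or the test set.

    scan_ids holds one scan identifier per piece (or per point).
    Returns (train_indices, test_indices).
    """
    unique_scans = sorted(set(scan_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_scans)
    # At least one scan always goes to the test set.
    n_test = max(1, round(test_fraction * len(unique_scans)))
    test_scans = set(unique_scans[:n_test])
    train_idx = [i for i, s in enumerate(scan_ids) if s not in test_scans]
    test_idx = [i for i, s in enumerate(scan_ids) if s in test_scans]
    return train_idx, test_idx


# Hypothetical example: 4 aerial scans, each cut into 3 pieces.
piece_scan_ids = ["scan_a"] * 3 + ["scan_b"] * 3 + ["scan_c"] * 3 + ["scan_d"] * 3
train_idx, test_idx = split_by_scan(piece_scan_ids)
train_scans = {piece_scan_ids[i] for i in train_idx}
test_scans = {piece_scan_ids[i] for i in test_idx}
print("train scans:", sorted(train_scans))
print("test scans:", sorted(test_scans))
```

Because the shuffle operates on scan identifiers rather than on pieces, no scan can end up on both sides, however finely it was cut.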

One could say that it is reasonable to use cross-validation here. Well, that makes sense. According to Wikipedia, there are three types of cross-validation:
• Repeated random sub-sampling validation
• K-fold cross-validation
• Leave-one-out cross-validation
According to the rule, only k-fold cross-validation can be used, and each fold should contain points from its own scans. But labelling scans is very laborious: it takes more than 20 hours to label a standard million-point scan. So we cannot have many labelled scans.
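
A k-fold scheme in which each fold owns its scans could look roughly like this. It is a sketch under the same assumption as before (one scan identifier per point); the function name and the round-robin assignment of scans to folds are illustrative choices.

```python
def scan_level_folds(scan_ids, k):
    """Yield (train_indices, test_indices) for k folds made of whole scans."""
    unique_scans = sorted(set(scan_ids))
    if k > len(unique_scans):
        raise ValueError("need at least as many labelled scans as folds")
    # Round-robin assignment: scan j goes to fold j mod k.
    fold_of_scan = {s: j % k for j, s in enumerate(unique_scans)}
    for fold in range(k):
        test_idx = [i for i, s in enumerate(scan_ids) if fold_of_scan[s] == fold]
        train_idx = [i for i, s in enumerate(scan_ids) if fold_of_scan[s] != fold]
        yield train_idx, test_idx


# Hypothetical data set: 3 labelled scans with 4, 2 and 3 points.
point_scan_ids = ["s1"] * 4 + ["s2"] * 2 + ["s3"] * 3
folds = list(scan_level_folds(point_scan_ids, k=3))
for train_idx, test_idx in folds:
    print(len(train_idx), "train points,", len(test_idx), "test points")
```

With k equal to the number of labelled scans this degenerates into leave-one-scan-out, which is probably the realistic setting when only a handful of scans are labelled. scikit-learn's `GroupKFold` implements the same idea if you already use that library.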

This is not the only problem with testing point classification. Since a single point does not tell us anything, we have to consider some neighbourhood of it and approximate it with a surface. So for every point in both sets there should be some neighbourhood available. This problem is also solved if you put each whole scan into one set.
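
As an illustration of what "approximate the neighbourhood with a surface" can mean, the sketch below collects the neighbours of a point within a radius and fits a least-squares plane z = a·x + b·y + c to them. The plane model, the brute-force search and the 1.5 radius are assumptions chosen for brevity. A plane written as z = f(x, y) cannot represent vertical walls, so a real feature extractor would more likely use PCA or another surface model.

```python
import math


def radius_neighbours(points, index, radius):
    """Indices of all points within `radius` of points[index] (brute force)."""
    centre = points[index]
    return [j for j, p in enumerate(points) if math.dist(centre, p) <= radius]


def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def fit_plane(neighbourhood):
    """Least-squares fit of z = a*x + b*y + c; returns (a, b, c)."""
    n = len(neighbourhood)
    sx = sum(x for x, y, z in neighbourhood)
    sy = sum(y for x, y, z in neighbourhood)
    sz = sum(z for x, y, z in neighbourhood)
    sxx = sum(x * x for x, y, z in neighbourhood)
    sxy = sum(x * y for x, y, z in neighbourhood)
    syy = sum(y * y for x, y, z in neighbourhood)
    sxz = sum(x * z for x, y, z in neighbourhood)
    syz = sum(y * z for x, y, z in neighbourhood)
    # Normal equations A @ (a, b, c) = rhs, solved with Cramer's rule.
    A = [[sxx, sxy, sx], [sxy, syy, sy], [sx, sy, n]]
    rhs = [sxz, syz, sz]
    d = det3(A)
    if abs(d) < 1e-12:
        raise ValueError("degenerate neighbourhood (too few or collinear points)")
    solution = []
    for col in range(3):
        m = [row[:] for row in A]
        for r in range(3):
            m[r][col] = rhs[r]
        solution.append(det3(m) / d)
    return tuple(solution)


# Hypothetical patch of a sloped roof: a 5x5 grid on z = 0.5x - 0.2y + 1.
cloud = [(float(x), float(y), 0.5 * x - 0.2 * y + 1.0)
         for x in range(5) for y in range(5)]
neighbour_idx = radius_neighbours(cloud, index=12, radius=1.5)
a, b, c = fit_plane([cloud[j] for j in neighbour_idx])
print(f"{len(neighbour_idx)} neighbours, plane: z = {a:.3f}x + {b:.3f}y + {c:.3f}")
```

If the neighbours of a test point were spread across the train set, this fit would either see only a thinned-out neighbourhood or silently use training points, which is another reason to keep each scan in one piece.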

### Responses to "Testing machine learning algorithms"

1. hr0nix says:

Can you state your problem more precisely? Do you need to classify points (what classes do you have, btw?) from the same type of scan (aerial, from a vehicle, etc.) using a single classifier trained on points from several distinct scans (which may have been made with different scanners)?

Btw, as far as I know, one of the best ways to do CV is to combine k-fold CV with random subsampling in a 5-times 2-fold cross-validation process (http://web.engr.oregonstate.edu/~tgd/publications/nc-stats.ps.gz)

2. Unknown says:

Actually, I cannot. =) There are several possible statements; the most interesting is, as you put it, to train a single classifier on different scans taken in different places/conditions. But I'm still not sure it is possible to do that effectively. Probably we should have some general model that is then specialised for the particular test in some way (transfer learning could be useful here).

Classes are usually: ground, tree (forest), building; sometimes: car, wire, pole, low vegetation, fence, etc. It also depends on the type of scan.

As for random subsampling cross-validation, I'm afraid it is not an option. First, there is a high probability that a point in the test set has a neighbouring point in the train set (which violates "the rule"). Second, we should treat groups of points (or even whole scans) as atomic, since we need to approximate the scan with a surface (either implicitly or explicitly); that is good for MRF/CRF too.
