Photo: Tommaso Dorigo
The problem is known, in statistical terms, as one of anomaly detection. In other words, you have an otherwise homogeneous dataset (with many different features for each event, or for each "example" if you are a statistician) which may or may not be contaminated by a small number of extraneous events, drawn from a different multi-dimensional probability density function.
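To make the setup concrete, here is a minimal sketch in Python of that kind of contaminated dataset and an off-the-shelf anomaly detector applied to it. The toy distributions, sample sizes, and the choice of scikit-learn's IsolationForest are my own assumptions for illustration, not the tools or data discussed in the article.

```python
# Toy version of the anomaly-detection setup: a large "background" sample
# drawn from one multi-dimensional density, contaminated by a small admixture
# of events drawn from a different density. All numbers are illustrative.
import numpy as np
from sklearn.ensemble import IsolationForest  # one generic, off-the-shelf anomaly detector

rng = np.random.default_rng(0)

n_bkg, n_sig, n_features = 10_000, 100, 5                  # ~1% contamination, 5 features per event
background = rng.normal(0.0, 1.0, size=(n_bkg, n_features))
signal = rng.normal(2.0, 0.5, size=(n_sig, n_features))    # hypothetical extraneous component
data = np.vstack([background, signal])

# Fit the detector on the (mostly background) data and score every event;
# lower scores mark events that look anomalous with respect to the bulk.
detector = IsolationForest(contamination=0.01, random_state=0).fit(data)
scores = detector.decision_function(data)
print("most anomalous events:", np.argsort(scores)[:10])
```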
In statistics there is a result, the Neyman-Pearson lemma, which states that for "simple" hypothesis testing (when, e.g., you want to compare the "null" hypothesis that your data are drawn only from a background distribution against an "alternative" hypothesis that the data contain both background and a specified signal) the most powerful test statistic is the likelihood ratio of the two densities (describing signal and background). No machine learning or god-given algorithm can do better than that. On the other hand, if you do NOT know the density of the signal, then the alternative hypothesis is unspecified. In that situation no test statistic can claim to be the most powerful at distinguishing the null from the alternative hypothesis, because the power of any given test statistic will depend on the unknown features of the signal. In other words, it does not matter how fast you run if you don't know where you are going.
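The following short Python sketch illustrates the "simple vs. simple" case of the lemma with one-dimensional Gaussian densities of my own choosing (not anything from the article): once both the background and the signal densities are fully specified, the event-by-event likelihood ratio is itself the optimal discriminant, and cutting on it gives the most powerful test at any fixed false-positive rate.

```python
# Toy illustration of the Neyman-Pearson likelihood-ratio test with two
# fully specified ("simple") hypotheses. The Gaussian densities below are
# assumptions made purely for illustration.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

f_bkg = norm(loc=0.0, scale=1.0)   # null hypothesis: background only
f_sig = norm(loc=1.5, scale=1.0)   # alternative: a specified signal density

x = rng.normal(0.0, 1.0, size=5000)          # events drawn under the null hypothesis

# Likelihood ratio for each event; thresholding this quantity is, by the
# Neyman-Pearson lemma, the most powerful test at any fixed false-positive rate.
likelihood_ratio = f_sig.pdf(x) / f_bkg.pdf(x)

threshold = np.quantile(likelihood_ratio, 0.95)   # ~5% background acceptance
print(f"cut on likelihood ratio at {threshold:.2f} keeps ~5% of background events")
```

Note that when the signal density is unknown, the numerator of that ratio cannot be written down, which is exactly why no single test statistic is uniformly most powerful in the anomaly-detection setting.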
So, the win of a basic statistical learning tool over complex deep learning tools should not surprise you.
Additional resources
Machine Learning For Jets: A Workshop In New York by Tommaso Dorigo, an experimental particle physicist who works for the INFN at the University of Padova.
Source: Science 2.0