Big data is all around us. Yet it is common to hear data scientists and researchers doing analytics say that they need more data. How is that possible, and where does this eagerness to get more data come from?
Very often, data scientists need lots of data to train sophisticated machine-learning models. The same applies when using machine-learning algorithms for cybersecurity. Lots of data is needed in order to build classifiers that identify, among many different targets, malicious behavior and malware infections. In this context, the eagerness to get vast amounts of data comes from the need to have enough positive samples — such as data from real threats and malware infections — that can be used to train machine-learning classifiers.
Is the need for large amounts of data really justified? It depends on the problem that machine learning is trying to solve. But exactly how much data is needed to train a machine-learning model should always be considered together with the choice of features being used.
Features are the set of information that’s provided to characterize a given data sample. Sometimes the number of features available is not directly under control because it comes from sophisticated data pipelines that can’t be easily modified. In other cases, it’s relatively easy to access new features from existing data samples, or properly pre-process data to build new and more interesting features. This process is sometimes known as "feature engineering."
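As a loose illustration of what feature engineering can look like in a security setting, the sketch below derives ratio, rate, and domain-knowledge features from hypothetical raw network-flow records. The column names, values, and port list are illustrative assumptions, not details from the article.

```python
# A minimal feature-engineering sketch on hypothetical network-flow records:
# derive new, more informative inputs for a classifier than the raw fields alone.
import pandas as pd

# Hypothetical raw samples: one row per network flow observed on a host.
flows = pd.DataFrame({
    "bytes_sent":     [1200, 45000, 300, 98000],
    "bytes_received": [800,  1200,  150, 500],
    "duration_sec":   [2.0,  30.0,  0.5, 120.0],
    "dest_port":      [443,  8080,  53,  4444],
})

features = pd.DataFrame()
# Ratio features often say more than raw counts: a large upload/download
# imbalance can hint at data exfiltration.
features["upload_ratio"] = flows["bytes_sent"] / (flows["bytes_received"] + 1)
# Rate features normalize traffic volume by time.
features["bytes_per_sec"] = (
    (flows["bytes_sent"] + flows["bytes_received"]) / flows["duration_sec"]
)
# Simple domain-knowledge flag: traffic to ports outside a common set.
features["uncommon_port"] = (~flows["dest_port"].isin([53, 80, 443])).astype(int)

print(features)
```

The derived columns could then be fed to any standard classifier; the point is that each engineered feature encodes knowledge about what makes behavior suspicious, rather than relying on raw fields alone.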
Machine-learning books emphasize the importance of carefully choosing the right features to train a machine-learning algorithm. This is an important consideration, because an endless amount of training data, if paired with the wrong set of features, will not produce a reliable model.
Source: Computerworld