Which algorithm to choose?

A question that often arises is which of the many algorithms available to an analyst or data scientist should be used.
The answer depends on the size, quality and type of the data, and also on the nature of the insight one wishes to obtain.
In addition, when selecting an algorithm, one should take into account its capabilities and limitations. In many cases, one tries several algorithms and checks which yields the best results.
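Trying several algorithms and keeping the one with the best results can be sketched as follows. This is a minimal illustration, not a recommended workflow: the data is synthetic (an assumption, generated as y = 3x + 2 plus noise), the two candidates (a mean baseline and a one-variable linear regression) are chosen only for brevity, and the helper names are hypothetical. The key point is that each candidate is judged on rows it did not see during fitting.

```python
import random

def fit_mean(xs, ys):
    """Baseline model: always predict the average of the training targets."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_line(xs, ys):
    """Linear regression with one explanatory variable, by least squares."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def mean_squared_error(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Synthetic data (assumption for illustration): y = 3x + 2 plus noise.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [3 * x + 2 + rng.gauss(0, 1) for x in xs]

# Hold out the last 50 rows; the models never see them during fitting.
train_x, test_x = xs[:150], xs[150:]
train_y, test_y = ys[:150], ys[150:]

candidates = {"mean baseline": fit_mean, "linear regression": fit_line}
scores = {
    name: mean_squared_error(fit(train_x, train_y), test_x, test_y)
    for name, fit in candidates.items()
}
best = min(scores, key=scores.get)  # lowest held-out error wins
print(scores, "-> best:", best)
```

In practice the candidate list would hold the algorithms under consideration, and the error measure would be whatever matters for the business problem.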

Algorithms are generally grouped into three families: supervised, unsupervised and reinforcement learning.

Supervised

Supervised algorithms make predictions based on a set of historical data. For example, one can use historical USD rates to predict risk or future dollar exchange rates. Each item of data used for learning is labeled with its known value, in this case the US dollar exchange rate. The algorithm learns from these labeled items and seeks patterns in them. It can use any piece of data which may be relevant: day of week, season, exchange rates of other currencies, market interest rate, the occurrence of certain events on the national or international level, etc. Once the algorithm has found the best structure available, it can use this pattern to produce a prediction for new items, for which only the explanatory data is known.

This type of algorithm is frequently used in the industry and is easy to use and understand. There are "Classification" algorithms, where data is used to predict a category, and "Regression" algorithms, where one wishes to predict a numeric value (blood pressure, share price, height, weight etc.). Examples of such algorithms include Random Forests, Decision Trees, Logistic Regression, Linear Regression and Neural Networks.
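The difference between classification and regression can be seen in a small sketch. It uses k-nearest neighbours, which is not one of the algorithms listed above but is short enough to write out in full; the data and function names are hypothetical. The same neighbours are looked up in both cases, and only the final step differs: a majority vote yields a category, an average yields a value.

```python
from collections import Counter

def nearest(train, x, k):
    """Return the targets of the k training rows closest to x (one feature)."""
    ranked = sorted(train, key=lambda row: abs(row[0] - x))
    return [target for _, target in ranked[:k]]

def classify(train, x, k=3):
    """Classification: the answer is a category (majority vote)."""
    return Counter(nearest(train, x, k)).most_common(1)[0][0]

def regress(train, x, k=3):
    """Regression: the answer is a number (average of the neighbours)."""
    values = nearest(train, x, k)
    return sum(values) / len(values)

# Hypothetical labeled data: height in cm -> shirt size, height -> weight in kg.
size_by_height = [(150, "S"), (155, "S"), (160, "S"),
                  (175, "L"), (180, "L"), (185, "L")]
weight_by_height = [(150, 50.0), (155, 54.0), (160, 58.0),
                    (175, 72.0), (180, 77.0), (185, 82.0)]

predicted_size = classify(size_by_height, 178)     # a category
predicted_weight = regress(weight_by_height, 178)  # a numeric value
print(predicted_size, predicted_weight)
```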

Example scenario: when trying to ascertain which customers are appropriate for a certain marketing offer, one collects a set of past customers, some of whom have the product or service and some of whom do not. Customers who have the product are marked with "1" and customers without it are marked with "0". The model learns, based on this historical data, which customers are most likely to purchase the product. If the model is good, new customers can be run through it, and each is assigned a score that represents the probability that they will purchase the product. Those who score the highest are then passed on to the salespersons or the call center for follow-up.
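The scenario above might look roughly like this in code. It is a sketch under stated assumptions: the customer history is synthetic, the two explanatory variables (visits and tenure, both scaled to 0..1) are invented for illustration, and logistic regression is fitted with a plain gradient descent loop rather than a library, so that the example stays self-contained.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(rows, labels, lr=0.5, epochs=1000):
    """Fit logistic regression weights by batch gradient descent."""
    n_features = len(rows[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(rows, labels):
            error = sigmoid(bias + sum(w * v for w, v in zip(weights, x))) - y
            for j, v in enumerate(x):
                grad_w[j] += error * v
            grad_b += error
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias

def score(model, x):
    """Estimated probability that a customer with features x buys."""
    weights, bias = model
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x)))

# Synthetic past customers (assumption): [visits, tenure], label 1 = has product.
rng = random.Random(42)
history, has_product = [], []
for _ in range(400):
    visits, tenure = rng.random(), rng.random()
    p_buy = sigmoid(4 * visits + 2 * tenure - 3)
    history.append([visits, tenure])
    has_product.append(1 if rng.random() < p_buy else 0)

model = train_logistic(history, has_product)

# Score new customers and rank them, highest probability first.
new_customers = {"A": [0.9, 0.8], "B": [0.1, 0.2], "C": [0.5, 0.5]}
scores = {name: score(model, x) for name, x in new_customers.items()}
leads = sorted(scores, key=scores.get, reverse=True)
print(leads, scores)
```

The head of the `leads` list is what would be handed to the salespersons or the call center; how many names to pass on is a business decision, not something the model determines.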

Unsupervised

Unsupervised algorithms work on data that carries no labels. Their goal is to organize the data in a new way or to describe its existing structure, typically by finding segments or branches of similar items. Using this method, complex data can become more structured and accessible. Examples of such algorithms: k-means, hierarchical clustering etc.

Example: As I wrote in a previous post, you can produce a customer segmentation in order to determine which stage of the customer cycle each customer is in, and to customize their experience and marketing offers for maximum effect. Once we know the segment a customer belongs to, we can produce a tailor-made customer journey to maximize ROI. In a previous example, I used k-means for this purpose.
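For readers who have not seen it, k-means can be sketched in a few lines. This is not the code from the earlier post: the customer data (orders per year, average basket value) is synthetic, the function names are hypothetical, and the starting centroids are picked with a simple farthest-point rule so that the run is deterministic. The loop alternates between assigning each customer to the nearest centroid and moving each centroid to the mean of its members.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest_centroid(point, centroids):
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))

def kmeans(points, k, iterations=20):
    # Deterministic start: the first point, then repeatedly the point farthest
    # from the centroids chosen so far (a simple choice for this sketch).
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        )
    for _ in range(iterations):
        # Assignment step: label each point with its nearest centroid.
        labels = [nearest_centroid(p, centroids) for p in points]
        # Update step: move each centroid to the mean of its members.
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = tuple(sum(dim) / len(members) for dim in zip(*members))
    labels = [nearest_centroid(p, centroids) for p in points]
    return centroids, labels

# Synthetic customers (assumption): (orders per year, average basket value).
rng = random.Random(1)
occasional = [(rng.uniform(1, 3), rng.uniform(15, 25)) for _ in range(30)]
frequent = [(rng.uniform(18, 22), rng.uniform(70, 90)) for _ in range(30)]
customers = occasional + frequent

centroids, labels = kmeans(customers, k=2)
print(centroids)
```

With real data the explanatory variables usually sit on very different scales, so they are typically standardized first; otherwise the variable with the largest range dominates the distance.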

Reinforcement learning

The algorithm is asked to choose an action in response to data. The reinforcement learning algorithm then receives a feedback grade, depending on the quality of the response. The algorithm tunes its responses in order to receive the highest grades. These algorithms are rarer in our industry and are used more in fields such as autonomous vehicles, where deep reinforcement learning is applied.
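The choose, get graded, adjust loop can be illustrated with the simplest reinforcement learning setting, a multi-armed bandit with an epsilon-greedy rule. This is a deliberately reduced sketch: there are no states and nothing "deep" about it, and the reward probabilities and function name are assumptions made up for the example.

```python
import random

def run_bandit(reward_probabilities, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy: mostly pick the best-looking action, sometimes explore."""
    rng = random.Random(seed)
    n_actions = len(reward_probabilities)
    counts = [0] * n_actions       # how often each action was chosen
    estimates = [0.0] * n_actions  # running average grade per action
    total_reward = 0
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(n_actions)  # explore
        else:
            action = max(range(n_actions), key=lambda a: estimates[a])  # exploit
        # Feedback grade: 1 or 0, drawn with the action's hidden probability.
        reward = 1 if rng.random() < reward_probabilities[action] else 0
        counts[action] += 1
        # Incremental update of the running average for the chosen action.
        estimates[action] += (reward - estimates[action]) / counts[action]
        total_reward += reward
    return estimates, counts, total_reward

# Hypothetical actions with hidden success rates of 20%, 50% and 80%.
estimates, counts, total_reward = run_bandit([0.2, 0.5, 0.8])
best_action = estimates.index(max(estimates))
print(best_action, counts, total_reward)
```

The algorithm is never told which action is best; it discovers this from the grades alone, which is the distinguishing feature of this family.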

Considerations for Picking an Algorithm

When choosing an algorithm, one should consider prediction accuracy, model training time, and the number of explanatory variables and how it affects training time. Moreover, one must consider applicability: whether the model's assumptions, such as linearity, fit the data at hand.
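One quick way to probe the linearity assumption is to fit a straight line and look at how much of the variation it explains. The sketch below does this for two synthetic data sets (both assumptions for illustration), one truly linear and one curved; the function name is hypothetical. It is only a rough check, and a low value says that a straight line is a poor fit, not which model would be a good one.

```python
def r_squared_of_line(xs, ys):
    """Fit y = a + b*x by least squares and return R^2 on the same data."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

xs = [x / 10 for x in range(-50, 51)]
linear_ys = [2 * x + 1 for x in xs]  # a linear relationship
curved_ys = [x ** 2 for x in xs]     # a symmetric, non-linear relationship

r2_linear = r_squared_of_line(xs, linear_ys)
r2_curved = r_squared_of_line(xs, curved_ys)
print(round(r2_linear, 3), round(r2_curved, 3))
```

The curved data has a perfectly regular pattern, yet a linear model explains almost none of it, which is why the fit between the model's assumptions and the data deserves a check before training time is spent.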

In summary, selecting the most appropriate model depends heavily on the business problem that one wishes to solve, and on knowing the advantages and disadvantages of each model.
Even the best data scientists consider a number of models before ultimately picking one, in order to get the best results.
A deep understanding of the structure and type of the data is crucial for selecting the correct model, and one must invest time and effort in preparation and research before running models, in order to be time efficient and not waste resources on a model which is not relevant for the available data.

Shalom Dinur,

Senior Data Scientist,

Zvika Yaron, VP Sales,

DataCube.