Getting to the right model in machine learning and statistical modeling is a process of discovery. An analyst or scientist does not know a priori the perfect combination of algorithms that will yield the best possible model, even when the task and problem are well understood. It is an iterative and incremental process of exploration.

Typically a data scientist will start with an initial guess using their best judgement – a mix of industry best practices and tacit knowledge from personal experience – to come up with the algorithms for each stage of a machine learning pipeline:

- feature generation – which transforms raw data into features/signals for the model
- feature normalization or transformations – clean up, center or rescale
- feature selection – keeping the best mix of features to battle the curse of dimensionality
- modeling algorithm – the actual model for regression or classification
- evaluation – on a held out test set to see how well the pipeline works

Then begins an iterative process of exploration, looking for improvements across different combinations of algorithms and parameter settings. Even on a small dataset, the number of possible combinations can grow dramatically.

For example, say we had 6 different feature generation algorithms, 2 normalization options, 2 feature selection algorithms and 7 modeling algorithms. That gives at least 6x2x2x7 = 168 combinations, without even considering the possible hyper-parameters for each algorithm. Now imagine evaluating them across multiple data partitions for k-fold evaluation where k=10: that is 1,680 runs. If we were working with time-series datasets where a model was trained weekly, we would evaluate the stability of each pipeline across each of the 52 weeks of the year over 2 years, giving us 168x104 = 17,472 combinations. A daily model evaluated over the same period would mean 168x730 = 122,640 combinations. Clearly this quickly becomes a Big Compute problem.
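The arithmetic above can be checked with a few lines of Python. This is only a sketch: the algorithm names are hypothetical placeholders, and only the counts (6, 2, 2, 7) and the evaluation schemes come from the example.

```python
from itertools import product

# Hypothetical placeholder names; only the counts (6, 2, 2, 7) come from the text.
feature_generators = [f"featgen_{i}" for i in range(6)]
normalizers = ["norm_a", "norm_b"]
selectors = ["selector_a", "selector_b"]
models = [f"model_{i}" for i in range(7)]

# Every pipeline is one choice from each stage.
pipelines = list(product(feature_generators, normalizers, selectors, models))
n_pipelines = len(pipelines)            # 6 * 2 * 2 * 7

kfold_runs = n_pipelines * 10           # 10-fold evaluation
weekly_runs = n_pipelines * 52 * 2      # one test period per week for 2 years
daily_runs = n_pipelines * 365 * 2      # one test period per day for 2 years

print(n_pipelines, kfold_runs, weekly_runs, daily_runs)
# 168 1680 17472 122640
```

Adding even a handful of hyper-parameter settings per algorithm multiplies these totals again, which is why the search space outgrows a single machine so quickly.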

This is also an embarrassingly parallel problem and lends itself well to Spark/Hadoop environments. Even if the datasets are small, distributing the thousands of modeling combinations across a cluster of machines can dramatically cut the time a scientist has to spend legitimately slacking off while waiting for results.
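To show what "embarrassingly parallel" means here, the sketch below maps an evaluation function over every combination using Python's standard-library thread pool as a local stand-in for a cluster. The `evaluate` function and its fake score are hypothetical placeholders, not the code used in the project; a real version would fit and test a model for each combination.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product
import zlib

def evaluate(combo):
    """Hypothetical stand-in for fitting and scoring one pipeline on one test period."""
    key = "|".join(map(str, combo)).encode()
    # Deterministic fake score in [0, 1); a real version would train and test a model here.
    return combo, (zlib.crc32(key) % 1000) / 1000.0

# Placeholder search space: 6 x 2 x 2 x 7 pipelines x 104 weekly test periods.
combos = list(product(range(6), range(2), range(2), range(7), range(104)))

# Each combination is independent of the others, so the work can be mapped
# over workers with no coordination between tasks.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(evaluate, combos))

best_combo, best_score = max(results, key=lambda r: r[1])
print(len(results), best_combo, best_score)
```

Threads are used only to keep the example self-contained; they will not speed up CPU-bound model fitting in Python. On a Spark cluster the same shape would be roughly `sc.parallelize(combos).map(evaluate).collect()`, assuming an existing `SparkContext` named `sc`.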

Recently, this is exactly what we did for a customer. My team at Oracle helps customers from all industries realize the value that Big Data & Analytics platforms can bring to their organizations, by engaging in pilots and proofs-of-concept. This PoC for a leading North American commodity producer focused on improving their price forecasting capabilities. A more accurate price prediction means a better opportunity to make sell-vs-hold decisions. They wanted to use the data they had (weekly commodity price, daily international exchange rates, monthly economic data) and data that they didn’t have (hourly weather) to see if this would lead to more accurate predictions. Given the short 3-week sprint, our intent was to help their analysts become more efficient going forward – i.e. to scale their capacity for experimentation.

The figure above shows the results from evaluating each of the **17,472** combinations for a classification-based approach that simply predicts whether or not the price will go up in the following week, across a 2-year period. Each dot represents a single combination of a machine learning pipeline = [feature generation, feature normalization, feature selection, modeling algorithm, test set]. The color denotes the modeling algorithm for that run. Formulating the problem as a 2-class classification problem helps when dealing with a rather noisy target, because it avoids fitting the exact price too closely, and it is also a good way to limit data/modeling biases. A similar approach was then used to explore a further **22,464** combinations of models that predicted the actual price.
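As a rough illustration of the 2-class formulation, the sketch below turns a weekly price series into "up or not" labels. The function name and the prices are made up for the example, and treating an unchanged price as "not up" is an assumption.

```python
def up_or_not_labels(prices):
    """Label each week 1 if the following week's price is higher, else 0."""
    return [1 if nxt > cur else 0 for cur, nxt in zip(prices, prices[1:])]

# Made-up weekly prices, for illustration only.
weekly_prices = [100.0, 102.5, 101.0, 101.0, 104.2]
labels = up_or_not_labels(weekly_prices)
print(labels)  # [1, 0, 0, 1]
```

The last week has no label because its following week is not yet known, so a series of n prices yields n-1 training targets.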

The search found a better algorithm (in red below) that predicted the commodity price within +/-5% of the actual price 73% of the time, compared to 40% for the algorithm the customer uses (in blue below). The figure below shows the narrow range of error for the newer algorithm compared to the existing one, which over- and under-predicts the actual price by up to 20%.
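A "within +/-5% of the actual price" hit rate can be computed along these lines. This is a minimal sketch with made-up numbers, not the customer's data, and the function name is hypothetical.

```python
def within_tolerance_rate(actual, predicted, tolerance=0.05):
    """Fraction of predictions whose error is within `tolerance` of the actual value."""
    hits = sum(abs(p - a) <= tolerance * abs(a) for a, p in zip(actual, predicted))
    return hits / len(actual)

# Made-up prices, for illustration only.
actual = [100.0, 110.0, 120.0, 90.0]
predicted = [103.0, 118.0, 119.0, 80.0]
print(within_tolerance_rate(actual, predicted))  # 0.5
```

Here two of the four predictions fall inside the 5% band, so the rate is 0.5; the 73% and 40% figures above are the same kind of measure over the 2-year evaluation period.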

This shotgun approach may not appeal to machine learning purists, but it is a great way to quickly zero in on the set of combinations that consistently perform well and eliminate the combinations that add little or no value.

Big Data technologies such as Spark/Hadoop are also Big Compute technologies: they scale the number of experiments a scientist can run, making them more efficient and allowing them to explore wider and deeper than they otherwise could. In this particular case they helped identify a new algorithm that improved the accuracy of the price forecast, which has a direct impact on the bottom line of any commodity producer.

That was an insightful read. The ‘big data’ problems we are trying to solve pale in comparison, in both scope and depth.

The tech developed for Big Data is just as relevant and useful for not-so-big data. Just like in this particular case, the final cleaned-up datasets often end up being relatively small, but the need for experimentation and the potential benefit are much bigger. 🙂