Oracle’s Big Data & Analytics Platform for Data Scientists

I work for Oracle, helping businesses realize the potential of data science, big data and machine learning to grow their revenues, minimize their costs and expand new opportunities to leap-frog their competition. Which means working with some amazing folks from different parts of businesses and across industries. Invariably I’m asked – So, what does Oracle offer in the space of data science and machine learning on Big Data?

Lets leave aside the machine learning and optimization solutions embedded within different Oracle products, and just focus on the platform pieces for Big Data and Analytics for today. Lets also ignore for the moment the data management questions around security, encryption and integration that are important and chiefly the concerns of the IT department. Lets focus only on what it offers for data analysts and data scientists.

Oracle’s Big Data & Analytics Platform enables data science and machine learning at scale by taking the best that open-source offers, putting it together as an engineered solution and adding capabilities and features where open-source falls short.

Oracle Big Data Cloud Service (BDCS) is essentially Hadoop/Spark in a “Box” (or rather a number of dedicated cloud based machines connected with a 40Gb/sec InfiniBand fabric making network IO between cluster nodes very fast). It runs Cloudera Enterprise version of Hadoop with engineered hardware optimized for speeding up the analytics. Analysts can use Python, R, Scala and Java for data manipulation, analytics and machine learning using open source libraries such as SparkML. Python users such as myself can use open-source libraries (e.g. numpy, scipy, pandas, scikit-learn, seaborn, folium) inside Jupyter notebooks via PySpark kernel for operating on distributed datasets.

Out of the box, R users don’t get all the benefits of SparkML, so Oracle R Advanced Analytics for Hadoop (ORAAH) addresses that gap, giving R users access to SparkML implementations of machine learning algorithms. In addition ORAAH’s own implementation of Linear Regression, Generalized Linear Models and Neural Networks are faster and more efficient than the open-source implementations within SparkML. In experiments run by Marcos Arancibia‘s team, ORAAH’s LM model training was 6x-32x faster than SparkMLlib. Similarly GLM models trained by ORAAH were 4x -15x faster than SparkMLlib. More importantly, ORAAH continues to scale  linearly despite memory constraints, where as SparkMLlib just fails.

oraah_vs_spark_ml.png

ORAAH is available for on-premise (BDA) and the cloud (BDCS).

But not everyone can code or should have to code to transform and explore data in Hadoop. Oracle Big Data Discovery (BDD) provides “citizen data-scientists” and data analysts with interactive way to find, transform and visually discover patterns or relationship within the data stored in Hadoop. It works by keeping a sample of the data in Hadoop in-memory, automatically generating graphs that describe the shape of that attribute, and allows users to interactively manipulate that data.

Once the analyst is comfortable with the transformations, he or she can apply them to the full dataset with a click of a button. It is a very nice tool for data analysts and data scientists alike in preparing a dataset before switching to Jupyter or RStudio to use the distributed machine learning algorithms in Spark or ORAAH.

Data isn’t always in a tabular form, nor does it make sense to analyze it that way. The Spatial component of Oracle Big Data Spatial & Graph (BDSG)  scales up the analysis of images and geo-spatial data using Hadoop and OpenCV with a Java interface. Just last week I finalized a patent application on a method to automate the alignment and analysis of aerial and satellite imagery to known structures, that I had prototyped earlier this year. For one potential customer wanting to scale their operations to cover 100,000 acres of agricultural properties no longer requires them to hire a team of 40 GIS specialists and making them work round to clock to keep up with the volume of imagery expected each week.

The Graph component of BDSG provides an in-memory graph engine and algorithms for fast property graph analysis. The in-memory graph engine can handle 20-30 billion edge graphs on a single node, scale out to multiple nodes as it expands beyond the limits of a single node, and perform 10-50x faster than other graph engines for finding communities, optimal paths and even product recommendations.

Analysts that have been using Oracle Advance Analytics (OAA) as part of the Oracle Databases to train machine learning models within the database or using R, can continue to use the same interface while bringing in data from Hadoop or NoSQL Databases via Oracle Big Data SQL. Big Data SQL pushes the predicate (i.e. query processing and filtering) to Hadoop or NoSQL Databases and pulls across only the smaller filtered dataset to the relational Database. This allows analysts to user SQL, Oracle Data Miner or R, while manipulating and joining datasets in Hadoop and Database.

Once the analysis is done, now comes time to tell the story. Oracle Data Visualization (DV) is an interactive data visualization and presentation platform as a desktop application or a cloud service, letting the business intelligence, analysts and scientists reveal the story hidden within the data visually.

There are also a number of things that have been announced at Oracle Open World 2016, and coming soon. One of the most exciting for data science is the Big Data Cloud Service – Compute Edition (BDCS-CE). It is an on-demand elastic compute Hadoop/Spark cluster, allowing data scientists to spin up clusters as needed, scaling it up as needed and tear them down afterwards. For an analysts perspective, it is a perfect environment to sandbox ad-hoc queries and experimentation, before operationalizing these as analytics pipelines. There is also the Event Hub Cloud Service that provides a Kafka-based streaming data platform.

Want to know more or talk about how the Oracle Big Data & Analytics team can help your business objectives? Connect with me on LinkedIn and follow me on Twitter

Big Data is also Big Compute

Machine learning and statistical modeling to get to the right model is a process of discovery. Analyst or scientist don’t know a priori the perfect algorithmic combinations that would yield the best possible model, even if the task or problem is well understood. It is an iterative and incremental process of exploration and discovery.

Typically a data scientist will start with an initial guess using their best judgment – a mix of industry best standards and tacit knowledge from personal experiences – to come up with algorithms for a machine learning pipeline:

  1. feature generation – which transforms raw data into features/signals for the model
  2. feature normalization or transformations – clean up, center or rescale
  3. feature selection – keeping the best mix of features battling the curse of dimensionality
  4. modeling algorithm – the actual model for regression or classification
  5. evaluation – on a held out test set to see how well the pipeline works

Then begins an iterative process of exploration, looking for improvements across the different combination of algorithms and parameter settings. Even on a small dataset, the number of possible combinations can grow dramatically.

For example, say if I had 6 different feature generation algorithms, 2 normalization options, 2 feature selection algorithms and 7 modeling algorithms, we have at least 6x2x2x7 = 168 combinations without even considering the possible hyper-parameters for each of the algorithms. Now imagine evaluating it across multiple data partitions for k-fold evaluation where k=10, it gives us 1,680 combinations. If we were working with time-series datasets where a model was trained weekly, we would evaluate the stability of the pipeline across each of the 52 weeks of the year across 2 years, giving us 17,472 combinations. A daily model evaluated the same time period would mean 122,640 combinations. Clearly, this quickly becomes a Big Compute problem.

This also an embarrassingly parallel problem and lends well to Spark/Hadoop environments. Even if the datasets are small, distributing the thousands of modeling combinations across a cluster of machines can dramatically speed up the time a scientist has to spend legitimately slacking off.

Recently, this is exactly what we did for a customer. My team at Oracle helps customer from all industries realize the value Big Data & Analytics platforms can bring to their organization, by engaging in pilots and proof-of-concepts. This PoC for a leading North American commodity producer focused on improving their price forecasting capabilities. A more accurate price prediction means better opportunity to make sell-vs-hold decisions. They wanted to use the data they had (weekly commodity price, daily international exchange rates, monthly economic data) and data that they didn’t have (hourly weather) to see if this would lead to more accurate predictions. Given the short 3-week sprint, our intent was to help their analysts become more efficient going forward – i.e. scale their capacity for experimentation.

testing_17k_models_using_spark

The figure above shows the results from evaluating each of the 17,472 combinations for a classification based approach that would simply predict if the price will go up or not in the following week, across a 2-year period. Each dot represents a single combination of a machine learning pipeline = [feature generation, feature normalization, feature selection, modeling algorithm, test set]. The color denotes the modeling algorithm for that run. Formulating the problem as a 2-class classification problem helps when dealing with a rather noisy target, by not trying to fit the exact price too closely and also a great way to data/modeling biases. A similar approach was then used to explore a further 22,464 combination of models that predicted the actual price.

The search found a better algorithm (in red below)  that predicted the commodity price within +/-5% of the actual price 73% of the time, compared to 40% for the algorithm the customer uses (in blue below).  The figure below shows the narrow range of error for the newer algorithm compared the existing one, which predicts prices over and under the actual price by up to 20%.

Price Error Bounds

This shotgun approach may not appeal to the machine learning purists, but it is a great way to quickly zero in on the set of combinations that consistently perform well and eliminate the combinations that added little or no value.

Big Data technologies such as Spark/Hadoop are also Big Compute technologies to scale the number of experiments that a scientist can run, making them more efficient, allowing them to explore wider and deeper than they could otherwise. In this particular case, it helped to identify a new algorithm to improve the accuracy of the price forecast, which has a direct impact on the bottom line of any commodity producer.