Impactful Data Scientists

In 2012, Davenport and Patil’s Harvard Business Review article, Data Scientist: The Sexiest Job of the 21st Century, raised the profile of a profession that had been naturally evolving in the modern computing era – an era where data and computing resources are more abundantly and cheaply available than ever before. There was also a shift towards industry leaders adopting a more open and evidence-based approach to guiding the growth of their businesses. Brilliant data scientists with machine learning and artificial intelligence expertise are invaluable in supporting this new normal.

While there are different opinions on what defines a data scientist, as the leader of the Data Science Practice at Think Big Analytics, the consulting arm of Teradata, I expect the data scientists on my team to embody specific characteristics. This expectation is founded on a simple question – are you having a measurable and meaningful impact on the business outcome?

Any data scientist can dig into data, use statistical techniques to find insights and make recommendations for their business partners to consider. A good data scientist makes sure that the business adopts those insights and recommendations by focusing on the problems that are important to the company and making a compelling case grounded in business value. An impactful data scientist can iterate quickly, address a wide variety of business problems for the organization and deliver meaningful business impact swiftly by using automation and getting their insights integrated into production systems. Consequently, impactful data scientists more often answer ‘yes‘ to the question above.

So what makes a data scientist impactful? In my experience, they possess skillsets that I broadly characterize as those of a scientist, a programmer, and an effective communicator. Let us look at each of these in turn.

what_is_a_data_scientist_2.png

Firstly, they are scientists. Data scientists work in highly ambiguous situations and operate on the edge of uncertainty. Not only are they trying to answer the question, they often have to determine what the question is in the first place. They have to ask vital questions to understand the context quickly, identify the root of the problem that is worth solving, research and explore the myriad possible approaches and, most of all, manage the risk and impact of failure. If you are a scientist or have undertaken research projects, you will recognize these as traits of a scientist immediately.

In addition, data scientists are also programmers. Traditional mathematicians, statisticians, and analysts who are comfortable using GUI-driven analytical workbenches that allow them to import data and build models with a few clicks often contest this expectation. They argue that they don’t need computer science skills since they are supported by (a) a team of data engineers to find and cleanse their data, and (b) software engineers to take their models and operationalize them by re-writing them for the production environment. However, what happens when the data engineers are busy, or the sprint backlog of the IT department means the model that a data scientist has just found to make the company millions won’t make it to production for the next 6-9 months? They wait, and their amazing insights have no impact on the business.

Programming and computer science skills are essential for data scientists so that they are not ‘blocked’ by organizational constraints. A data scientist shouldn’t have to wait for someone else to find and wrangle the data they need, nor be afraid of getting their hands dirty with the code to ensure their models make it to production. It also means that data scientists do not become a bottleneck for their organization, because they automate their solutions for production or automatic reports. The highly distributed, large-volume transactions in online, mobile and IoT applications mean that data scientists need to design their solutions for scale. For example, will their real-time personalization model scale to the 100,000 requests per second for their company’s website and mobile app?

Finally, a data scientist should be an effective 2-way communicator. Not only should they empathize to understand the business context and customer needs, but they should also convey the value of their work in a manner that appeals to that audience. One of the hardest skills to master for some knowledgeable data scientists is the ability to influence organizations without authority. A data scientist who goes around asserting that everyone should listen to them because they have data and insights, without cultivating trust, is likely to earn the title of a prima donna and not achieve the impact that those insights could have. Effective communication is relatable, precise and concise.

Data scientists with these three broad skillsets are in an excellent position to have a meaningful and measurable impact on business outcomes, making them highly valuable to any organization. Of course, this list doesn’t talk about innate abilities like creativity, bias for action and a sense of ownership. Neither does it consider the organizational culture that may either support or hinder their impact. I have focused on skills that can be developed through training and practice. In fact, these are essential elements of the growth and career paths for my team of brilliant and impactful data scientists at Think Big Analytics.


AI & ML – Lessons learnt and real-world challenges

Just before I flew back to Seattle, I gave a talk last week at my alma mater – School of Computer Science & Engineering at UNSW, Australia. It was great to see some familiar faces and meet some new ones that I hope feel more compelled to tackle some interesting problems in data science, machine learning (ML) and artificial intelligence (AI).

In this talk, I shared some of the personal lessons that I learnt as part of building AI & ML solutions at companies like Amazon and Oracle. I also opened up about my fears of these technologies, as well as the challenges that the industry faces in delivering intelligent systems for the 99% (?) of businesses. You can find the slides from the talk (PDF) for the references and links that I mentioned. Just send an email to ( avishkar @ gmail dot com) with the subject “AI & ML” to get the password to the PDF.

The most important message that I wanted to impart to the room full of researchers, academics, and industry practitioners was a question: how do we collectively address the shortage of skills needed to develop AI and ML solutions for the broad range of business problems beyond the top 1% of leading-edge tech companies? Education, standards and automated tools can help ensure a certain base level of competency in the application of AI & ML.

AddressingSkillsShortage.jpg

The vast majority of the businesses out there are not Google, Amazon or Facebook, with deep pockets and years of R&D experience to tackle the challenge of applying AI and ML. Everyone responsible for growing this field, from schools (i.e. universities) to industry, must also develop standards and tools that ensure a certain level of quality is maintained for the solutions that we put into production. We have had standards in mechanical and civil engineering to ensure that things that can impact people’s lives and safety adhere to a certain level of quality. Similarly, we should develop standards and encourage organizations to validate compliance with those standards when it comes to developing AI & ML solutions with far-reaching consequences.

BiasedDataBiasedModels.jpg

A simple and very personal example was that one of my own photos was rejected by the automated checks to verify that a passport photo complies with the requirements for visas. The fact that the slightly “browner” version of me (left) failed the check seems to suggest an inherent bias in the system due to the kind of data used to build the system. Funny but scary. How many other “brown” people have had their photos rejected by such a system?

Other examples would be Human Resource systems that identify potential candidates, suggest hire/no-hire decisions or recommend salary packages for new hires. If the system is trained on historical data and uses gender as a feature, is it possible that the system could be biased against women for high-profile or senior positions? After all, historically women have been under-represented in senior positions. Standards and compliance verification tools can help us identify such biases, ensuring that data and models do not introduce biases that are unacceptable in a modern and equitable society.

Academics, researchers, and industry practitioners cannot absolve themselves of the duty of care and consideration when developing systems that have a broad social impact. Data scientists must think beyond the accuracy metric and consider the whole ecosystem in which the system operates.

Image Credit:

  • Modeling API by H Alberto Gongora from the Noun Project
  • education by Rockicon from the Noun Project
  • tools by Aleksandr Vector from the Noun Project
  • Checklist by Ralf Schmitzer from the Noun Project

Plant Science Initiative @ NC State University

In my role at Oracle, I get to work across many industries on some very interesting problems. One that I have been involved with recently is the collaboration between North Carolina (NC) State University and Oracle with NC State’s Plant Science Initiative.

In particular, we’ve been working with the College of Agriculture and Life Sciences (CALS) to launch a big data project that focuses on sweet potatoes. The goal is to help geneticists, plant scientists, farmers and industry partners in the sweet potato industry to develop better varieties of sweet potatoes, as well as speed up the pace at which research is commercialized. The big question is: can we use the power of Big Data, Machine Learning, and Cloud computing to reduce the time it takes to develop and commercialize a new variety of sweet potato crop from 10 years to three or four years?

One of the well-known secrets to driving innovation is scaling and speeding up experimentation cycles. In addition, reducing the friction associated with collaborative research and development can help bring research to market more quickly.

My team is helping the CALS group to develop engagement models that facilitate interdisciplinary collaboration using the Oracle Cloud. Consider geneticists, plant science researchers, farmers, packers, and distributors of sweet potato being able to contribute their data and insights to optimize different aspects of the sweet potato production – sweet potato from the genetic sequence to the dinner plate.

I am extremely excited by the potential impact that open collaboration between various stakeholders can have on the sweet potato and precision agriculture industries.

More details at cals.ncsu.edu

It is a go for Amazon Go!

The super secret exciting project that I spent days and nights slogging over when I was at Amazon has finally been announced – Amazon Go. A checkout-less, cashier-less magical shopping experience in a physical store. Check out the video to get a sense of the shopping experience that simplifies the CX around the shopping experience. Walk in, pick up what you need and walk out. No line, no waiting, no registers.

I’m very proud of an awesome team of scientists & engineers covering software, hardware, electrical and optics that rallied together to build an awesome solution combining machine learning, computer vision, deep learning and sensor fusion. The project was an exercise in iterative experimentation and continual learning, refining all aspects of the hardware and software as well as the innovative vision algorithms. I was personally involved in 5 different prototypes and in the winning solution that ticked all the boxes more than 2 years ago.

I remember watching Jeff Bezos and the senior leadership at Amazon, playing with the system by picking and returning the items back to the shelves. Smiles and high-fives all around as the products were added and removed from the shopper’s virtual cart, with the correct quantity of each item.

Needless to say, there is significant effort after the initial R&D is done to move something like this to production, so it is not surprising that it has taken 2 years since then to get it ready for the public. Well done to my friends at Amazon for getting the engineering solution over the line to an actual store launch set for early 2017.

Photo Credit: Original Image by USDA – Flickr

 

Oracle’s Big Data & Analytics Platform for Data Scientists

I work for Oracle, helping businesses realize the potential of data science, big data and machine learning to grow their revenues, minimize their costs and open up new opportunities to leap-frog their competition. That means working with some amazing folks from different parts of businesses and across industries. Invariably I’m asked – So, what does Oracle offer in the space of data science and machine learning on Big Data?

Let’s leave aside the machine learning and optimization solutions embedded within different Oracle products, and just focus on the platform pieces for Big Data and Analytics for today. Let’s also ignore for the moment the data management questions around security, encryption and integration that are important and chiefly the concerns of the IT department. Let’s focus only on what it offers for data analysts and data scientists.

Oracle’s Big Data & Analytics Platform enables data science and machine learning at scale by taking the best that open-source offers, putting it together as an engineered solution and adding capabilities and features where open-source falls short.

Oracle Big Data Cloud Service (BDCS) is essentially Hadoop/Spark in a “Box” (or rather a number of dedicated cloud-based machines connected with a 40Gb/sec InfiniBand fabric, making network IO between cluster nodes very fast). It runs the Cloudera Enterprise version of Hadoop on engineered hardware optimized for speeding up analytics. Analysts can use Python, R, Scala and Java for data manipulation, analytics and machine learning using open-source libraries such as SparkML. Python users such as myself can use open-source libraries (e.g. numpy, scipy, pandas, scikit-learn, seaborn, folium) inside Jupyter notebooks via the PySpark kernel for operating on distributed datasets.

Out of the box, R users don’t get all the benefits of SparkML, so Oracle R Advanced Analytics for Hadoop (ORAAH) addresses that gap, giving R users access to SparkML implementations of machine learning algorithms. In addition, ORAAH’s own implementations of Linear Regression, Generalized Linear Models and Neural Networks are faster and more efficient than the open-source implementations within SparkML. In experiments run by Marcos Arancibia’s team, ORAAH’s LM model training was 6x-32x faster than SparkMLlib. Similarly, GLM models trained by ORAAH were 4x-15x faster than SparkMLlib. More importantly, ORAAH continues to scale linearly despite memory constraints, whereas SparkMLlib just fails.

oraah_vs_spark_ml.png

ORAAH is available for on-premise (BDA) and the cloud (BDCS).

But not everyone can code or should have to code to transform and explore data in Hadoop. Oracle Big Data Discovery (BDD) provides “citizen data-scientists” and data analysts with an interactive way to find, transform and visually discover patterns or relationships within the data stored in Hadoop. It works by keeping a sample of the data in Hadoop in-memory, automatically generating graphs that describe the shape of each attribute, and allowing users to interactively manipulate that data.

Once the analyst is comfortable with the transformations, he or she can apply them to the full dataset with a click of a button. It is a very nice tool for data analysts and data scientists alike in preparing a dataset before switching to Jupyter or RStudio to use the distributed machine learning algorithms in Spark or ORAAH.

Data isn’t always in a tabular form, nor does it always make sense to analyze it that way. The Spatial component of Oracle Big Data Spatial & Graph (BDSG) scales up the analysis of images and geo-spatial data using Hadoop and OpenCV with a Java interface. Just last week I finalized a patent application on a method to automate the alignment and analysis of aerial and satellite imagery to known structures, which I had prototyped earlier this year. For one potential customer wanting to scale their operations to cover 100,000 acres of agricultural properties, this means they no longer need to hire a team of 40 GIS specialists and make them work round the clock to keep up with the volume of imagery expected each week.

The Graph component of BDSG provides an in-memory graph engine and algorithms for fast property graph analysis. The in-memory graph engine can handle 20-30 billion edge graphs on a single node, scale out to multiple nodes as it expands beyond the limits of a single node, and perform 10-50x faster than other graph engines for finding communities, optimal paths and even product recommendations.

Analysts who have been using Oracle Advanced Analytics (OAA) as part of the Oracle Database to train machine learning models within the database or using R can continue to use the same interface while bringing in data from Hadoop or NoSQL databases via Oracle Big Data SQL. Big Data SQL pushes the predicate (i.e. query processing and filtering) to Hadoop or the NoSQL database and pulls across only the smaller filtered dataset to the relational database. This allows analysts to use SQL, Oracle Data Miner or R, while manipulating and joining datasets in Hadoop and the database.

Once the analysis is done, now comes time to tell the story. Oracle Data Visualization (DV) is an interactive data visualization and presentation platform as a desktop application or a cloud service, letting the business intelligence, analysts and scientists reveal the story hidden within the data visually.

There are also a number of things that were announced at Oracle Open World 2016 and are coming soon. One of the most exciting for data science is the Big Data Cloud Service – Compute Edition (BDCS-CE). It is an on-demand elastic compute Hadoop/Spark cluster, allowing data scientists to spin up clusters as needed, scale them up as needed and tear them down afterwards. From an analyst’s perspective, it is a perfect environment to sandbox ad-hoc queries and experimentation, before operationalizing these as analytics pipelines. There is also the Event Hub Cloud Service that provides a Kafka-based streaming data platform.

Want to know more or talk about how the Oracle Big Data & Analytics team can help your business objectives? Connect with me on LinkedIn and follow me on Twitter

Deep Learning – expectations & opportunities

A recent post on using DSSTNE (a deep learning library that I had a minor hand in) for training a simple movie recommender, sparked off some interesting conversations around expectations we have of Deep Learning. It can basically be summed up as – Is Deep Learning the path to Artificial Intelligence or will it be a one-hit wonder liable to fall out of fashion quickly?

Having developed actual production systems using machine learning and deep learning, I want to set expectations for deep learning and highlight opportunities that should not be ignored.

If you want the truth to stand clear before you, never be for or against. The struggle between “for” and “against” is the mind’s worst disease. – Seng-ts’an, c. 700 C. E.

In case you haven’t heard, Deep Learning (aka neural networks) is making a comeback after the great winter of AI, thanks largely to the dropping cost of compute (e.g. GPUs) and easier development libraries (e.g. CUDA, Theano, Torch, Caffe, TensorFlow and DSSTNE). However, the biggest reason is the easy access to large volumes of data thanks to the internet and labeled data collection platforms like Amazon’s Mechanical Turk.

One such dataset is put together by ImageNet. The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is one of the biggest challenges in Computer Vision for the state of the art in image recognition and understanding. The New York Times wrote about it back in 2014, and again when Baidu was banned from the competition for breaking its rules. The challenge is to classify 1.28 million images belonging to 1,000 classes.

Enter Deep Learning

Deep Learning made a splash at ILSVRC in 2012, when Alex Krizhevsky, Ilya Sutskever & Geoffrey E. Hinton proposed a neural network with 5 convolutional layers that outperformed all of the non-neural network approaches in the ImageNet challenge. Their SuperVision entry based on a deep learning network (commonly referred to as AlexNet) won the competition with a 16.4% error rate, compared to the next best entry with an error rate of 26.2%. Since then Google, Facebook, Microsoft, Baidu, and others have aggressively researched deep learning. Last year Microsoft won the competition with an error rate of around 3.6% using a network with 152 layers.

In other applications of deep learning, Google saw a 49% drop in their speech recognition (i.e. transcription) errors using long short-term memory (LSTM) deep recurrent neural networks. PayPal uses deep learning for fraud detection and prevention (blog, video).

A Cautionary Tale

Clearly, deep learning has been very successful in solving some of the most challenging problems in AI. While we must approach it with a healthy dose of skepticism, we have to acknowledge the successes and explore the possibilities. The problem often comes when people throw deep learning at a problem without thinking through the problem.

It is no surprise that Amazon uses deep learning for recommendations since they have open sourced the engine and blogged about it. But it was not always like that. One of the challenges the personalization team faced when exploring deep learning was that the initial prototypes gave the same or worse performance when compared to traditional machine learning approaches used in the field of recommendations.

My biggest contribution to that team’s effort was to model the problem the right way. In this particular case, the right way was not the traditional way recommender systems have been thought of.

post_dl_everywhere_modeling_the_right_way.jpg

Even though both algorithms (A and B) used deep learning with a similar sized network, structure and training parameters, the approach I proposed and demonstrated saw a 6x improvement in precision for the top recommended item. I cannot share the details behind the formulation since Amazon didn’t allow external publication of that work, but if you have access, check out the video of the talk I gave at the Amazon Machine Learning Conference in 2015 😉

Simply throwing the data (and compute) at deep learning is not a good idea. You have to model and solve the problem in a manner appropriate for that specific problem.

Promise of Deep Learning

Deep Learning’s biggest promise is actually in learning latent features, i.e. representation learning, which makes the subsequent task of prediction easier. Getting the right features can make the learning and prediction part of the problem trivial. Scientists spend by far most of their time on the manual engineering of the right features using domain knowledge, experience and intuition, supported by standard feature selection and projection algorithms.

In deep learning techniques, neural networks jointly optimize the feature engineering, feature selection, and modeling steps – all at the same time. This opens up the opportunity for us to skip manual feature engineering, and let the machine discover the relative importance and non-linear interaction between the signals as they propagate through the network layers. In some setups, such as autoencoders, the network can learn the important layers of features without any labels at all – i.e. in an unsupervised way. This means we can start applying machine learning to domains where we have large volumes of unlabelled data or where acquiring labels is difficult/expensive.

There is also a lot of interest in transfer learning, where features are first learnt in a domain with a large amount of labeled data or in an unsupervised way. Once learnt the features are then fine-tuned for another related domain using much smaller datasets. But the practical reality for the moment is the same – deep learning requires a lot of data and computation.

No Free Lunch!

When I was doing my Ph.D. at UNSW, I often chatted with Achim Hoffmann who wrote this interesting perspective on the limitation of machine learning [ post-script file ] published in the European Conference on Artificial Intelligence back in 1990. The key element for me was this.

The results indicate a rather general point. Namely, that for any amount of information which should get acquired, people have to do the complete work. One may choose between writing complex programs and providing a program with a huge amount of input data. In any case, the work cannot be reduced essentially. The machine can only do what it is told to do. And it cannot be told to generate information by itself. … The results do not mean, that machine learning is completely purposeless. But they clearly show that one cannot expect any magic from machine learning.

Even though 25 years have passed since this paper was written, the underlying idea is still very relevant for setting our expectations of machine learning and deep learning. We can throw data and compute at deep learning, but it cannot magically get us the answer. We still need human experts and scientists to figure out how to apply deep learning appropriately, not to mention push the research boundaries of what deep learning is capable of. I think deep learning is a very promising field for exploration and worth the risk of investing experimentation resources in. We just need to be prepared to learn.

 

 

Deep Learning with DSSTNE

Recently I got a couple of EVGA GeForce GTX 1080s to keep my study nicely lit and warm when winter comes to Seattle. My interest in GPUs, though, is more for Deep Learning than for lighting and heating. Deep Learning is actively being explored for all kinds of machine learning applications since it offers the hope of automatic feature learning. In fact, a large number of Kaggle competition winners rely on Deep Learning methods to avoid any kind of hand-crafted feature engineering. Considering how computationally expensive Deep Learning training tends to be, GPUs are essential for doing anything meaningful in a reasonable amount of time.

As part of my job with the Big Data & Analytics Platform team at Oracle, I come across customers that need help with tackling some of these cutting-edge machine learning problems – from image understanding to speech recognition and even product recommendations. Part of the challenge is always simplifying the complexity: letting people focus on what they need to do, and hiding away what is important but not necessary for immediate focus.

Siraj Raval ( @sirajology ) had posted a really nice video earlier this year on how to build a movie recommender system using 10 lines of C++ code and DSSTNE (pronounced “destiny”), a deep learning library that my old team at Amazon built and open-sourced earlier this year.

Aside: DSSTNE does automagic model parallelism across multiple GPUs and is also very fast on sparse datasets. Scott Le Grand ( @scottlegrand ) who was the main creator of DSSTNE has reported DSSTNE to be almost 15x faster than TensorFlow in some cases.

  • Disclosure: Scott and I used to work together at Amazon for the personalization team that built DSSTNE. We no longer work for Amazon, so cannot speak to how it is being used inside Amazon.
  • Update: Check out this talk by Scott on DSSTNE at Data Science Summit 2016.

Back to Siraj’s movie recommender – although he does a great job, I think there are some very important points about the design of DSSTNE that are easily overlooked. DSSTNE has 3 important design elements:

  1. scale – to handle large datasets that won’t fit on a single GPU, and do that automatically
  2. speed – for faster experimentation cycles, allowing the scientists to be more efficient and scale the number of experiments they run
  3. simplicity – for non-experts to experiment, deploy and manage deep learning solutions into production

In this post, I’ll show how to build a movie recommender writing NO lines of C++ code. DSSTNE is largely configured through a Neural Network Layer Definition Language and 3 binaries – generateNetCDF, train & predict. It uses a JSON based config file to describe the network, the functions, and parameters to use when training the model. This approach makes it much easier for people to run the hyper-parameter search across different network structures without needing to write a single line of C++ code.

So let us get started by installing CUDA and cuDNN on my Ubuntu 16.04 machine.

CUDA & cuDNN

First the prerequisites for CUDA.

$ sudo add-apt-repository ppa:graphics-drivers/ppa
$ sudo apt-get update
$ sudo apt-get install nvidia-367
$ sudo apt-get install mesa-common-dev
$ sudo apt-get install freeglut3-dev

Download the local run binary from https://developer.nvidia.com/cuda-toolkit

Install the CUDA 8 library:

$ sudo ./cuda_8.0.27_linux.run --override

IMPORTANT: Make sure you DO NOT install the drivers included with the .run file. Keep the other options at their defaults and answer yes to everything else.

Set environment variables:

$ export PATH=/usr/local/cuda-8.0/bin${PATH:+:${PATH}}
$ export LD_LIBRARY_PATH=/usr/local/cuda-8.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}

At this point, you should be able to check that the graphics cards are recognized by the NVIDIA driver by running nvidia-smi.

$ nvidia-smi
Thu Aug 11 10:51:36 2016       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.35                 Driver Version: 367.35                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 0000:03:00.0     Off |                  N/A |
|  0%   31C    P8     7W / 180W |      1MiB /  8113MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1080    Off  | 0000:06:00.0      On |                  N/A |
|  0%   35C    P8     8W / 180W |    156MiB /  8110MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    1      4455    G   /usr/lib/xorg/Xorg                             106MiB |
|    1      5229    G   compiz                                          48MiB |
+-----------------------------------------------------------------------------+
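For scripts, the full table above is awkward to parse. A minimal sketch of a more compact check, assuming your nvidia-smi build supports the --query-gpu and --format=csv,noheader options; the helper name summarise_gpus is my own and not part of any NVIDIA tool:

```shell
# summarise_gpus: read CSV lines of the form "name, memory" on stdin
# and print one line per GPU followed by a count.
# (Hypothetical helper name, for illustration only.)
summarise_gpus() {
    awk -F', *' 'NF >= 2 { n++; printf "GPU %d: %s (%s)\n", n - 1, $1, $2 }
                 END { printf "%d GPU(s) found\n", n + 0 }'
}

# Only query the driver if nvidia-smi is actually installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=name,memory.total --format=csv,noheader | summarise_gpus
fi
```

On a machine like the one above, this should list both GTX 1080 cards, which is an easy condition to test before starting a long training run.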

To let CUDA build with a newer version of GCC, edit the header file whose version check aborts the build.

$ sudo nano /usr/local/cuda/include/host_config.h

Comment out the line which complains about the GCC version:

//#error -- unsupported GNU version! gcc versions later than 5.3 are not supported!
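If you would rather not edit the header by hand, the same change can be scripted. This is a rough sketch, assuming the #error line starts at the beginning of the line exactly as shown above; the function name comment_out_gcc_check is made up for this post:

```shell
# comment_out_gcc_check FILE: prefix the GCC version "#error" line with "//".
# A backup of the original is kept as FILE.bak.
# (Hypothetical helper; assumes the line begins with "#error -- unsupported GNU version!".)
comment_out_gcc_check() {
    sed -i.bak 's|^#error -- unsupported GNU version!|//&|' "$1"
}

# Example (needs root, because the header lives under /usr/local/cuda):
#   comment_out_gcc_check /usr/local/cuda/include/host_config.h
```

Running it a second time changes nothing, because the line no longer starts with #error. Keep in mind that this only silences the check: a GCC version that CUDA does not officially support can still fail later in the build.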

Compile the samples:

$ cd ~/NVIDIA_CUDA-8.0_Samples
$ make

Some of the samples still fail, but I’ll look into them later.

Get the cuDNN library from https://developer.nvidia.com/cudnn and follow the instructions to install it.

Now for DSSTNE (Destiny)

First, you need to install the pre-requisites for DSSTNE. I’ve put together a shell script that runs the steps documented here.

Then, make sure you have the paths set up correctly. I had something like this in my .bashrc.

# Add CUDA to the path
# Could use /usr/local/cuda/bin:${PATH} instead of explicit cuda8
export PATH=/usr/local/cuda-8.0/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-8.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}

# Add cuDNN library path
export LD_LIBRARY_PATH=/usr/local/cudnn-8.0/lib64:${LD_LIBRARY_PATH}

# Add OpenMPI to the path
export PATH=/usr/local/openmpi/bin:${PATH}

# Add the local libs to path as well
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/lib

Now to get, build and test DSSTNE:

$ git clone https://github.com/amznlabs/amazon-dsstne.git
$ cd amazon-dsstne/src/amazon/dsstne
$ make

This will build the binaries under amazon-dsstne/src/amazon/dsstne/bin for:

  • generateNetCDF – converts CSV text files into the NetCDF format used by DSSTNE
  • train – trains a network using input, output data and config file with network definition
  • predict – uses a pre-trained network to make predictions.

There is a nice example of training an auto-encoder based recommender for the MovieLens 20M dataset that comes with the code.

Download the data in the CSV/Text File format. If you have your own dataset, make sure it conforms to this data format.

$ wget https://s3-us-west-2.amazonaws.com/amazon-dsstne-samples/data/ml20m-all

Convert the text data into NetCDF format for the network input and the expected network output. This also builds the features and samples index files.

$ generateNetCDF -d gl_input -i ml20m-all -o gl_input.nc -f features_input -s samples_input -c
$ generateNetCDF -d gl_output -i ml20m-all -o gl_output.nc -f features_output -s samples_input -c

Train the network using the config for 30 epochs and batch size of 256. It will checkpoint and save the network every 10 epochs. Handy if you want to explore the network convergence by epochs.

$ train -c config.json -i gl_input.nc -o gl_output.nc -n gl.nc -b 256 -e 30

Once the training is complete, you can use the network and GPU to make batch-mode offline predictions in the original text format. The following command generates 10 movie recommendations for each user in the ml20m-all file (-r ml20m-all) and writes them to the recs file (-s recs). It also lets you mask or filter out movies that the user has already seen (-f ml20m-all).

$ predict -b 256 -d gl -i features_input -o features_output -k 10 -n gl.nc -f ml20m-all -s recs -r ml20m-all

That’s it. While it’s training you can use nvidia-smi to see which GPU it is running on, and how much memory it uses.

$ nvidia-smi
Thu Aug 11 12:11:58 2016       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.35                 Driver Version: 367.35                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 0000:03:00.0     Off |                  N/A |
|  0%   42C    P2    77W / 180W |    524MiB /  8113MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1080    Off  | 0000:06:00.0      On |                  N/A |
|  0%   36C    P8     8W / 180W |    170MiB /  8110MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      5523    C   train                                          521MiB |
|    1      4455    G   /usr/lib/xorg/Xorg                             106MiB |
|    1      5229    G   compiz                                          60MiB |
+-----------------------------------------------------------------------------+

Look ma, no C++ code!

The number of input nodes is automatically inferred from the input data file.

The number of output nodes is automatically inferred from the expected output data file.

Everything else is defined in config.json or in command-line flags (batch size and the number of epochs for training). The Neural Network Layer Definition Language describes everything that DSSTNE supports.

{
    "Version" : 0.7,
    "Name" : "AE",
    "Kind" : "FeedForward",

    "SparsenessPenalty" : {
        "p" : 0.5,
        "beta" : 2.0
    },

    "ShuffleIndices" : false,

    "Denoising" : {
        "p" : 0.2
    },

    "ScaledMarginalCrossEntropy" : {
        "oneTarget" : 1.0,
        "zeroTarget" : 0.0,
        "oneScale" : 1.0,
        "zeroScale" : 1.0
    },

    "Layers" : [
        { "Name" : "Input", "Kind" : "Input", "N" : "auto", "DataSet" : "gl_input", "Sparse" : true },
        { "Name" : "Hidden", "Kind" : "Hidden", "Type" : "FullyConnected", "N" : 128, "Activation" : "Sigmoid", "Sparse" : true },
        { "Name" : "Output", "Kind" : "Output", "Type" : "FullyConnected", "DataSet" : "gl_output", "N" : "auto", "Activation" : "Sigmoid", "Sparse" : true }
    ],

    "ErrorFunction" : "ScaledMarginalCrossEntropy"
}

Applying deep learning techniques to a problem such as recommendation typically means lots of experimentation exploring the different mix of the:

  1. types of input data and output targets – purchase or browsing history, rating, product attributes such as category, cost, and color, user attributes such as age or gender, etc.
  2. network structures – the number of layers, number of nodes per layer, connections between layers, etc.
  3. network and training parameters – learning rates, denoising, drop-outs, activation functions, etc.

How you pose the problem and prepare the dataset (#1) is VERY important in applying deep learning. If you pose the machine learning problem incorrectly, not even deep learning and a cloud full of GPUs can help you there.

But once you have that, figuring out the right network structure (#2) and training parameters (#3) can mean the difference between success and failure. That means running a lot of experiments, which is essentially a hyper-parameter search problem.

The JSON based config simplifies the hyper-parameter search problem. You can generate a large combination of these config files and try them out in parallel, quickly narrowing the options down to the configurations that are most suitable for that particular application.
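As a rough illustration of generating such config files, here is a minimal Python sketch. It copies the auto-encoder definition shown earlier as a base and varies only two settings that already appear in it: the hidden layer size and the denoising probability. The grid values and the naming scheme are my own assumptions for illustration, not recommendations and not part of DSSTNE.

```python
import copy
import itertools
import json

# Base config mirroring the auto-encoder definition shown earlier.
BASE_CONFIG = {
    "Version": 0.7,
    "Name": "AE",
    "Kind": "FeedForward",
    "SparsenessPenalty": {"p": 0.5, "beta": 2.0},
    "ShuffleIndices": False,
    "Denoising": {"p": 0.2},
    "ScaledMarginalCrossEntropy": {
        "oneTarget": 1.0, "zeroTarget": 0.0, "oneScale": 1.0, "zeroScale": 1.0,
    },
    "Layers": [
        {"Name": "Input", "Kind": "Input", "N": "auto",
         "DataSet": "gl_input", "Sparse": True},
        {"Name": "Hidden", "Kind": "Hidden", "Type": "FullyConnected", "N": 128,
         "Activation": "Sigmoid", "Sparse": True},
        {"Name": "Output", "Kind": "Output", "Type": "FullyConnected",
         "DataSet": "gl_output", "N": "auto", "Activation": "Sigmoid", "Sparse": True},
    ],
    "ErrorFunction": "ScaledMarginalCrossEntropy",
}

# Hypothetical search grid: illustrative values only.
HIDDEN_SIZES = [64, 128, 256]
DENOISING_P = [0.1, 0.2, 0.4]


def generate_configs(base, hidden_sizes, denoising_ps):
    """Yield (name, config) pairs, one per point on the grid."""
    for n, p in itertools.product(hidden_sizes, denoising_ps):
        config = copy.deepcopy(base)       # never mutate the base config
        config["Name"] = f"AE_h{n}_d{p}"   # hypothetical naming scheme
        config["Denoising"]["p"] = p
        config["Layers"][1]["N"] = n       # index 1 is the hidden layer
        yield config["Name"], config


configs = dict(generate_configs(BASE_CONFIG, HIDDEN_SIZES, DENOISING_P))
print(len(configs))  # 9 (3 hidden sizes x 3 denoising probabilities)

# Each entry serialises to JSON text that could be saved as its own config file.
example_json = json.dumps(configs["AE_h64_d0.1"], indent=4)
```

Each generated file could then be passed to train with the -c flag, one run per config, on whichever GPUs are free.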

Given this is still early days for Deep Learning, the speed and scale of experimentation has a huge bearing on what we learn about using Deep Learning. Plus I know my study will be warm for this coming winter.

Big Data is also Big Compute

Using machine learning and statistical modeling to get to the right model is a process of discovery. Analysts or scientists don’t know a priori the perfect algorithmic combinations that would yield the best possible model, even if the task or problem is well understood. It is an iterative and incremental process of exploration and discovery.

Typically a data scientist will start with an initial guess using their best judgment – a mix of industry best standards and tacit knowledge from personal experiences – to come up with algorithms for a machine learning pipeline:

  1. feature generation – which transforms raw data into features/signals for the model
  2. feature normalization or transformations – clean up, center or rescale
  3. feature selection – keeping the best mix of features battling the curse of dimensionality
  4. modeling algorithm – the actual model for regression or classification
  5. evaluation – on a held out test set to see how well the pipeline works

Then begins an iterative process of exploration, looking for improvements across the different combinations of algorithms and parameter settings. Even on a small dataset, the number of possible combinations can grow dramatically.

For example, say I had 6 different feature generation algorithms, 2 normalization options, 2 feature selection algorithms and 7 modeling algorithms. That gives at least 6x2x2x7 = 168 combinations, without even considering the possible hyper-parameters for each of the algorithms. Now imagine evaluating them across multiple data partitions for k-fold evaluation where k=10: that gives us 1,680 combinations. If we were working with time-series datasets where a model was trained weekly, we would evaluate the stability of the pipeline across each of the 52 weeks of the year across 2 years (104 weeks), giving us 17,472 combinations. A daily model evaluated over the same time period (730 days) would mean 122,640 combinations. Clearly, this quickly becomes a Big Compute problem.
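The arithmetic is easy to check with a few lines of Python. This is only a sketch of the counting. The option counts come from the example above, and the variable names are mine.

```python
import itertools

# Option counts taken from the example in the text.
feature_generators = 6
normalizers = 2
feature_selectors = 2
models = 7

# Enumerate every pipeline as a tuple of option indices.
pipelines = list(itertools.product(
    range(feature_generators),
    range(normalizers),
    range(feature_selectors),
    range(models),
))
n_pipelines = len(pipelines)

k_folds = 10
weeks = 52 * 2    # weekly model over 2 years
days = 365 * 2    # daily model over 2 years

print(n_pipelines)            # 168
print(n_pipelines * k_folds)  # 1680
print(n_pipelines * weeks)    # 17472
print(n_pipelines * days)     # 122640
```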

This is also an embarrassingly parallel problem and lends itself well to Spark/Hadoop environments. Even if the datasets are small, distributing the thousands of modeling combinations across a cluster of machines can dramatically cut down the time a scientist has to spend legitimately slacking off.
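To make the "embarrassingly parallel" point concrete, here is a minimal sketch that maps an evaluation function over every combination using a local thread pool from the Python standard library. The pool is only a stand-in for a cluster, the option names are placeholders, and evaluate() returns a seeded pseudo-random score instead of training anything, so all of it is an assumption for illustration.

```python
import itertools
import random
from concurrent.futures import ThreadPoolExecutor

# Hypothetical option names standing in for real algorithms.
FEATURE_GENERATORS = [f"gen{i}" for i in range(6)]
NORMALIZERS = ["none", "zscore"]
SELECTORS = ["all", "top_k"]
MODELS = [f"model{i}" for i in range(7)]


def evaluate(combo):
    """Stand-in for training and scoring one pipeline.

    Returns (combo, score). The score is a pseudo-random number seeded by
    the combo, so the sketch runs without any data and is repeatable.
    """
    rng = random.Random("|".join(combo))
    return combo, rng.random()


combos = list(itertools.product(FEATURE_GENERATORS, NORMALIZERS, SELECTORS, MODELS))

# No evaluation depends on another, so they can simply be mapped over workers.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(evaluate, combos))

best_combo, best_score = max(results, key=lambda r: r[1])
print(len(results))  # 168
print(best_combo, round(best_score, 3))
```

Because each call is independent, the same map could be handed to a cluster instead of a local pool. In PySpark, for instance, it could be expressed as sc.parallelize(combos).map(evaluate).collect(), where sc is an existing SparkContext.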

Recently, this is exactly what we did for a customer. My team at Oracle helps customers from all industries realize the value Big Data & Analytics platforms can bring to their organization, by engaging in pilots and proof-of-concepts. This PoC for a leading North American commodity producer focused on improving their price forecasting capabilities. A more accurate price prediction means a better opportunity to make sell-vs-hold decisions. They wanted to use the data they had (weekly commodity price, daily international exchange rates, monthly economic data) and data that they didn’t have (hourly weather) to see if this would lead to more accurate predictions. Given the short 3-week sprint, our intent was to help their analysts become more efficient going forward – i.e. scale their capacity for experimentation.

testing_17k_models_using_spark

The figure above shows the results from evaluating each of the 17,472 combinations for a classification-based approach that simply predicts whether the price will go up or not in the following week, across a 2-year period. Each dot represents a single combination of a machine learning pipeline = [feature generation, feature normalization, feature selection, modeling algorithm, test set]. The color denotes the modeling algorithm for that run. Formulating the problem as a 2-class classification problem helps when dealing with a rather noisy target, by not trying to fit the exact price too closely, and is also a great way to reduce data/modeling biases. A similar approach was then used to explore a further 22,464 combinations of models that predicted the actual price.
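The 2-class target can be derived from a price series in a couple of lines. This is a minimal sketch with made-up prices, not the customer's data or the exact labeling used in the PoC. A flat week is counted as "not up" here, which is my assumption.

```python
def up_down_labels(prices):
    """Return 1 where the next period's price is higher than the current one, else 0.

    The last price has no successor, so the result has len(prices) - 1 entries.
    """
    return [1 if nxt > cur else 0 for cur, nxt in zip(prices, prices[1:])]


# Made-up weekly prices, for illustration only.
weekly_prices = [100.0, 102.5, 101.0, 101.0, 104.2]
print(up_down_labels(weekly_prices))  # [1, 0, 0, 1]
```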

The search found a better algorithm (in red below) that predicted the commodity price within +/-5% of the actual price 73% of the time, compared to 40% for the algorithm the customer uses (in blue below). The figure below shows the narrow range of error for the newer algorithm compared to the existing one, which predicts prices over and under the actual price by up to 20%.

Price Error Bounds
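A "within +/-5% of the actual price" hit rate like the one quoted above can be computed as follows. This is a minimal sketch with made-up numbers. The function name and the sample values are assumptions, and the figures are not the PoC results.

```python
def within_tolerance_rate(actual, predicted, tolerance=0.05):
    """Fraction of predictions whose relative error is within +/- tolerance."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("actual and predicted must be non-empty and the same length")
    hits = sum(
        1 for a, p in zip(actual, predicted)
        if abs(p - a) <= tolerance * abs(a)
    )
    return hits / len(actual)


# Made-up prices, for illustration only.
actual = [100.0, 110.0, 120.0, 90.0]
predicted = [103.0, 118.0, 119.0, 80.0]
print(within_tolerance_rate(actual, predicted))  # 0.5
```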

This shotgun approach may not appeal to the machine learning purists, but it is a great way to quickly zero in on the set of combinations that consistently perform well and eliminate the combinations that added little or no value.

Big Data technologies such as Spark/Hadoop are also Big Compute technologies to scale the number of experiments that a scientist can run, making them more efficient, allowing them to explore wider and deeper than they could otherwise. In this particular case, it helped to identify a new algorithm to improve the accuracy of the price forecast, which has a direct impact on the bottom line of any commodity producer.