Hello, Think Big Analytics

A little over a month ago I left my role as the Chief Data Scientist for the Big Data & Analytics Platform Team at Oracle. It was sad to say goodbye to some wonderfully talented people whom I had the pleasure of working with, but change is an inevitable part of our lives. After enjoying a month off at my warmer and sunnier home in Sydney, spent with family and friends, I feel energized about what is next.

I am humbled and excited about my new role as the Practice Director – Data Science & Analytics, Americas at Think Big Analytics. There are exciting developments in the world of artificial intelligence that make it more important than ever for data scientists to understand the customer’s needs, reflect upon the wider context beyond those needs, and develop solutions that have a meaningful impact for the customer. I am looking forward to getting to know a talented team that is focused on the evolving needs of our customers and on delivering impactful data science consulting services.


AI & ML – Lessons learnt and real-world challenges

Last week, just before I flew back to Seattle, I gave a talk at my alma mater – the School of Computer Science & Engineering at UNSW, Australia. It was great to see some familiar faces and meet some new ones, who I hope now feel more compelled to tackle some interesting problems in data science, machine learning (ML) and artificial intelligence (AI).

In this talk, I shared some of the personal lessons that I learnt while building AI & ML solutions at companies like Amazon and Oracle. I also opened up about my fears of these technologies, as well as the challenges that the industry faces in delivering intelligent systems for the 99% (?) of businesses. You can find the slides from the talk (PDF) for the references and links that I mentioned. Just send an email to ( avishkar @ gmail dot com) with the subject “AI & ML” to get the password to the PDF.

The most important message that I wanted to impart to the room full of researchers, academics, and industry practitioners was a question: how do we collectively address the shortage of skills needed to develop AI and ML solutions for the broad range of business problems beyond the top 1% of leading-edge tech companies? Education, standards and automated tools can help ensure a certain base level of competency in the application of AI & ML.

AddressingSkillsShortage.jpg

The vast majority of businesses out there are not Google, Amazon or Facebook, with deep pockets and years of R&D experience to tackle the challenge of applying AI and ML. Everyone responsible for growing this field, from universities to industry, must also develop standards and tools that ensure a certain level of quality is maintained for the solutions that we put into production. We have long had standards in mechanical and civil engineering to ensure that things that can impact people’s lives and safety adhere to a certain level of quality. Similarly, we should develop standards for AI & ML solutions with far-reaching consequences, and encourage organizations to validate compliance with those standards.

BiasedDataBiasedModels.jpg

A simple and very personal example: one of my own photos was rejected by the automated checks that verify whether a passport photo complies with the requirements for visas. The fact that the slightly “browner” version of me (left) failed the check seems to suggest an inherent bias in the system due to the kind of data used to build it. Funny but scary. How many other “brown” people have had their photos rejected by such a system?

Other examples would be Human Resources systems that identify potential candidates, suggest hire/no-hire decisions or recommend salary packages for new hires. If the system is trained on historical data and uses gender as a feature, is it possible that the system could be biased against women for high-profile or senior positions? After all, women have historically been under-represented in senior positions. Standards and compliance verification tools can help us identify such biases, ensuring that data and models do not introduce biases that are unacceptable in a modern and equitable society.
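To make the idea of a compliance check a little more concrete, here is a minimal sketch in Python of one very simple bias test: comparing how often a system recommends hiring for each group. The data is made up, the function names are hypothetical, and a real audit would look at far more than a single ratio.

```python
from collections import defaultdict

def selection_rates(records):
    """Return the fraction of positive decisions for each group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for group, hired in records:
        totals[group] += 1
        positives[group] += int(hired)
    return {group: positives[group] / totals[group] for group in totals}

def disparity_ratio(rates):
    """Lowest selection rate divided by the highest (1.0 means parity)."""
    return min(rates.values()) / max(rates.values())

# Made-up decisions for illustration: (group, recommended_for_hire)
decisions = (
    [("men", True)] * 6 + [("men", False)] * 4
    + [("women", True)] * 3 + [("women", False)] * 7
)

rates = selection_rates(decisions)
ratio = disparity_ratio(rates)
print(rates, ratio)
```

In this toy data the system recommends men twice as often as women, so the ratio is well below 1.0. What threshold counts as unacceptable is exactly the kind of question that standards would need to answer.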

Academics, researchers, and industry practitioners cannot absolve themselves of the duty of care and consideration when developing systems that have a broad social impact. Data scientists must think beyond the accuracy metric and consider the whole ecosystem in which the system operates.

Image Credit:

  • Modeling API by H Alberto Gongora from the Noun Project
  • education by Rockicon from the Noun Project
  • tools by Aleksandr Vector from the Noun Project
  • Checklist by Ralf Schmitzer from the Noun Project

Plant Science Initiative @ NC State University

In my role at Oracle, I get to work across many industries on some very interesting problems. One that I have been involved with recently is the collaboration between North Carolina (NC) State University and Oracle with NC State’s Plant Science Initiative.

In particular, we’ve been working with the College of Agriculture and Life Sciences (CALS) to launch a big data project that focuses on sweet potatoes. The goal is to help geneticists, plant scientists, farmers and industry partners in the sweet potato industry to develop better varieties of sweet potatoes, as well as speed up the pace at which research is commercialized. The big question is: can we use the power of Big Data, Machine Learning, and Cloud computing to reduce the time it takes to develop and commercialize a new variety of sweet potato crop from 10 years to three or four years?

One of the well-known secrets to driving innovation is scaling and speeding up experimentation cycles. In addition, reducing the friction associated with collaborative research and development can help bring research to market more quickly.

My team is helping the CALS group to develop engagement models that facilitate interdisciplinary collaboration using the Oracle Cloud. Consider geneticists, plant science researchers, farmers, packers, and distributors of sweet potato being able to contribute their data and insights to optimize different aspects of the sweet potato production – sweet potato from the genetic sequence to the dinner plate.

I am extremely excited by the potential impact that open collaboration between various stakeholders can have on the sweet potato and precision agriculture industries.

More details at cals.ncsu.edu

It is a go for Amazon Go!

The super secret, exciting project that I spent days and nights slogging over when I was at Amazon has finally been announced – Amazon Go. It is a checkout-less, cashier-less, magical shopping experience in a physical store. Check out the video to get a sense of how it simplifies the customer experience (CX). Walk in, pick up what you need and walk out. No line, no waiting, no registers.

I’m very proud of an awesome team of scientists & engineers covering software, hardware, electrical and optics that rallied together to build an awesome solution combining machine learning, computer vision, deep learning and sensor fusion. The project was an exercise in iterative experimentation and continual learning, refining all aspects of the hardware and software as well as the innovative vision algorithms. I personally was involved in 5 different prototypes and the winning solution that ticked all the boxes more than 2 years ago.

I remember watching Jeff Bezos and the senior leadership at Amazon playing with the system by picking up items and returning them to the shelves. Smiles and high-fives all around as the products were added to and removed from the shopper’s virtual cart, with the correct quantity of each item.

Needless to say, there is a significant effort after the initial R&D is done to move something like this to production, so it is not surprising that it has taken 2 years since then to get it ready for the public. Well done to my friends at Amazon for getting the engineering solution over the line to an actual store launch in early 2017.

Photo Credit: Original Image by USDA – Flickr

 

Precision Agriculture needs Scalable Automation

If you have ever looked out of an airplane window as it flies over land, chances are that you have seen some spectacular landscapes, sprawling cities or a quilted patchwork of farms. Over the centuries, science, machines and better land management practices have increased agricultural outputs dramatically, allowing farmers to manage and cultivate ever larger swaths of land. The era of Big Data and Artificial Intelligence is pushing these productivity gains even further with Precision Agriculture. For example, satellite images can be used to evaluate the health of crops, to direct farming decisions, or to predict the likely yields at harvest time.

Earlier this year, my team at Oracle (chiefly Venu Mantha, Marta Maclean & Ashok Holla) worked with a large agricultural customer to help them shift towards a more data-driven agricultural approach that would maximize their yields and reduce waste. We explored a variety of technologies, from field sensors streaming measurements over Internet-of-Things (IoT) networks to the geospatial fusion of a variety of historical and real-time data. The idea is to support farmers with contextually relevant data, allowing them to make better decisions. Apart from people and process challenges associated with such a dramatic business transformation for an ancient sector, the major technological obstacle for realizing the potential of precision agriculture is, in fact, scalable automation.

farms_small.jpg

Let us take a concrete example. A key aspect of the proposal was the use of aerial or satellite imagery to assess the health of the crop. Acquiring satellite or aerial imagery on demand is significantly easier than it was just a few years ago, thanks to the growing number of vendors in the market and the falling cost of acquisition (e.g. Digital Globe, Free Data Sources). Now that we can get imagery that is high-resolution (i.e. down to the level of an individual tree), multi-spectral (i.e. color and infrared bands) and covers large expanses of land (i.e. 100 acres or more), the challenge has shifted to a well-recognized one in the world of Big Data – How do you sift through the large volumes of data to extract meaningful and actionable insights quickly?

If one image covers 100 acres and takes a Geographic Information System (GIS) specialist an hour to review manually, handling images for over 100,000 acres would mean a team of 40 GIS specialists working without a break for about 25 hours to go through the full batch of images. Clearly throwing more people at the problem is not going to work. Not only is it slow and error-prone, but finding enough specialists with domain knowledge would be a challenge.
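The back-of-the-envelope numbers above can be checked with a few lines of Python. All inputs come straight from the paragraph; only the variable names are mine.

```python
ACRES_TOTAL = 100_000          # area to be covered
ACRES_PER_IMAGE = 100          # area covered by one image
REVIEW_HOURS_PER_IMAGE = 1     # manual review time for one image
SPECIALISTS = 40               # size of the GIS team

images = ACRES_TOTAL // ACRES_PER_IMAGE
specialist_hours = images * REVIEW_HOURS_PER_IMAGE
elapsed_hours = specialist_hours / SPECIALISTS

print(images, specialist_hours, elapsed_hours)
```

That is 1,000 images and 1,000 specialist-hours, which a team of 40 gets through in 25 hours only if nobody stops for a break.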

The answer is to automate the image analysis pipelines and to use distributed computing to parallelize and speed up the analysis. Oracle’s Big Data Spatial & Graph (BDSG) is particularly well suited for partitioning, analyzing and stitching back large image blocks using the map-reduce framework of Hadoop. It understands common GIS and image file formats and gives the developer Java bindings to the OpenCV image processing library as part of its multimedia analytics capabilities. You can either split up a large image (raster or vector) and analyze each chunk in parallel, or analyze many images in parallel. You can write your own image processing algorithms or compose one from the fundamental image processing algorithms available in OpenCV.
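The split, analyze and stitch pattern can be sketched in plain Python. This is not the BDSG API and it does not use Hadoop or OpenCV: a thread pool stands in for the cluster, a nested list stands in for a raster, and the per-tile "analysis" is just a mean pixel value. All function names here are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_tiles(image, tile_size):
    """Return a list of (tile_row, tile_col, tile) blocks covering the image."""
    tiles = []
    for r in range(0, len(image), tile_size):
        for c in range(0, len(image[0]), tile_size):
            tile = [row[c:c + tile_size] for row in image[r:r + tile_size]]
            tiles.append((r // tile_size, c // tile_size, tile))
    return tiles

def analyze_tile(item):
    """Map step: placeholder analysis that returns the tile's mean pixel value."""
    tile_row, tile_col, tile = item
    pixels = [p for line in tile for p in line]
    return tile_row, tile_col, sum(pixels) / len(pixels)

def analyze_image(image, tile_size, workers=4):
    """Analyze all tiles in parallel, then stitch the results into a grid."""
    tiles = split_into_tiles(image, tile_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(analyze_tile, tiles))
    # Reduce step: place each tile's result at its position in the grid.
    n_rows = max(r for r, _, _ in results) + 1
    n_cols = max(c for _, c, _ in results) + 1
    grid = [[None] * n_cols for _ in range(n_rows)]
    for r, c, value in results:
        grid[r][c] = value
    return grid

# A tiny synthetic 8x8 "image" split into four 4x4 tiles.
image = [[(x + y) % 256 for x in range(8)] for y in range(8)]
summary = analyze_image(image, tile_size=4)
print(summary)
```

The point of the sketch is the shape of the pipeline: because each tile is analyzed independently, adding more workers (or more cluster nodes) shortens the elapsed time without changing the code that does the analysis.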

alignment_process.png

The challenge for the customer, however, was coming up with an algorithm that could correct the image misalignment that naturally creeps in during the image acquisition or stitching process. A misaligned image would require a GIS specialist to open and manually adjust the image using tools like ArcGIS from ESRI. Analyzing a misaligned image would produce incorrect results and could lead to bad decisions.

This is where the BDSG product engineering team (Siva Ravada, Juan Carlos Reyes & Zazhil ha Herena) and I stepped in to design and develop a solution to automate the image alignment and analysis processes. We have a patent application around the solution, which can be used in a variety of domains beyond farming – think of urban planning, defense, law enforcement, and even traffic reports.

Just the manual alignment of images would take a GIS expert 3-8 minutes per image. With our solution, the entire alignment and analysis process takes less than 90 seconds per image and can handle hundreds of images in parallel. Instead of a team of 40 GIS experts working without a break for 25 hours, we can now analyze imagery covering 100,000 acres in about 15 minutes.
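Here is how the 15-minute figure works out, as a rough sketch. The image count and the 90-second bound come from the text; the degree of parallelism is my assumption (I have taken "hundreds of images" to mean 100 at a time), so treat the result as an estimate.

```python
import math

IMAGES = 100_000 // 100     # 1,000 images at 100 acres each
SECONDS_PER_IMAGE = 90      # upper bound for alignment plus analysis
PARALLEL_IMAGES = 100       # assumption: images processed at the same time

batches = math.ceil(IMAGES / PARALLEL_IMAGES)
automated_minutes = batches * SECONDS_PER_IMAGE / 60

# Manual alignment alone, at 3 to 8 minutes per image, for comparison.
manual_alignment_hours = (IMAGES * 3 / 60, IMAGES * 8 / 60)

print(batches, automated_minutes, manual_alignment_hours)
```

Ten batches of 90 seconds gives 15 minutes, whereas manual alignment alone would take somewhere between 50 and roughly 133 specialist-hours before any analysis starts.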

The key lesson here is that although we can access interesting sensors and data sources to inform and guide Precision Agriculture, successful technological solutions require scalable automation that minimizes human effort rather than adding to it. The adoption of these solutions in practice further depends on the maturity of the organization in embracing change.

Want to know more or talk about how the Oracle Big Data & Analytics team can help your business objectives? Connect with me on LinkedIn and follow me on Twitter


What’s the (big) deal with AlphaGo?

In March of 2016, Google’s AlphaGo beat the world champion Lee Sedol at the game of Go, a feat hailed as an important milestone for Artificial Intelligence (AI). It was also a big deal for Deep Learning and Reinforcement Learning. But what was the big deal?

Let’s start with a simple game of Noughts and Crosses, also known as Tic-Tac-Toe. It is played on a 3×3 grid by 2 players who take turns placing O (noughts) and X (crosses), with the objective of getting 3 noughts or crosses in a row.

Source: Wikipedia

Naive counting leads to 19,683 possible board layouts (3^9 since each of the nine spaces can be X, O or blank), and 362,880 (i.e., 9!) possible games (different sequences for placing the Xs and Os on the board). – Wikipedia
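Both figures in the quote are easy to confirm in Python. The sketch below computes them directly and also enumerates every naive layout as a sanity check; it makes no attempt to exclude layouts that could never occur in a real game.

```python
import math
from itertools import product

# Every cell is X, O or blank, whether or not the layout is reachable in play.
naive_layouts = 3 ** 9
enumerated = sum(1 for _ in product("XO ", repeat=9))

# Number of orders in which the nine cells can be filled.
move_orders = math.factorial(9)

print(naive_layouts, enumerated, move_orders)
```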

Now we (i.e. humans) play the game without enumerating all possible board layouts or exploring all possible games. However, that is how computers are typically programmed to play the game. After each move, the computer generates a tree of moves, where each branch represents a sequence of moves. The computer generates the tree to a certain depth, identifies the ‘branches’ most likely to lead to a victory, and selects its next move from them. The process is repeated after the other player makes a move, until the computer or the human wins. This brute-force search-and-prune strategy is fundamentally how Deep Blue beat Garry Kasparov in 1997, and the game of chess has about 10^43 legal positions.
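For a game as small as Tic-Tac-Toe the whole tree can be searched, so a minimal sketch of the idea fits in a few lines of Python. This is plain minimax with memoization, searched all the way to the end of the game rather than to a fixed depth, and it is only an illustration of the tree-search idea, not a description of how Deep Blue was implemented.

```python
from functools import lru_cache

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return 'X' or 'O' if that player has three in a row, else None."""
    for a, b, c in LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

@lru_cache(maxsize=None)
def minimax(board, player):
    """Value of the position for X (+1 win, 0 draw, -1 loss) with `player` to move."""
    won = winner(board)
    if won:
        return 1 if won == "X" else -1
    if " " not in board:
        return 0
    other = "O" if player == "X" else "X"
    scores = [minimax(board[:i] + player + board[i + 1:], other)
              for i, cell in enumerate(board) if cell == " "]
    return max(scores) if player == "X" else min(scores)

def best_move(board, player):
    """Pick the empty cell whose resulting position is best for `player`."""
    other = "O" if player == "X" else "X"
    moves = [i for i, cell in enumerate(board) if cell == " "]

    def score(i):
        return minimax(board[:i] + player + board[i + 1:], other)

    return max(moves, key=score) if player == "X" else min(moves, key=score)

print(minimax(" " * 9, "X"))          # value of the empty board
print(best_move("XX OO    ", "X"))    # X completes the top row
```

The board is a 9-character string read row by row. From the empty board the value is 0, which is the familiar result that perfect play from both sides ends in a draw.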

Now let us look at the game of Go.

There are 1,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,

000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,

000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,

000,000,000,000,000,000 possible positions—that’s more than the number of atoms in the universe, and more than a googol times larger than chess. – Google Blog 

With 10^171 possible positions in Go (I counted the zeros), the brute-force search used by Deep Blue to beat Kasparov in chess just won’t work. So AlphaGo had to use intuition when selecting between moves, a bit like the way humans do.
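Python's arbitrary-precision integers make it easy to put these scales side by side. The sketch takes the chess figure as 10^43 and the Go figure as 10^171, as quoted in this post; both are estimates rather than exact counts.

```python
TIC_TAC_TOE_LAYOUTS = 3 ** 9
CHESS_POSITIONS = 10 ** 43     # approximate figure quoted in this post
GO_POSITIONS = 10 ** 171       # figure from the Google Blog quote
GOOGOL = 10 ** 100

zeros_in_go_figure = len(str(GO_POSITIONS)) - 1
go_to_chess_ratio = GO_POSITIONS // CHESS_POSITIONS

print(zeros_in_go_figure)
print(go_to_chess_ratio > GOOGOL)
```

The ratio works out to 10^128, which is indeed more than a googol, so the claim in the quote is consistent with the two figures.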

In this context, intuition means working with limited information and taking a shortcut by selecting only a tiny subset of options to arrive at a good move. The idea is similar to the idea of “thin-slicing” that Malcolm Gladwell discusses in his book Blink. Intuition is a function of experience and practice, introducing both desirable efficiencies and undesirable biases into our daily decision making, but that is a different post for another day.

During the match against Fan Hui, AlphaGo evaluated thousands of times fewer positions than Deep Blue did in its chess match against Kasparov; compensating by selecting those positions more intelligently, using the policy network, and evaluating them more precisely, using the value network – an approach that is perhaps closer to how humans play. Furthermore, while Deep Blue relied on a handcrafted evaluation function, the neural networks of AlphaGo are trained directly from gameplay purely through general-purpose supervised and reinforcement learning methods. – Nature Paper

This means that, given data for supervised learning (i.e. deep learning) and time to practice (i.e. reinforcement learning), AlphaGo could continue to get better and better without relying on human experts to come up with heuristics or evaluation functions.

This was an extremely promising sign for Artificial Intelligence (AI), taking us one tiny tiny (did I mention tiny?) step closer to general purpose AI. Deep learning is used to reduce the feature engineering that would typically go into setting up a machine learning algorithm. Reinforcement learning mimics the idea of practice or experimentation that we as humans use in learning how to play tennis, play an instrument or make a drawing.

Learning to mimic intuition with deep learning, without explicit feature engineering, and practicing it via reinforcement learning offers a very interesting template for teaching machines how to deal with the kind of problems that humans excel at with their intuition. It also represents a way for machines to operate on problems with large search spaces (e.g. traveling salesman problems, scheduling, and optimization) and get reasonably good solutions given a fair trade-off of time and compute resources. This is why AlphaGo is a big deal.


Oracle’s Big Data & Analytics Platform for Data Scientists

I work for Oracle, helping businesses realize the potential of data science, big data and machine learning to grow their revenues, minimize their costs and open up new opportunities to leap-frog their competition. That means working with some amazing folks from different parts of businesses and across industries. Invariably I’m asked – So, what does Oracle offer in the space of data science and machine learning on Big Data?

Let’s leave aside the machine learning and optimization solutions embedded within different Oracle products, and just focus on the platform pieces for Big Data and Analytics for today. Let’s also ignore for the moment the data management questions around security, encryption and integration, which are important but chiefly the concern of the IT department. Let’s focus only on what the platform offers for data analysts and data scientists.

Oracle’s Big Data & Analytics Platform enables data science and machine learning at scale by taking the best that open-source offers, putting it together as an engineered solution and adding capabilities and features where open-source falls short.

Oracle Big Data Cloud Service (BDCS) is essentially Hadoop/Spark in a “Box” (or rather a number of dedicated cloud-based machines connected with a 40Gb/sec InfiniBand fabric, making network IO between cluster nodes very fast). It runs the Cloudera Enterprise version of Hadoop on engineered hardware optimized for speeding up analytics. Analysts can use Python, R, Scala and Java for data manipulation, analytics and machine learning using open-source libraries such as SparkML. Python users such as myself can use open-source libraries (e.g. numpy, scipy, pandas, scikit-learn, seaborn, folium) inside Jupyter notebooks via the PySpark kernel for operating on distributed datasets.

Out of the box, R users don’t get all the benefits of SparkML, so Oracle R Advanced Analytics for Hadoop (ORAAH) addresses that gap, giving R users access to SparkML implementations of machine learning algorithms. In addition, ORAAH’s own implementations of Linear Regression, Generalized Linear Models and Neural Networks are faster and more efficient than the open-source implementations within SparkML. In experiments run by Marcos Arancibia’s team, ORAAH’s LM model training was 6x-32x faster than SparkMLlib. Similarly, GLM models trained by ORAAH were 4x-15x faster than SparkMLlib. More importantly, ORAAH continues to scale linearly despite memory constraints, whereas SparkMLlib just fails.

oraah_vs_spark_ml.png

ORAAH is available for on-premise (BDA) and the cloud (BDCS).

But not everyone can code or should have to code to transform and explore data in Hadoop. Oracle Big Data Discovery (BDD) provides “citizen data-scientists” and data analysts with an interactive way to find, transform and visually discover patterns or relationships within the data stored in Hadoop. It works by keeping a sample of the Hadoop data in memory and automatically generating graphs that describe the shape of each attribute, allowing users to manipulate that data interactively.

Once the analyst is comfortable with the transformations, he or she can apply them to the full dataset with a click of a button. It is a very nice tool for data analysts and data scientists alike in preparing a dataset before switching to Jupyter or RStudio to use the distributed machine learning algorithms in Spark or ORAAH.

Data isn’t always in a tabular form, nor does it always make sense to analyze it that way. The Spatial component of Oracle Big Data Spatial & Graph (BDSG) scales up the analysis of images and geo-spatial data using Hadoop and OpenCV with a Java interface. Just last week I finalized a patent application on a method to automate the alignment and analysis of aerial and satellite imagery to known structures, which I had prototyped earlier this year. For one potential customer, scaling their operations to cover 100,000 acres of agricultural properties no longer requires hiring a team of 40 GIS specialists and making them work round the clock to keep up with the volume of imagery expected each week.

The Graph component of BDSG provides an in-memory graph engine and algorithms for fast property graph analysis. The in-memory graph engine can handle graphs with 20-30 billion edges on a single node, scale out to multiple nodes as a graph grows beyond the limits of a single node, and perform 10-50x faster than other graph engines for finding communities, optimal paths and even product recommendations.

Analysts who have been using Oracle Advanced Analytics (OAA) as part of the Oracle Database to train machine learning models within the database or using R can continue to use the same interface while bringing in data from Hadoop or NoSQL databases via Oracle Big Data SQL. Big Data SQL pushes the predicate (i.e. query processing and filtering) to Hadoop or the NoSQL database and pulls only the smaller filtered dataset across to the relational database. This allows analysts to use SQL, Oracle Data Miner or R while manipulating and joining datasets in Hadoop and the database.
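The benefit of pushing the predicate down is easiest to see by counting how many rows have to cross the network. The toy sketch below is plain Python, not Big Data SQL: `remote_scan` is a hypothetical stand-in for the Hadoop or NoSQL side, and the sensor data is made up.

```python
def remote_scan(rows, predicate=None):
    """Stand-in for the remote store: optionally filters rows before shipping them."""
    for row in rows:
        if predicate is None or predicate(row):
            yield row

# Made-up data living on the "Hadoop" side.
hadoop_rows = [{"sensor": i % 5, "reading": i} for i in range(10_000)]

def wanted(row):
    return row["sensor"] == 3 and row["reading"] > 9_900

# Without pushdown: ship every row, then filter on the database side.
shipped_all = list(remote_scan(hadoop_rows))
filtered_locally = [row for row in shipped_all if wanted(row)]

# With pushdown: the filter runs where the data lives.
shipped_filtered = list(remote_scan(hadoop_rows, predicate=wanted))

print(len(shipped_all), len(shipped_filtered))
```

Both approaches return the same rows, but with pushdown only 20 rows are transferred instead of 10,000.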

Once the analysis is done, it is time to tell the story. Oracle Data Visualization (DV) is an interactive data visualization and presentation platform, available as a desktop application or a cloud service, that lets business intelligence teams, analysts and scientists visually reveal the story hidden within the data.

There are also a number of things that were announced at Oracle Open World 2016 and are coming soon. One of the most exciting for data science is the Big Data Cloud Service – Compute Edition (BDCS-CE). It is an on-demand, elastic Hadoop/Spark compute cluster, allowing data scientists to spin up clusters as needed, scale them up as required and tear them down afterwards. From an analyst’s perspective, it is a perfect environment to sandbox ad-hoc queries and experimentation, before operationalizing these as analytics pipelines. There is also the Event Hub Cloud Service, which provides a Kafka-based streaming data platform.
