Impactful Data Scientists

In 2012, Davenport and Patil’s Harvard Business Review article, Data Scientist: The Sexiest Job of the 21st Century, raised the profile of a profession that had been evolving naturally in the modern computing era, an era in which data and computing resources are more abundant and cheaper than ever before. Industry leaders were also shifting towards a more open, evidence-based approach to guiding the growth of their businesses. Brilliant data scientists with machine learning and artificial intelligence expertise are invaluable in supporting this new normal.

While there are different opinions on what defines a data scientist, as the leader of the Data Science Practice at Think Big Analytics, the consulting arm of Teradata, I expect the data scientists on my team to embody specific characteristics. This expectation is founded on a simple question: are you having a measurable and meaningful impact on the business outcome?

Any data scientist can dig into data, use statistical techniques to find insights, and make recommendations for their business partners to consider. A good data scientist makes sure that the business adopts those insights and recommendations by focusing on the problems that are important to the company and making a compelling case grounded in business value. An impactful data scientist can iterate quickly, address a wide variety of business problems for the organization, and deliver meaningful business impact swiftly by using automation and getting their insights integrated into production systems. Consequently, impactful data scientists more often answer ‘yes’ to the question above.

So what makes a data scientist impactful? In my experience, they possess skillsets that I broadly characterize as those of a scientist, a programmer, and an effective communicator. Let us look at each of these in turn.

what_is_a_data_scientist_2.png

Firstly, they are scientists. Data scientists work in highly ambiguous situations and operate on the edge of uncertainty. Not only are they trying to answer the question, they often have to determine what the question is in the first place. They have to ask vital questions to understand the context quickly, identify the root of the problem that is worth solving, research and explore the myriad possible approaches and, most of all, manage the risk and impact of failure. If you are a scientist or have undertaken research projects, you will recognize these as the traits of a scientist immediately.

In addition, data scientists are also programmers. Traditional mathematicians, statisticians, and analysts who are comfortable using GUI-driven analytical workbenches that allow them to import data and build models with a few clicks often contest this expectation. They argue that they don’t need computer science skills since they are supported by (a) a team of data engineers to find and cleanse their data, and (b) software engineers to take their models and operationalize them by rewriting them for the production environment. However, what happens when the data engineers are busy, or the IT department’s sprint backlog means the model that a data scientist has just found to make the company millions won’t make it to production for the next 6-9 months? They wait, and their amazing insights have no impact on the business.

Programming and computer science skills are essential for data scientists so that they are not ‘blocked’ by organizational constraints. A data scientist shouldn’t have to wait for someone else to find and wrangle the data they need, nor be afraid of getting their hands dirty with the code to ensure their models make it to production. It also means data scientists do not become a bottleneck for their organization, because they can automate their solutions for production or automatic reporting. The highly distributed, high-volume transactions in online, mobile and IoT applications also mean data scientists need to design their solutions for scale. For example, will their real-time personalization model scale to the 100,000 requests per second for their company’s website and mobile app?
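One rough way to reason about that scaling question is Little’s law: the number of requests in flight equals the arrival rate multiplied by the time each request takes. The sketch below is only a back-of-the-envelope estimate. The function name, the latencies tried and the 70% utilization target are all assumptions made for illustration, and it assumes each worker scores one request at a time.

```python
import math


def workers_needed(requests_per_second, latency_ms, target_utilization=0.7):
    """Estimate how many one-request-at-a-time workers a model service needs.

    Little's law: requests in flight = arrival rate x time per request.
    target_utilization (an assumed value) leaves headroom for traffic spikes.
    """
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    in_flight = requests_per_second * latency_ms / 1000
    return math.ceil(in_flight / target_utilization)


if __name__ == "__main__":
    # Hypothetical per-prediction latencies; measure your own model instead.
    for latency_ms in (5, 20, 100):
        count = workers_needed(100_000, latency_ms)
        print(f"{latency_ms} ms per prediction -> about {count} workers")
```

The point of the exercise is that model latency, not just accuracy, decides whether a design is feasible at this request rate.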

Finally, a data scientist should be an effective two-way communicator. Not only should they empathize to understand the business context and customer needs, they should also convey the value of their work in a manner that appeals to their audience. One of the hardest skills for some knowledgeable data scientists to master is the ability to influence organizations without authority. A data scientist who goes around asserting that everyone should listen to them because they have data and insights, without cultivating trust, is likely to earn the title of prima donna and fail to achieve the impact those insights could have. Effective communication is relatable, precise and concise.

Data scientists with these three broad skillsets are in an excellent position to have a meaningful and measurable impact on business outcomes, making them highly valuable to any organization. Of course, this list doesn’t talk about innate abilities like creativity, bias for action and a sense of ownership. Neither does it consider the organizational culture that may either support or hinder their impact. I have focused on skills that can be developed through training and practice. In fact, these are essential elements of the growth and career paths of my team of brilliant and impactful data scientists at Think Big Analytics.


Plant Science Initiative @ NC State University

In my role at Oracle, I get to work across many industries on some very interesting problems. One that I have been involved with recently is the collaboration between North Carolina (NC) State University and Oracle with NC State’s Plant Science Initiative.

In particular, we’ve been working with the College of Agriculture and Life Sciences (CALS) to launch a big data project that focuses on sweet potatoes. The goal is to help geneticists, plant scientists, farmers and industry partners in the sweet potato industry develop better varieties of sweet potatoes, as well as speed up the pace at which research is commercialized. The big question is: can we use the power of Big Data, Machine Learning, and Cloud computing to reduce the time it takes to develop and commercialize a new variety of sweet potato crop from 10 years to three or four years?

One of the well-known secrets to driving innovation is scaling and speeding up experimentation cycles. In addition, reducing the friction associated with collaborative research and development can help bring research to market more quickly.

My team is helping the CALS group to develop engagement models that facilitate interdisciplinary collaboration using the Oracle Cloud. Imagine geneticists, plant science researchers, farmers, packers, and distributors of sweet potatoes being able to contribute their data and insights to optimize different aspects of sweet potato production, from the genetic sequence to the dinner plate.

I am extremely excited by the potential impact that open collaboration between the various stakeholders can have on the sweet potato and precision agriculture industries.

More details at cals.ncsu.edu

Oracle’s Big Data & Analytics Platform for Data Scientists

I work for Oracle, helping businesses realize the potential of data science, big data and machine learning to grow their revenues, minimize their costs and open up new opportunities to leap-frog their competition. That means working with some amazing folks from different parts of businesses and across industries. Invariably I’m asked: so, what does Oracle offer in the space of data science and machine learning on Big Data?

Let’s leave aside the machine learning and optimization solutions embedded within different Oracle products, and just focus on the platform pieces for Big Data and Analytics for today. Let’s also ignore for the moment the data management questions around security, encryption and integration, which are important but chiefly the concern of the IT department. Let’s focus only on what the platform offers data analysts and data scientists.

Oracle’s Big Data & Analytics Platform enables data science and machine learning at scale by taking the best that open-source offers, putting it together as an engineered solution and adding capabilities and features where open-source falls short.

Oracle Big Data Cloud Service (BDCS) is essentially Hadoop/Spark in a “Box” (or rather a number of dedicated cloud-based machines connected by a 40Gb/sec InfiniBand fabric, which makes network IO between cluster nodes very fast). It runs the Cloudera Enterprise distribution of Hadoop on engineered hardware optimized to speed up analytics. Analysts can use Python, R, Scala and Java for data manipulation, analytics and machine learning using open-source libraries such as SparkML. Python users such as myself can use open-source libraries (e.g. numpy, scipy, pandas, scikit-learn, seaborn, folium) inside Jupyter notebooks via a PySpark kernel to operate on distributed datasets.

Out of the box, R users don’t get all the benefits of SparkML, so Oracle R Advanced Analytics for Hadoop (ORAAH) addresses that gap, giving R users access to SparkML implementations of machine learning algorithms. In addition, ORAAH’s own implementations of Linear Regression, Generalized Linear Models and Neural Networks are faster and more efficient than the open-source implementations within SparkML. In experiments run by Marcos Arancibia’s team, ORAAH’s LM model training was 6x-32x faster than SparkMLlib. Similarly, GLM models trained by ORAAH were 4x-15x faster than SparkMLlib. More importantly, ORAAH continues to scale linearly despite memory constraints, whereas SparkMLlib just fails.

oraah_vs_spark_ml.png

ORAAH is available both on-premises (BDA) and in the cloud (BDCS).

But not everyone can code, or should have to code, to transform and explore data in Hadoop. Oracle Big Data Discovery (BDD) provides “citizen data scientists” and data analysts with an interactive way to find, transform and visually discover patterns or relationships within the data stored in Hadoop. It works by keeping a sample of the Hadoop data in memory, automatically generating graphs that describe the shape of each attribute, and allowing users to interactively manipulate that data.

Once the analyst is comfortable with the transformations, he or she can apply them to the full dataset with a click of a button. It is a very nice tool for data analysts and data scientists alike in preparing a dataset before switching to Jupyter or RStudio to use the distributed machine learning algorithms in Spark or ORAAH.
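The explore-on-a-sample, then apply-to-everything workflow can be sketched in a few lines of plain Python. This is not how BDD is implemented and uses none of its APIs; the function names, the weight_g attribute and the fill-missing-values transformation are all hypothetical, chosen only to illustrate the pattern.

```python
import random
import statistics


def reservoir_sample(rows, k, seed=0):
    """Keep a uniform random sample of k rows from a stream of any length."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            sample.append(row)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = row
    return sample


def profile(sample, attribute):
    """Summarize one attribute of the sample (a text stand-in for a chart)."""
    values = [row[attribute] for row in sample if row.get(attribute) is not None]
    summary = {"count": len(values), "missing": len(sample) - len(values)}
    if values:
        summary.update(min=min(values), median=statistics.median(values), max=max(values))
    return summary


def fill_missing_weight(row, default=0.0):
    """Hypothetical transformation designed while looking at the sample."""
    cleaned = dict(row)
    if cleaned.get("weight_g") is None:
        cleaned["weight_g"] = default
    return cleaned


def apply_to_all(rows, transform):
    """Lazily apply the agreed transformation to every row of the full dataset."""
    return (transform(row) for row in rows)


if __name__ == "__main__":
    # Made-up dataset: every tenth row is missing its weight.
    full_dataset = [
        {"id": i, "weight_g": None if i % 10 == 0 else 100.0 + i}
        for i in range(10_000)
    ]
    sample = reservoir_sample(full_dataset, k=200)
    print(profile(sample, "weight_g"))
    cleaned = list(apply_to_all(full_dataset, fill_missing_weight))
    print(len(cleaned), "rows transformed")
```

Working on a small in-memory sample keeps the exploration interactive, while the transformation itself stays a plain function that can later be run over the whole dataset.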

Data isn’t always in a tabular form, nor does it always make sense to analyze it that way. The Spatial component of Oracle Big Data Spatial & Graph (BDSG) scales up the analysis of images and geo-spatial data using Hadoop and OpenCV with a Java interface. Just last week I finalized a patent application on a method, which I had prototyped earlier this year, to automate the alignment and analysis of aerial and satellite imagery to known structures. For one potential customer wanting to scale their operations to cover 100,000 acres of agricultural properties, this means they no longer need to hire a team of 40 GIS specialists and make them work round the clock to keep up with the volume of imagery expected each week.

The Graph component of BDSG provides an in-memory graph engine and algorithms for fast property graph analysis. The in-memory graph engine can handle graphs with 20-30 billion edges on a single node, scale out to multiple nodes as a graph grows beyond the limits of a single node, and perform 10-50x faster than other graph engines for finding communities, optimal paths and even product recommendations.
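To make “optimal paths” on a property graph concrete, here is a minimal single-machine sketch using Dijkstra’s algorithm in plain Python. It does not use the BDSG API; the supply-chain graph, its node names and the cost property are invented for illustration.

```python
import heapq


def shortest_path(edges, source, target):
    """Return (total_cost, path) for the cheapest route, or None if unreachable.

    edges maps each node to a list of (neighbor, properties) pairs, where
    properties is a dict carrying a non-negative "cost".
    """
    best = {source: 0.0}
    previous = {}
    queue = [(0.0, source)]
    while queue:
        cost, node = heapq.heappop(queue)
        if node == target:
            break
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry; a cheaper route was already found
        for neighbor, props in edges.get(node, []):
            new_cost = cost + props["cost"]
            if new_cost < best.get(neighbor, float("inf")):
                best[neighbor] = new_cost
                previous[neighbor] = node
                heapq.heappush(queue, (new_cost, neighbor))
    if target not in best:
        return None
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    path.reverse()
    return best[target], path


if __name__ == "__main__":
    # Hypothetical supply-chain graph; edge properties hold a transport cost.
    graph = {
        "farm": [("packer", {"cost": 4}), ("depot", {"cost": 2})],
        "depot": [("packer", {"cost": 1}), ("store", {"cost": 8})],
        "packer": [("store", {"cost": 5})],
    }
    print(shortest_path(graph, "farm", "store"))
```

A dedicated graph engine earns its keep when the graph has billions of edges; the algorithmic idea stays the same.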

Analysts who have been using Oracle Advanced Analytics (OAA) as part of the Oracle Database to train machine learning models within the database or using R can continue to use the same interface while bringing in data from Hadoop or NoSQL databases via Oracle Big Data SQL. Big Data SQL pushes the predicate (i.e. query processing and filtering) to Hadoop or the NoSQL database and pulls across only the smaller, filtered dataset to the relational database. This allows analysts to use SQL, Oracle Data Miner or R while manipulating and joining datasets in Hadoop and the database.
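The benefit of pushing the predicate down can be illustrated with a toy model in plain Python. This is not Big Data SQL and involves no real Hadoop or database connection; RemoteStore, the two query functions and the made-up dataset are hypothetical stand-ins that simply count how many rows cross the boundary between the remote store and the database.

```python
class RemoteStore:
    """Stand-in for a Hadoop or NoSQL source; counts the rows it ships out."""

    def __init__(self, rows):
        self._rows = list(rows)
        self.rows_shipped = 0

    def scan(self, predicate=None):
        """Yield rows, filtering at the source when a predicate is supplied."""
        for row in self._rows:
            if predicate is None or predicate(row):
                self.rows_shipped += 1
                yield row


def query_without_pushdown(store, predicate):
    """Pull every row across, then filter on the database side."""
    return [row for row in store.scan() if predicate(row)]


def query_with_pushdown(store, predicate):
    """Send the predicate to the store so only matching rows are shipped."""
    return list(store.scan(predicate))


if __name__ == "__main__":
    # Made-up dataset: one row in ten belongs to the state we care about.
    rows = [{"id": i, "state": "NC" if i % 10 == 0 else "other"} for i in range(10_000)]

    def is_nc(row):
        return row["state"] == "NC"

    naive, pushed = RemoteStore(rows), RemoteStore(rows)
    assert query_without_pushdown(naive, is_nc) == query_with_pushdown(pushed, is_nc)
    print("rows shipped without pushdown:", naive.rows_shipped)
    print("rows shipped with pushdown:", pushed.rows_shipped)
```

Both queries return the same result; the difference is only in how much data has to move, which is what matters when the remote dataset is large.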

Once the analysis is done, it is time to tell the story. Oracle Data Visualization (DV) is an interactive data visualization and presentation platform, available as a desktop application or a cloud service, that lets business intelligence teams, analysts and scientists visually reveal the story hidden within the data.

There are also a number of things that were announced at Oracle OpenWorld 2016 and are coming soon. One of the most exciting for data science is the Big Data Cloud Service – Compute Edition (BDCS-CE). It is an on-demand, elastic Hadoop/Spark compute cluster, allowing data scientists to spin up clusters as needed, scale them up as required and tear them down afterwards. From an analyst’s perspective, it is a perfect environment to sandbox ad-hoc queries and experimentation before operationalizing these as analytics pipelines. There is also the Event Hub Cloud Service, which provides a Kafka-based streaming data platform.

Want to know more, or talk about how the Oracle Big Data & Analytics team can help you meet your business objectives? Connect with me on LinkedIn and follow me on Twitter.