Chief Data Scientist at Reorg, a global provider of credit intelligence, data and analytics, and Adjunct at UVA’s School of Data Science.
Many new companies are hiring data scientists with the goal of leveraging their existing data to create business growth. I was hired five years ago to establish data science at Reorg. Throughout this time, I worked with many teams — business, commercial, technology, product and C-level. Through collaboration on various projects, I discovered the importance of educating non-technical stakeholders on the strengths and weaknesses of data science approaches. While data scientists are often seen as excellent problem solvers, they cannot solve business problems alone.
In this short article, I will share five key points for subject matter experts (SMEs) to keep in mind when scaling data science across an organization.
1. Patience is genius.
Conceptualizing, planning, building and deploying a data science model is an iterative process. Even when a large amount of training data is available, data scientists seldom get the best model on the first run. The first meeting with the data science team can be used to construct a 30,000-foot view of the problem, define it clearly and identify specific model goals. Getting familiar with the data, exploring features, selecting a methodology and understanding the data's limitations often consumes the data scientists' attention on the first iteration.
It is important for SMEs to be prepared to spend time with the data science team throughout the model-building process to ensure the best outcome. This can involve providing additional training data or hosting a working session so that the data science team understands how the problem is solved currently and how model output will be used. It is also during subsequent iterations that SMEs can play a crucial role in educating the data science team about the nuances of the data and maintaining clarity regarding model objectives.
2. Knowledge is power.
Data scientists don’t know what you know. Your domain knowledge is crucial to building a successful data science model. Data scientists are good at wrangling data, cleaning it and fitting a model, and none of that necessarily requires understanding what the data means. However, knowing what the data represents and how it will be used gives a data scientist a real edge in building a strong model.
We spent several hours with domain experts such as financial analysts, legal analysts and covenant analysts to learn from them, understand the challenges they face and understand how to apply their knowledge. This empowered our team to build successful models that aren’t black boxes.
3. All models are wrong.
No data science model reaches 100% accuracy, and no model solves the entirety of a problem. Errors are guaranteed, in the form of both false positives and false negatives. It helps to acknowledge this at the outset and plan proactively for how errors will be handled or communicated to clients. Options include having human reviewers (for example, via Amazon Mechanical Turk) fix errors as they arise, or adding a disclaimer about the error rate to a client-facing platform. That said, the best part you can play in handling this is analyzing the model output alongside the data science team and determining an optimal decision threshold that balances false positive and false negative errors based on business needs.
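Threshold tuning of this kind can be made concrete with a small sketch. The scores, labels and error costs below are hypothetical, purely for illustration; in practice the cost ratio would come from the SMEs' assessment of which error hurts the business more.

```python
# Hypothetical model scores paired with true labels (1 = positive).
scored = [(0.95, 1), (0.80, 1), (0.70, 0), (0.55, 1),
          (0.40, 0), (0.30, 0), (0.20, 1), (0.10, 0)]

# Business-driven error costs: here a missed positive (false negative)
# is assumed to hurt three times as much as a false alarm.
COST_FP, COST_FN = 1.0, 3.0

def total_cost(threshold):
    """Weighted cost of the errors made at a given decision threshold."""
    fp = sum(1 for score, label in scored if score >= threshold and label == 0)
    fn = sum(1 for score, label in scored if score < threshold and label == 1)
    return COST_FP * fp + COST_FN * fn

# Sweep candidate thresholds and keep the cheapest one.
candidates = [t / 100 for t in range(5, 100, 5)]
best = min(candidates, key=total_cost)
```

Changing the cost ratio shifts the chosen threshold, which is exactly the business decision SMEs are best placed to inform.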
4. Diminishing returns of scale.
Refinement takes time and effort. Bringing a data science model from prototype or proof-of-concept performance to a final, acceptable level of output requires repeated retraining and fine-tuning. Through these iterations, accuracy gradually improves until it plateaus at some level below 100%, so expectations about model potential need to be managed carefully.
Improvements in data science model accuracy tend to be gradual and nonlinear. For example, a prototype model may have 70% accuracy. After one week spent training and tuning the model, the accuracy then goes up to 80%. It then might take multiple weeks to go from 80% to 90% accuracy. At this point, greater time and effort are required to increase model accuracy, and it might take months to go from 90% to 92% accuracy. These diminishing returns can be thought of as a “cost-to-lift” ratio, or the ratio of investment in model building to the corresponding lift in model accuracy. Reaching the highest possible accuracy sometimes can even require rebuilding the model from scratch due to limitations in selected methods or data.
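The cost-to-lift arithmetic can be sketched directly from the figures above. The week counts for the later stages are assumptions for illustration ("multiple weeks" taken as 3, "months" as roughly 8 weeks), not measured values.

```python
def cost_to_lift(weeks, lift_points):
    """Weeks of effort invested per percentage point of accuracy gained."""
    return weeks / lift_points

# Figures from the example in the text; later week counts are assumed.
stages = [
    ("70% -> 80%", 1, 10),   # 1 week,  +10 points
    ("80% -> 90%", 3, 10),   # ~3 weeks (assumed), +10 points
    ("90% -> 92%", 8, 2),    # ~8 weeks (assumed), +2 points
]

for name, weeks, lift in stages:
    print(f"{name}: {cost_to_lift(weeks, lift):.1f} weeks per accuracy point")
```

The ratio climbing from a tenth of a week per point to several weeks per point is the diminishing return the section describes.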
5. Outliers don’t lie.
In my opinion, outliers are the most fascinating data points, and they often reveal interesting facts about the data set as a whole. Though they do not fall within a normal range, they tell important truths about the extremities of the data and can offer insights into the way the data was collected, processed and stored.
Data science models are typically trained to identify prominent, recurring patterns in data. It is common, however, for a data set to contain outliers or anomalies. Chasing each outlier and attempting to fit it into the model can lead to overfitting and endanger overall efficacy. While it is desirable to get as close to 100% accuracy as possible, it is important to strike a balance between underfitting and overfitting. To deal with the inevitable edge cases and outliers present in the data, SMEs and data scientists should agree on a strategy for handling them. One option is to develop a heuristic or rule that can be applied to model results to return more appropriate results for edge cases.
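A minimal sketch of such an SME-supplied rule layered on top of model output might look like the following. The record fields, labels and the rule itself are hypothetical examples, not any particular production logic.

```python
def apply_sme_rules(record, model_label):
    """Override the model's label when an agreed edge-case rule applies.

    Hypothetical rule agreed with domain experts: very short documents
    are almost always cover pages, regardless of the model's prediction.
    """
    if record["word_count"] < 50:
        return "cover_page"
    return model_label

# Edge case: the rule overrides the model's label.
label_short = apply_sme_rules({"word_count": 20}, "credit_agreement")

# Normal case: the model's label passes through untouched.
label_long = apply_sme_rules({"word_count": 500}, "credit_agreement")
```

Keeping such rules outside the model preserves the model's fit to the dominant patterns while still handling the outliers the SMEs care about.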
In conclusion, SMEs play a vital role in successful model development. Clear communication between SMEs and data scientists to correctly align business needs with data science capabilities can unlock the potential of your data science team.