AI for Startups: Solving Problems Using Machine Learning
Introduction
Startup companies bring some of the most innovative products and services to market, yet often do so with minimal tools, manual processes, and staff who stretch the bounds of their expertise to get things done. Artificial intelligence (AI), and in particular machine learning (ML) and deep learning (DL), is helping startups address these issues by automating processes and reducing workloads. In these "AI-later" startups, AI is not part of the product itself but is implemented later to help solve business problems, improve processes, and increase potential business value.
Sometimes automation can be hand-engineered and built on expert knowledge of the human-led process. But when the volume, complexity, or variability of the available data makes this method intractable, machine learning can be of real benefit. This article aims to provide a high-level overview of determining AI inputs and outputs, working with datasets, exploring data possibilities, and finalizing an AI model (Figure 1).
Figure 1: These steps highlight the process of using machine learning to solve common startup problems. (Source: Author)
Step 1: Identify Inputs, Outputs, and Metrics
One of the first steps in the process is deciding on the inputs and outputs for your algorithm as well as choosing the right metrics for measuring its performance. These decisions should be influenced by business goals and technical constraints. For example, the availability and volume of data, as well as privacy requirements, can affect data inputs, as can the need for consistency in file formats and the need to store data.
In most cases, the data input will be fairly obvious, such as text, images, or numeric values, and will require minimal pre-processing before it can be used. However, pre-processing of the outcome data is likely needed to produce a single value to label each input data point. For example, a company might want to triage customer service complaints received via email, or it might want to suggest other items to purchase based on the items in a buyer's shopping cart. In either case, the outcome data needs adapting into one label per data point: the urgency level of each email complaint, or the product codes that pair with the items already in the cart.
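As a minimal sketch, the snippet below (Python with pandas, using made-up emails and purely illustrative keyword rules) shows what adapting outcome data into one label per input might look like for the email-triage example:

```python
import pandas as pd

# Hypothetical example: raw support emails adapted into single urgency labels.
# The keyword rules here are illustrative, not a production triage policy.
emails = pd.DataFrame({
    "text": [
        "My account was charged twice, please refund immediately!",
        "How do I change my profile picture?",
        "The service has been down for 3 hours and we are losing sales.",
    ]
})

def urgency_label(text: str) -> str:
    """Map a raw email to exactly one label per data point."""
    urgent_keywords = ("immediately", "down", "losing", "refund")
    return "urgent" if any(word in text.lower() for word in urgent_keywords) else "routine"

emails["label"] = emails["text"].apply(urgency_label)
print(emails)
```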
Choosing the right metrics for measuring the success of a model depends on how important it is to achieve a certain accuracy on a particular outcome in the data (Figure 2). Although it can seem logical to aim for high overall accuracy, this is not always appropriate. In fraud detection, for example, catching potential fraud matters more than correctly predicting every event. As a rule of thumb, if you choose a metric that favors catching a rarely occurring class, many non-events will likely get flagged as well. In this case, including a human-in-the-loop (HitL) to finalize results can help filter the false alarms without compromising the ability to detect fraudulent transactions.
Figure 2: Accuracy, the proportion of correct responses, relates to the sensitivity and specificity of the model. (Source: The Wordstream Blog)
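To see why overall accuracy can mislead on rare events, here is a small illustration using scikit-learn's metrics on a toy fraud dataset (all numbers invented for illustration):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Toy fraud data: 1 = fraud, 0 = legitimate. Only 2 of 20 events are fraud.
y_true = [0] * 18 + [1] * 2

# A model that never predicts fraud still scores 90% accuracy...
y_naive = [0] * 20
print(accuracy_score(y_true, y_naive))                 # 0.9
print(recall_score(y_true, y_naive, zero_division=0))  # 0.0 -- misses all fraud

# A model that flags fraud aggressively trades accuracy for recall.
y_eager = [0] * 15 + [1] * 5
print(accuracy_score(y_true, y_eager))    # 0.85
print(precision_score(y_true, y_eager))   # 0.4 -- some false alarms for a human to review
print(recall_score(y_true, y_eager))      # 1.0 -- catches all the fraud
```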
In selecting metrics, it is worth investigating what others have done and recommended as well as starting the data aggregation and cleaning process. In some cases, the data is not yet in the state you need, or data collection can be adjusted to make it more “AI friendly.”
Step 2: Prepare the Data
In general, AI models expect data to arrive in one consistent format. This step entails cleaning and transforming data to meet the standards required by the AI model and goals, which can be both time-consuming and complex. Often a data engineer will be brought in to handle infrastructure, storage, and pipelines for data ingestion.
First, each input needs a corresponding label or target that you want to predict. For example, if you have 100 images of dogs, each image needs to be identified as a dog. This can be accomplished with simple methods, such as listing filenames and labels in a CSV file or storing the images in a separate folder called "dogs." Almost all classification algorithms also expect the prediction targets to be numeric, either binary or categorical. Yes-or-no outcomes are examples of binary labels, whereas multiple classes, such as dog, cat, or bird in object prediction, are examples of categorical labels. For predicting a value rather than a class, termed regression, the target is often scaled, for example normalized to between 0 and 1, to make training easier, although this is not strictly required. More sophisticated AI methods require correspondingly complex labels, but regardless, everything needs to be consistent, so research into an appropriate data structure is important.
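As an illustration, the following sketch uses pandas with made-up filenames to show one simple way of mapping class names to the numeric targets most classifiers expect:

```python
import pandas as pd

# Hypothetical labels table, as might be loaded from a labels.csv file
# with one row per image: filename,label
labels = pd.DataFrame({
    "filename": ["img_001.jpg", "img_002.jpg", "img_003.jpg"],
    "label": ["dog", "cat", "dog"],
})

# Most classifiers expect numeric targets, so map class names to integers.
class_to_id = {name: i for i, name in enumerate(sorted(labels["label"].unique()))}
labels["target"] = labels["label"].map(class_to_id)
print(class_to_id)   # {'cat': 0, 'dog': 1}
print(labels)
```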
Additionally, data points need to be standardized. For images, this means ensuring they are at least the same size and not so big that an AI model can't process them. For text, this could mean truncating or padding phrases so that they're the same length, or tokenizing phrases, which means replacing each word with a number. At this stage, it is also useful to think about alternative options for labels and data so that, should the initial selection of inputs and outputs prove too noisy or fail to yield anything meaningful, the data can be used in other ways.
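Below is a minimal, self-contained sketch of tokenizing and padding text, assuming a toy vocabulary built from the phrases themselves; a real project would typically use a library tokenizer instead:

```python
# Minimal sketch of tokenizing and padding, assuming a toy vocabulary.
phrases = ["great product", "terrible customer service experience"]

# Build a word-to-number vocabulary; 0 is reserved for padding.
vocab = {word: i + 1 for i, word in enumerate(sorted({w for p in phrases for w in p.split()}))}

MAX_LEN = 4  # truncate or pad every phrase to the same length

def encode(phrase):
    ids = [vocab[w] for w in phrase.split()][:MAX_LEN]  # truncate long phrases
    return ids + [0] * (MAX_LEN - len(ids))             # pad short phrases

print([encode(p) for p in phrases])  # every row now has length 4
```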
Finally, datasets should be cleaned, which means ensuring data is correct, consistent, and usable. This could include identifying and correcting corrupt, incomplete, duplicate, or irrelevant parts of the dataset. Data cleaning often takes more time than developing the algorithm itself, so remember the 80-20 rule: roughly 20 percent of the effort will get 80 percent of the data into shape. In the initial stages of the project, it is acceptable to make data easily processable without yet worrying about robust systems that clean every piece of data.
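As a simple example of the kind of cleaning involved, this pandas sketch (with a made-up table and an assumed -999 "unknown" sentinel value) removes duplicates and drops rows missing a required field:

```python
import numpy as np
import pandas as pd

# Hypothetical raw dataset with a duplicate, a missing value, and a sentinel.
df = pd.DataFrame({
    "order_id": [1, 2, 2, 3, 4],
    "amount":   [19.99, 5.50, 5.50, np.nan, -999.0],  # -999 assumed to mean "unknown"
})

df = df.drop_duplicates()                            # remove exact duplicate rows
df["amount"] = df["amount"].replace(-999.0, np.nan)  # treat sentinels as missing
df = df.dropna(subset=["amount"])                    # drop rows missing a required field
print(df)
```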
Step 3: Explore the Data and Confirm Selections
Exploratory data analysis (EDA) aims to identify underlying patterns, spot anomalies, and check assumptions in the dataset. EDA can be completed as part of data preparation; however, it often follows data cleaning. Some of the most important tasks during EDA include analyzing:
- Data missingness: Missing values can impact how your model performs. Depending on the percentage missing in required fields, it might be possible to drop these data points, interpolate the missing values, or abandon that field entirely if not enough is present to be useful.
- Outlier data: It is important to distinguish between outliers that are noise and outliers that are actual events you want to capture. Compare, for example, erroneous values too high or too low to make sense against genuine rarities, such as fraud or machinery malfunctions, where the data might look similar.
- Data label noise: Noise in the labels comes from incorrectly labeled data points and impedes the AI's ability to learn a proper correlation between data and target.
Depending on the volume of data, it might be possible to correct these errors, but sometimes you might need to choose a different field as the prediction target.
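As a brief illustration of the first two checks, the pandas sketch below (on a made-up table) reports per-column missingness and flags outliers using the common 1.5x-interquartile-range rule:

```python
import numpy as np
import pandas as pd

# Hypothetical transactions table used to illustrate the checks above.
df = pd.DataFrame({
    "amount": [10.0, 12.5, 11.0, np.nan, 9.5, 250.0],
    "region": ["EU", "EU", None, "US", "US", "EU"],
})

# Missingness: fraction of missing values per column.
print(df.isna().mean())

# Outliers: flag values outside 1.5x the interquartile range.
q1, q3 = df["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["amount"] < q1 - 1.5 * iqr) | (df["amount"] > q3 + 1.5 * iqr)]
print(outliers)  # 250.0 is flagged; whether it is noise or a real event is your call
```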
Step 4: Research Algorithms and Prepare Resources
Next, it is time to work on the AI itself. Always conduct a preliminary investigation into available algorithms that might be suitable for the task (Figure 3). With the number of resources available, including pre-trained models and research articles detailing algorithms for specific tasks, you can likely build on existing work rather than reinvent the wheel.
Figure 3: Visual guide to types of algorithms/approaches for different machine learning goals. (Source: Towards Data Science)
Also, depending on the amount of data available per class, it is worth deciding between a classical ML algorithm and a DL one. Generally, deep learning works best with over 5,000 examples per class label. With fewer examples per class, the model might simply memorize your training data (overfitting) and fail to predict outcomes correctly from new real-world information. Classical ML predates DL and yields very good results on smaller datasets; however, it requires more handcrafting of the data points, a process typically called feature engineering.
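As a rough illustration of feature engineering on a small dataset, the sketch below handcrafts a few numeric features from raw text (the texts, features, and labels are invented for illustration) and fits a classical scikit-learn model:

```python
from sklearn.ensemble import RandomForestClassifier

# Illustrative feature engineering: instead of feeding raw text to a deep
# network, handcraft a few numeric features per data point.
texts = ["refund me now!!!", "thanks, all good", "site is DOWN again!!", "love the update"]
labels = [1, 0, 1, 0]  # 1 = urgent complaint, 0 = routine (hypothetical)

def features(text):
    return [
        len(text),                               # message length
        text.count("!"),                         # exclamation marks
        sum(w.isupper() for w in text.split()),  # fully capitalized ("shouted") words
    ]

X = [features(t) for t in texts]
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, labels)
print(clf.predict([features("everything is BROKEN!!!")]))  # likely [1]
```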
Depending on the size of your dataset and of each data point (remember, even 300 × 300-pixel images will take a long time to train on!), you should invest in some compute power, either through existing cloud platforms or by buying in-house graphics processing units (GPUs). For first-time projects, the former is normally recommended, since you can terminate access if the project does not work out. Because of the maturity and completeness of AI services available on platforms such as Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure, it is possible to conduct this stage without a dedicated "AI person" or even any coding at all. How well these work for your specific task will influence whether you decide to hire additional expertise for your project.
Step 5: Benchmark, Iterate, and Finalize the Model
Regardless of the model type, input type, or learning task, websites such as Model Zoo, TensorFlow Hub, Google Cloud Platform, and AWS are likely to offer a pre-trained solution that has already learned to perform some prediction from data. Importantly, these models can be reused and fine-tuned to perform similar tasks, a technique called transfer learning. For example, a model trained to recognize everyday objects in images can be adapted to recognize only types of furniture using far less data, even if it never saw those items during its original training. Transfer learning is a very common way to get strong performance while leveraging the work of others and without requiring a wealth of data of your own. Typically, working with these solutions requires basic programming skills in Python, although other languages are used as well.
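As one possible sketch of transfer learning, the Keras example below reuses MobileNetV2 weights pre-trained on ImageNet and trains only a small classification head; the five furniture classes and the commented-out training call are assumptions for illustration:

```python
import tensorflow as tf

# Minimal transfer-learning sketch: reuse a network pre-trained on ImageNet
# and train only a small head for a hypothetical 5-class furniture task.
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet"
)
base.trainable = False  # freeze the pre-trained weights

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(5, activation="softmax"),  # 5 furniture classes (assumed)
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_images, train_labels, epochs=5)  # train on your (smaller) dataset
```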
You can also run simpler machine learning algorithms on a subset of your data features as a crude way to check whether the data contains signal. Once you have established how well out-of-the-box methods work, the iteration process can begin. Whether you decide to improve on what you've tried or attempt a bespoke model instead will depend on the accuracy threshold your startup requires for this task.
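A quick signal check might look like the following sketch, where random stand-in data replaces your own features; cross-validation scores well above chance would suggest the chosen features carry signal:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Crude signal check: fit a simple model on a subset of features and see
# whether it beats chance. Synthetic data here stands in for your own features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))  # e.g., 3 handpicked features
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

scores = cross_val_score(LogisticRegression(), X, y, cv=5)
print(scores.mean())  # well above 0.5 suggests these features carry signal
```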
Step 6: Get Ready to Deliver!
This completes the typical AI project process. To recap, you will need to select inputs, outputs, and performance metrics, then get your data in order and complete an exploratory data analysis to confirm the selections made in the first step. Afterward, the model development and iteration stage begins. Once you are happy with the model's performance and it gives the results that you need, your startup can move it to production and start reaping the benefits of this newfound automation capability.
Production is a process in itself and requires several steps as well. You will need to establish how to bridge the gap between model performance and required accuracy, as discussed in the metrics step above. Other considerations include making the data-cleaning software more robust and deciding on versioning processes or tools for datasets and models. Stay tuned for Part 2 for all the information you'll need to deliver AI in production!