Machine-Learning Software Simplifies Development
(Source: TippaPatt/Shutterstock.com)
Among the different types of AI, machine learning (ML) has gained the most recognition thanks to a growing list of successful applications. As noted [elsewhere in this issue], ML development flips the conventional model of software development: Rather than explicitly writing algorithms to process data, ML developers use data to train algorithms to process it. In production applications of machine learning, developers may spend little time on the algorithms themselves, focusing instead on data engineering and writing code around proven algorithms. In contrast, ML researchers may spend most of their time writing code for new algorithms or optimizing existing ones, using standard data sets to compare improvements over earlier algorithms. This article addresses key development resources needed to program ML applications.
Both production and research efforts can take advantage of a wide array of development resources, ranging from low-level algebraic libraries used to implement new kinds of model algorithms to high-level automated machine-learning environments that accept a set of data and return a trained model. In general, developers of production applications can complete their work with little need to involve themselves with low-level math libraries. Yet, when facing challenges such as developing ML models for Internet of Things (IoT) devices, production developers may still find themselves using some of the same sort of tools and techniques employed by researchers.
Whether focused on research or production, ML projects require implementation of an existing or novel ML algorithm using conventional coding methods. Here, ML developers work with a variety of conventional programming languages including Python, C/C++, Java, JavaScript, R, Go, and other more specialized languages. Among these, Python has emerged as the dominant language for ML development, partly because developers can quickly become productive with the language, but largely because of the wide availability of add-on libraries, or modules. If no suitable module is available, Python provides several methods for developers to call external C/C++ functions or even create Python modules in C. In general, however, ML development with Python builds on a common set of tested, optimized modules.
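As a brief illustration of calling external C code from Python, the standard-library ctypes module can load a shared C library and invoke its functions directly. This is a minimal sketch, assuming a POSIX system where the C math library can be located by name:

```python
import ctypes
import ctypes.util

# Locate and load the standard C math library (platform-specific name resolution).
libm = ctypes.CDLL(ctypes.util.find_library("m"))

# Declare sqrt()'s C signature so ctypes converts arguments and results correctly.
libm.sqrt.argtypes = [ctypes.c_double]
libm.sqrt.restype = ctypes.c_double

print(libm.sqrt(2.0))  # calls the C library's sqrt, not Python's
```

For performance-critical extensions, developers typically move beyond ctypes to compiled extension modules, but the pattern above shows how little ceremony is needed to reach C from Python.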
Building on modules
Developers can quickly get started with ML development by importing a set of Python modules that provide fundamental capabilities required equally for developing production ML models or for creating new ML algorithms. Among these, some of the more commonly used modules include:
- NumPy, which provides array manipulation and algebraic functions commonly required in ML development;
- SciPy, which provides a variety of scientific computing functions;
- Pandas, which provides high-level data structures and access to different file formats and databases;
- Matplotlib, which provides functions to visualize data and results.
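A short sketch shows how these modules combine in a typical preprocessing step: NumPy generates and manipulates the raw arrays, while Pandas adds labeled, tabular structure. The feature names here are arbitrary placeholders:

```python
import numpy as np
import pandas as pd

# Synthetic feature matrix: 100 samples, 3 features.
rng = np.random.default_rng(seed=0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))

# Pandas wraps the array in a labeled, tabular structure.
df = pd.DataFrame(X, columns=["f1", "f2", "f3"])

# Vectorized math: standardize each feature to zero mean, unit variance,
# a common step before training many ML models.
standardized = (df - df.mean()) / df.std(ddof=0)
print(standardized.mean().round(6))  # each column is approximately 0
```

The same Matplotlib calls used for plotting results would accept either the NumPy array or the Pandas columns directly.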
In principle, a developer could proceed with development using just these libraries, implementing an ML algorithm’s underlying math operations using NumPy algebraic functions and visualizing results with Matplotlib. In practice, however, both researchers and production developers combine these libraries with several others. ML scientists exploring new algorithms might use the SymPy symbolic computing module to evaluate their equations or implement compute-intensive core functions in C using low-level routines from a basic linear algebra subprograms (BLAS) library such as OpenBLAS.
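For instance, a researcher might use SymPy to verify the gradient of an activation function before hand-coding it. This sketch symbolically differentiates the sigmoid and confirms the well-known identity used in backpropagation:

```python
import sympy as sp

x = sp.symbols("x")
sigmoid = 1 / (1 + sp.exp(-x))

# Symbolically differentiate the activation function.
grad = sp.diff(sigmoid, x)

# Verify the classic identity: sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)).
difference = sp.simplify(grad - sigmoid * (1 - sigmoid))
print(difference)  # 0
```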
As discussed below, production developers may find themselves turning to C/C++ libraries for performance reasons. In the early stages of development, however, they are more likely to use Python modules that support higher-level abstractions with intuitive functions designed specifically for implementing ML applications. Although this is perhaps the largest group of software resources for ML programming, some of the more commonly used machine-learning packages include:
- Scikit-learn, which natively supports perhaps the widest range of ML algorithms for supervised learning and unsupervised learning (Figure 1) with an accessible approach considered particularly effective for those new to ML development;
- Keras, which supports efficient implementation of deep neural network (DNN) models including convolutional neural networks (CNNs) through a comprehensive set of functions required to implement the various layers of a model;
- TensorFlow, which provides functions for model implementation as well as broader, end-to-end support for ML applications;
- PyTorch, which also provides both model implementation and end-to-end development capabilities.
Figure 1: Scikit-learn simplifies development of machine-learning programs using a broad array of algorithms for supervised and unsupervised learning. (Source: Wikipedia)
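Scikit-learn's accessible approach can be seen in how few lines a complete supervised-learning workflow requires. This sketch trains a random-forest classifier on the library's bundled Iris dataset; the particular estimator and parameters are illustrative choices, not a recommendation:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load a small, built-in labeled dataset and hold out a test split.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit a supervised model and evaluate it on unseen data.
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)
acc = accuracy_score(y_test, clf.predict(X_test))
print(f"test accuracy: {acc:.2f}")
```

Swapping in a different algorithm from Figure 1 typically means changing only the estimator class, since scikit-learn estimators share the same fit/predict interface.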
Each of these libraries abstracts complex operations into a series of intuitive function calls. To build a DNN model, for example, developers typically build up the model layer by layer using built-in functions that implement each layer's functionality. After the model is configured, other function calls invoke training with the hyperparameters needed to control the training process itself.
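The layer-by-layer pattern these libraries abstract can be sketched in plain NumPy. The `dense` helper below is a hypothetical stand-in for what a framework's fully connected layer computes (an affine transform plus a nonlinearity), not any library's actual API:

```python
import numpy as np

def dense(x, W, b, activation=np.tanh):
    """One fully connected layer: affine transform followed by a nonlinearity."""
    return activation(x @ W + b)

rng = np.random.default_rng(seed=0)
x = rng.normal(size=(1, 4))  # one input sample with 4 features

# Untrained weights, for illustration only.
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)  # hidden layer: 4 -> 8 units
W2, b2 = rng.normal(size=(8, 2)), np.zeros(2)  # output layer: 8 -> 2 units

# "Layer by layer": each call feeds the previous layer's output forward.
h = dense(x, W1, b1)
out = dense(h, W2, b2, activation=lambda z: z)  # linear output layer
print(out.shape)  # (1, 2)
```

A framework such as Keras wraps each of these steps, along with weight initialization, gradients, and training loops, behind a single layer object.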
As suggested earlier, some Python libraries, including TensorFlow and PyTorch, are supported by comprehensive ecosystems, so the core library is part of a more substantial framework for ML development. Although many such frameworks have emerged, TensorFlow and PyTorch have gained dominance among production developers and researchers, respectively. Researchers have generally preferred PyTorch for its interactivity and flexibility; industry developers have generally preferred TensorFlow for its performance efficiency. Still, each framework continues to evolve, addressing any shortcomings with capabilities that drive them closer to parity.
An even higher-level class of ML development resources continues to emerge from commercial cloud service providers such as Amazon Web Services (AWS), Google, IBM, and Microsoft as well as specialty cloud platform providers. Intended to provide turnkey machine-learning solutions, services such as AWS SageMaker, Google Fluid Annotation, IBM Cloud Annotations, and Microsoft Automated ML generate models from datasets for users with neither the time nor the expertise to create ML models on their own. Typically, users can pass the results to other tools in each provider's environment to create optimized inference models for deployment.
Optimization and deployment
Performance concerns are endemic to ML development projects. While ML researchers continue to explore methods to speed lengthy training cycles, both researchers and production developers typically take advantage of the performance boost provided by graphics processing units (GPUs) and GPU-compatible libraries. For example, the GPU-enabled CuPy package can speed many core ML operations by well over 100x compared to the API-compatible but non-GPU-enabled NumPy package.
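Because CuPy mirrors the NumPy API, array code can often be written once and pointed at either backend. This is a minimal sketch of that pattern, falling back to NumPy when CuPy or a CUDA device is unavailable:

```python
# CuPy mirrors the NumPy API, so the same code can target either backend.
try:
    import cupy as xp  # GPU arrays, if CuPy and a CUDA device are available
except ImportError:
    import numpy as xp  # CPU fallback with the same API

a = xp.arange(1_000_000, dtype=xp.float32)
result = xp.sqrt(a).sum()  # identical call on GPU or CPU

# CuPy results live in GPU memory; float() copies the scalar back to the host.
print(float(result))
```

Real speedups depend on array sizes and operation mix; small arrays may run faster on the CPU because of host-to-device transfer overhead.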
For an overall gain in performance, developers can use the Numba compiler, which converts Python to machine code with optimizations including GPU support. TensorFlow’s XLA (Accelerated Linear Algebra) compiler can improve model speed and size with no changes in source code.
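Numba typically needs only a decorator on a numeric function to compile it. This sketch JIT-compiles a hypothetical mean-squared-error loop; the try/except fallback lets the code run unchanged where Numba is not installed:

```python
import numpy as np

# Fall back to a no-op decorator if Numba is not installed, so the
# sketch still runs (just without compilation).
try:
    from numba import njit
except ImportError:
    def njit(func):
        return func

@njit
def mse(pred, target):
    # An explicit loop: slow in pure Python, fast once JIT-compiled.
    total = 0.0
    for i in range(pred.shape[0]):
        diff = pred[i] - target[i]
        total += diff * diff
    return total / pred.shape[0]

pred = np.array([1.0, 2.0, 3.0])
target = np.array([1.0, 2.0, 5.0])
print(mse(pred, target))  # (0 + 0 + 4) / 3 = 1.333...
```

Numba compiles on first call, so the decorator pays off for functions invoked repeatedly, as in training loops.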
Alternatively, developers can use different implementations of Python itself. Cython compiles Python-compatible Cython code to C, yielding faster execution than is possible with the standard Python interpreter. Intel's distribution for Python takes full advantage of performance enhancements available in Intel architectures.
For deployment on resource-constrained IoT devices, developers can take advantage of resource-optimized model architectures and processor-optimized libraries. For example, Google’s MobileNet CNN architecture and its more recent EfficientNet CNN architecture achieve high accuracy with smaller, faster models. To speed execution of the model itself, developers can use libraries such as Intel’s oneAPI Deep Neural Network Library or Arm’s NN (neural network) Software Developer Kit (SDK) for Cortex-A-based processors or its Cortex Microcontroller Software Interface Standard Neural Network (CMSIS-NN) library for Cortex-M-based processors.
Development environments
The previous discussion describes only a bare-bones set of Python modules among the thousands available in the Python Package Index repository just for machine learning. A typical development project will of course build on many packages, each with its own dependencies. Developers typically use virtual environments to isolate a project's set of development packages from the different versions of common packages used in other projects or even in the operating environment. The Anaconda platform provides an even simpler approach, combining package management with easy deployment of virtual environments.
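Creating such an isolated environment takes only a few commands with Python's standard venv module; the environment and package names below are illustrative:

```shell
# Create an isolated virtual environment for this project (stdlib venv module).
python3 -m venv mlproject-env

# Activate it; subsequent pip installs stay inside the environment.
source mlproject-env/bin/activate

# Confirm the interpreter now resolves inside the project environment.
which python

# Project packages are then installed per environment, e.g.:
#   pip install numpy pandas scikit-learn
```

Anaconda's rough equivalent is `conda create -n mlproject python numpy pandas scikit-learn` followed by `conda activate mlproject`, with conda also managing non-Python dependencies.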
For both experienced ML developers and those just venturing into ML development, the combination of Anaconda and a popular development tool, JupyterLab, largely eliminates the setup and configuration tasks typically required for a development environment. JupyterLab, like its predecessor Jupyter Notebook, lets users build notebooks that combine descriptive text, runnable code, and results in a single document. Jupyter notebooks have emerged as a common medium for exchanging ideas, specific algorithms, and applications among developers, researchers, and even participants in ML competitions and courses on Kaggle and other sites.
Conclusion
ML development encompasses a wide set of activities focused on both preparing data and writing code to implement models with existing or new algorithms. To implement models, developers need only a few basic tools to get started, but generating optimized inference models may require them to reach deeper into the rich set of tools available for creating effective ML-based applications.