Machine learning for Java developers, Part 2: Deploying your machine learning model


My previous tutorial, “Machine Learning for Java developers,” introduced setting up a machine learning algorithm and developing a prediction function in Java. I demonstrated the inner workings of a machine learning algorithm and walked through the process of developing and training a machine learning model. This tutorial picks up where that one left off. I’ll show you how to set up a machine learning data pipeline, introduce a step-by-step process for taking your machine learning model from development into production, and briefly discuss technologies for deploying a trained machine learning model in a Java-based production environment.

Requirements and what to expect from this tutorial

Deploying a machine learning model is a separate endeavor from developing one, often implemented by a different team. Developing a machine learning model requires understanding the underlying data and having a good grasp of mathematics and statistics. Deploying a machine learning model in production is typically a job for someone with software engineering and operations experience.

This tutorial shows you how to make a machine learning model available in a highly scalable production environment. I assume you have some development experience and a basic understanding of machine learning models and algorithms; otherwise, you may want to start by reading “Machine learning for Java developers, Part 1.”

I’ll start with a brief refresher on supervised learning, including an example application that I’ll use to demonstrate how to train, deploy, and process a machine learning model for use in production.

Supervised machine learning: A refresher

I’ll use a simple, supervised machine learning model to illustrate the machine learning deployment process. The example machine learning model shown in Figure 1 can be used to predict the expected sale price of a house.

jw grothml2 fig1 Gregor Roth

Figure 1. Trained supervised machine learning model for sale price prediction

Recall that a machine learning model is a function with internal, learnable parameters that maps inputs to outputs. In the above diagram, a linear regression function, hθ(x), is used to predict the sale price for a house based on a variety of features. The x variables of the function represent the input data. The θ (theta) variables represent the internal, learnable model parameters.

To predict the sale price of a house, you must first create an input data array of x variables. This array contains features such as the size of the lot or the number of rooms in a house. This array is called the feature vector.

Because most machine learning functions require a numerical representation of features, you will likely have to perform some data transformations in order to build a feature vector. For instance, a feature specifying the location of the garage could include labels such as “attached to home” or “built-in,” which have to be mapped to numerical values. When you execute the house-price prediction, the machine learning function will be applied with this input feature vector as well as the internal, trained model parameters. The function’s output is the estimated house price. This output is called a label.

Training the model

Internal, learnable model parameters (θ) are the part of the model that is learned from training data. The learnable parameters will be set during the training process. A supervised machine learning model like the one shown below has to be trained in order to make useful predictions.

Figure 2. An untrained supervised machine learning model

Typically, the training process starts with an untrained model where all the learnable parameters are set to an initial value such as zero. The model consumes data about various house features along with real house prices. Gradually, it identifies correlations between house features and house prices, as well as the weight of these relationships. The model adjusts its internal, learnable model parameters and uses them to make predictions.

Figure 3. A trained supervised machine learning model

After the training process, the model will be able to estimate the sale price of a house by assessing its features.
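The training loop described above can be sketched in plain Java. This is a minimal illustration under stated assumptions, not the HousePriceModel implementation: the text does not say which learning algorithm the model uses, so the sketch assumes batch gradient descent on a squared-error cost. The class name GradientDescentFit, the learning rate, and the sample data are hypothetical.

```java
import java.util.Arrays;

final class GradientDescentFit {

    // Batch gradient descent for h(x) = theta_0 * x_0 + ... + theta_n * x_n.
    // Each row of x is one feature vector; y holds the known "answers".
    static double[] fit(double[][] x, double[] y, double alpha, int iterations) {
        int m = x.length;       // number of training records
        int n = x[0].length;    // number of features (including x_0 = 1)
        double[] theta = new double[n];   // all learnable parameters start at zero

        for (int it = 0; it < iterations; it++) {
            double[] gradient = new double[n];
            for (int i = 0; i < m; i++) {
                double prediction = 0.0;
                for (int j = 0; j < n; j++) {
                    prediction += theta[j] * x[i][j];
                }
                double error = prediction - y[i];
                for (int j = 0; j < n; j++) {
                    gradient[j] += error * x[i][j];
                }
            }
            // move every parameter a small step against the averaged gradient
            for (int j = 0; j < n; j++) {
                theta[j] -= alpha * gradient[j] / m;
            }
        }
        return theta;
    }

    public static void main(String[] args) {
        // hypothetical toy data generated from y = 1 + 2 * x_1
        double[][] x = {{1, 0}, {1, 1}, {1, 2}, {1, 3}};
        double[] y = {1, 3, 5, 7};
        // the learned parameters approach [1.0, 2.0]
        System.out.println(Arrays.toString(fit(x, y, 0.1, 5000)));
    }
}
```

With real housing data the features would normally be scaled first; otherwise a fixed learning rate like the one above may not converge.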

Machine learning algorithms in Java code

The HousePriceModel provides two methods. One method implements the learning algorithm to train (or fit) the model. The other method is used for predictions.

Figure 4. Two methods in a machine learning model

The fit() method

The fit() method is used to train the model. It consumes house features as well as house-sale prices as input parameters but returns nothing. The fit() method requires the correct “answer” to be able to adjust its internal model parameters. Using housing listings paired with sale prices, the learning algorithm looks for patterns in the training data. From these, it produces model parameters that generalize from those patterns. As the input data becomes more accurate, the model’s internal parameters will be adjusted.

Listing 1. The fit() method is used to train a machine learning model

// load training data
// ...

// e.g. [{MSSubClass=60.0, LotFrontage=65.0, ...}, {MSSubClass=20.0, ...}]
List<Map<String, Double>> houses = ...;

// e.g. [208500.0, 181500.0, 223500.0, 140000.0, 250000.0, ...]
List<Double> prices = ...;

// create and train the model
var model = new HousePriceModel();
model.fit(houses, prices);

Note that the house features are typed as double in the code. This is because the machine learning algorithm used to implement the fit() method requires numbers as input. All house features must be represented numerically so that they can be used as x parameters in the linear regression formula, as shown here:

hθ(x) = θ0 * x0 + … + θn * xn

The trained house price prediction model could look like what you see below:

price = -490130.8527 * 1 + -241.0244 * MSSubClass + -143.716  * LotFrontage +  … * …

Here, the input house features such as MSSubClass or LotFrontage are represented as x variables. The learnable model parameters (θ) are set to values like -490130.8527 or -241.0244, which were learned during the training process.
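The linear regression formula can be written out in a few lines of Java. This is a sketch rather than the HousePriceModel code; the class name LinearHypothesis and the sample numbers are hypothetical. It assumes that x0 is fixed at 1, so that θ0 acts as the intercept, which matches the "* 1" term in the trained formula.

```java
final class LinearHypothesis {

    // Computes h_theta(x) = theta_0 * x_0 + ... + theta_n * x_n.
    static double predict(double[] theta, double[] x) {
        if (theta.length != x.length) {
            throw new IllegalArgumentException("theta and x must have the same length");
        }
        double sum = 0.0;
        for (int i = 0; i < theta.length; i++) {
            sum += theta[i] * x[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        // hypothetical parameters and features, chosen to keep the arithmetic easy to follow
        double[] theta = {100.0, 2.0, 3.0};
        double[] x = {1.0, 10.0, 20.0};   // x_0 = 1 is the intercept term
        System.out.println(predict(theta, x));   // prints 180.0
    }
}
```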

This example uses a simple machine learning algorithm, which requires just a few model parameters. A more complex algorithm, such as for a deep neural network, could require millions of model parameters; that is one of the main reasons why the process of training such algorithms requires high computation power.

The predict() method

Once you have finished training the model, you can use the predict() method to determine the estimated sale price of a house. This method consumes data about house features and produces an estimated sale price. In practice, an agent of a real estate company could enter features such as the size of a lot (lot-area), the number of rooms, or the overall house quality in order to receive an estimated sale price for a given house.

Transforming non-numeric values

You will often be faced with datasets that contain non-numeric values. For instance, the Ames Housing dataset used for the Kaggle House Prices competition includes both numeric and textual listings of house features:

Figure 5. A sample from the Kaggle House Prices dataset

To make things more complicated, the Kaggle dataset also includes empty values (marked NA), which cannot be processed by the linear regression algorithm shown in Listing 1.

Real-world data records are often incomplete or inconsistent, may lack desired behaviors or trends, and may contain errors. This typically occurs when the input data has been joined from different sources. Input data must be converted into a clean dataset before being fed into a model.

In the sample above, you would need to replace the missing (NA) numeric LotFrontage value. You would also need to replace textual values such as MSZoning “RL” or “RM” with numeric values. These transformations are necessary to convert the raw data into a syntactically correct format that can be processed by your model.
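One way to replace a missing numeric value is sketched below. The text does not say which replacement strategy is used, so this example assumes mean imputation: a missing LotFrontage value is replaced by the mean of the values that are present. The class name MeanImputer and the sample records are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class MeanImputer {

    // Returns copies of the records in which a missing value of the given feature
    // (null or the string "NA") is replaced by the mean of the values that are present.
    static List<Map<String, Object>> fillMissing(List<Map<String, Object>> houses, String feature) {
        double sum = 0.0;
        int count = 0;
        for (Map<String, Object> house : houses) {
            Object value = house.get(feature);
            if (value instanceof Number) {
                sum += ((Number) value).doubleValue();
                count++;
            }
        }
        double mean = (count == 0) ? 0.0 : sum / count;

        List<Map<String, Object>> result = new ArrayList<>();
        for (Map<String, Object> house : houses) {
            Map<String, Object> copy = new HashMap<>(house);
            Object value = copy.get(feature);
            if (value == null || "NA".equals(value)) {
                copy.put(feature, mean);
            }
            result.add(copy);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> houses = new ArrayList<>();
        houses.add(new HashMap<String, Object>(Map.of("LotFrontage", 60.0)));
        houses.add(new HashMap<String, Object>(Map.of("LotFrontage", "NA")));
        houses.add(new HashMap<String, Object>(Map.of("LotFrontage", 80.0)));
        System.out.println(fillMissing(houses, "LotFrontage").get(1));   // prints {LotFrontage=70.0}
    }
}
```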

Once you’ve converted your data to a generally readable format, you may still need to make additional changes to improve the quality of input data. For instance, you might remove values not following the general trend of the data, or place infrequently occurring categories into a single umbrella category.
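Placing infrequent categories into a single umbrella category could look like the sketch below. The class name RareCategoryGrouper, the threshold, and the umbrella label "Other" are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class RareCategoryGrouper {

    // Replaces every category that occurs fewer than minCount times with the umbrella label.
    static List<String> group(List<String> categories, int minCount, String umbrella) {
        Map<String, Integer> counts = new HashMap<>();
        for (String category : categories) {
            counts.merge(category, 1, Integer::sum);
        }
        List<String> result = new ArrayList<>();
        for (String category : categories) {
            result.add(counts.get(category) >= minCount ? category : umbrella);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> zoning = List.of("RL", "RL", "RM", "RL", "FV", "RM", "RH");
        System.out.println(group(zoning, 2, "Other"));
        // prints [RL, RL, RM, RL, Other, RM, Other]
    }
}
```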

How to build your machine learning data pipeline

Often, the data preparation or preprocessing steps are arranged as a pipeline. For instance, the simplified house prediction pipeline below arranges a set of preprocessing transformer components with a final house prediction model.

Figure 6. Example steps of data pre-processing

The transformer components clean the raw data and transform it into a format the model is able to consume. The data becomes more suitable for the model after each stage in the transformation.

The pipeline pattern allows you to organize your transformation code so that each transformer component has a single responsibility. For instance, the CategoryToNumberTransformer class below replaces all textual feature values with numeric ones. Because this transformer implementation does not handle null values, the transformer has to be processed after applying an AddMissingValuesTransformer. Internally, the CategoryToNumberTransformer holds a map using textual feature values as the key, and unique, generated numbers as values. The mapping of the MSZoning feature might look as follows:

FV=1, RH=2, RM=3, C=5, …, RL=8, «default»=-1

When you call the transform() method, textual values will be detected and transformed into numbers using the mapping collection, as shown in Listing 2.

Listing 2. Replace textual feature values with numeric ones

public class CategoryToNumberTransformer implements Transformer<Object, Double, Double> {
   private final CategoryToNumberResolver categoryToNumber = new CategoryToNumberResolver();

   // replaces the textual feature values of every house record
   public List<Map<String, Double>> transform(List<Map<String, Object>> houses) {
      return houses.stream().map(this::transform).collect(Collectors.toList());
   }

   private Map<String, Double> transform(Map<String, Object> house) {
      return house.entrySet()
                  .stream()
                  .collect(Collectors.toMap(feature -> feature.getKey(),
                                            feature -> (feature.getValue() instanceof String)
                                                          ? categoryToNumber.map(feature)
                                                          : (Double) feature.getValue()));
   }

   // builds the category-to-number map from all textual values in the training data
   public void fit(List<Map<String, Object>> houses, List<Double> prices) {
       houses.forEach(house -> house.entrySet()
                                    .stream()
                                    .filter(feature -> feature.getValue() instanceof String)
                                    .forEach(categoryToNumber::add));
   }

   private static final class CategoryToNumberResolver {
      private final Map<String, Double> categoryToNumberMapping = Maps.newHashMap();

      void add(Map.Entry<String, Object> feature) {
         // ..
      }

      Double map(Map.Entry<String, Object> feature) {
         // ..
      }
   }
}

There are two ways to create the internal category-to-number map. To do it manually, you would add all possible entries to the map at development time. To do it dynamically, as shown above, you would scan all the available records at training time. In this example, the fit() training method dynamically builds the category-to-number map. First it extracts the set of all textual values, then it uses that set to build a map that associates each unique textual value with a newly generated number.
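The bodies of add() and map() are elided in Listing 2. The sketch below shows one way such a dynamically built map could behave. It is an assumption, not the original resolver: numbers are assigned sequentially in the order categories are first seen, and unseen categories map to -1, following the «default»=-1 entry in the mapping above. The class name CategoryIndex is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class CategoryIndex {
    private static final double DEFAULT = -1.0;   // value for categories not seen during training
    private final Map<String, Double> categoryToNumber = new LinkedHashMap<>();

    // Training time: register a category unless it is already known.
    void add(String featureName, String category) {
        String key = featureName + "=" + category;
        categoryToNumber.putIfAbsent(key, (double) (categoryToNumber.size() + 1));
    }

    // Prediction time: look up the number, falling back to the default.
    double map(String featureName, String category) {
        return categoryToNumber.getOrDefault(featureName + "=" + category, DEFAULT);
    }

    public static void main(String[] args) {
        CategoryIndex index = new CategoryIndex();
        for (String zoning : new String[] {"FV", "RH", "RM", "FV"}) {
            index.add("MSZoning", zoning);
        }
        System.out.println(index.map("MSZoning", "RM"));   // prints 3.0
        System.out.println(index.map("MSZoning", "A"));    // prints -1.0
    }
}
```

Because the numbers depend on the order and content of the training records, retraining on a different dataset can yield a different mapping even though the code is unchanged.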

Configuring the machine learning data pipeline

In most cases, preprocessing logic is specific to the model, so updating the logic of the preprocessing components requires re-training the model. For this reason, the preprocessing code and the model code are often packaged together, as shown below. Here, a generic Pipeline class is used to arrange the transformers together with a final house prediction model.

Listing 3. A generic Pipeline class

var pipeline = Pipeline.add(new DropNumericOutliners("LotArea", 10))
                          .add(new AddMissingValuesTransformer())
                          .add(new CategoryToNumberTransformer())
                          .add(new AddComputedFeatureTransformer())
                          .add(new DropUnnecessaryFeatureTransformer("YrSold", "YearRemodAdd"))
                          .add(new HousePriceModel());
pipeline.fit(houses, prices);
   // …

Some machine learning libraries provide pipeline abstractions similar to the example above. Others provide configurable and customizable preprocessing components only.
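To make the pipeline pattern concrete, here is a deliberately small sketch of how a pipeline could chain its steps. It is not the Pipeline class used in Listing 3: the names MiniPipeline, Step, DropAbove, and SubtractMean are hypothetical, and the data is reduced to a list of numbers. The point it illustrates is that each step is trained on the output of the steps before it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

final class MiniPipeline {

    interface Step {
        void fit(List<Double> values);                 // learn internal state, if any
        List<Double> transform(List<Double> values);   // apply the step
    }

    // Stateless step: drops values above a fixed threshold.
    static final class DropAbove implements Step {
        private final double threshold;
        DropAbove(double threshold) { this.threshold = threshold; }
        public void fit(List<Double> values) { }
        public List<Double> transform(List<Double> values) {
            return values.stream().filter(v -> v <= threshold).collect(Collectors.toList());
        }
    }

    // Stateful step: learns the mean during fit() and subtracts it in transform().
    static final class SubtractMean implements Step {
        private double mean;
        public void fit(List<Double> values) {
            mean = values.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
        }
        public List<Double> transform(List<Double> values) {
            return values.stream().map(v -> v - mean).collect(Collectors.toList());
        }
    }

    private final List<Step> steps = new ArrayList<>();

    MiniPipeline add(Step step) {
        steps.add(step);
        return this;
    }

    // Trains each step on the data as transformed by all preceding steps.
    void fit(List<Double> values) {
        List<Double> current = values;
        for (Step step : steps) {
            step.fit(current);
            current = step.transform(current);
        }
    }

    List<Double> transform(List<Double> values) {
        List<Double> current = values;
        for (Step step : steps) {
            current = step.transform(current);
        }
        return current;
    }

    public static void main(String[] args) {
        MiniPipeline pipeline = new MiniPipeline().add(new DropAbove(10.0)).add(new SubtractMean());
        pipeline.fit(List.of(1.0, 2.0, 3.0, 100.0));   // the outlier 100.0 is dropped before the mean is learned
        System.out.println(pipeline.transform(List.of(4.0)));   // prints [2.0]
    }
}
```

The order of the steps matters here: if SubtractMean were trained before the outlier was dropped, it would learn a mean of 26.5 instead of 2.0.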

Training the machine learning data pipeline

Calling the pipeline fit() method as shown above trains all of the included transformers and the final model. Typically, the required raw training dataset is provided by a data acquisition component. This component collects data from a variety of sources and prepares the data for ingestion into the machine learning pipeline. For instance, the Housedata Ingestion component shown below encapsulates data sourcing and produces raw house and price data records, which are fed into the estimation pipeline.

Figure 7. A flow diagram of the machine learning data pipeline

Internally, the Housedata Ingestion component may access a database storing sales transactions as well as other data sources such as a database storing geographical area data. Using an ingestion component separates the machine learning pipeline from the data source, so that changes in the data source will not impact the pipeline.

During the development process, different versions or variants of the pipeline may be trained and evaluated. For example, you might apply different thresholds to gradually weed outliers from the data. Working with machine learning data pipelines is a highly iterative process; it is common to test many pipeline versions or variants during development, eventually selecting the most consistently accurate pipeline for production usage.

Machine learning models in production

When you deploy the selected trained pipeline in production, you will be faced with new requirements. In order to manage production requirements such as reliability or maintainability, the packaging and deploying processes have to be reproducible. You should be able to re-package or re-deploy the pipeline with no change to its behavior, even if the training data changes. You also have to be able to test or to roll back to older trained pipeline versions in case of erroneous system behavior in production.

Ensure your pipeline is reproducible

Ensuring that your machine learning pipeline is reproducible is easier said than done. Over time, your training dataset will change. It may increase in size as it gains more labeled data records, or it may decrease as data becomes unavailable due to external factors. Even if you use the exact same pipeline code, changes to your training dataset will produce different settings of the internal learnable pipeline parameters.

As an example, say you add a house record with a new MSZoning category, “A,” which was not in the older dataset. In this case, although the transformation code is untouched, the internal CategoryToNumberTransformer map will include an additional entry for this new, unseen category:

FV=1, RH=2, RM=3, C=5, …, RL=8, A=9, «default»=-1

The newly trained pipeline’s behavior differs from its previous iteration.

Use version control

To support reproducibility, pipeline code as well as trained instances must be under strict version control. As in a traditional software development process, the data ingestion component should be versioned, released, and uploaded into a repository along with the untrained and trained pipeline components. Typically, you would use a build system such as Maven. In this case, we could store the results of the build-and-release process (component binaries such as ingest-housedata-2.2.3.jar and pipeline-estimate-houseprice-1.0.3.jar) in a repository like JFrog’s Artifactory or Sonatype’s Nexus repository.

CI/CD in the machine learning data pipeline

Machine learning data pipelines and CI/CD pipelines are not the same. A machine learning data pipeline controls the data flow to transform input data into output data or predictions. A CI/CD pipeline is used to build, integrate, deliver, and deploy software artifacts in different stages. The diagram below illustrates the difference between the two types of pipeline.

Figure 8. Data pipeline vs. CI/CD pipeline

If we wanted to integrate CI/CD into the machine learning data pipeline, we could build our JAR file artifacts during the CI/CD development stage. We could also extend the CI/CD pipeline to trigger the training process and provide the trained, serialized pipeline, which could then be deployed into the production environment.

As shown in Listing 4, the appropriate version of the ingestion and pipeline components would be loaded from the repository to train a production-ready pipeline. In this example, the downloaded executable JAR files contain the compiled Java classes, as well as a main class. When you execute ingestion.jar, the Ames Housing dataset will be loaded internally and the raw house and price record files will be produced.

Listing 4. A script to train and upload a machine learning data pipeline in a CI/CD context


# define the pipeline version to train
# (groupId, ingest_app_uri, and pipeline_app_uri must also be set for your repository)
artifactId=pipeline-estimate-houseprice
version=1.0.3

echo task 1: copying ingestion jar to local dir
curl -s -L "$ingest_app_uri" --output ingestion.jar

echo task 2: copying pipeline jar to local dir
curl -s -L "$pipeline_app_uri" --output pipeline.jar

echo task 3: performing ingestion jar to produce houses.json and prices.json. Internally, the Ames Housing dataset will be fetched
java -jar ingestion.jar train.csv houses.json prices.json

echo task 4: performing pipeline jar to create and train a pipeline consuming houses.json and prices.json
version_with_timestamp=$version-$(date +%s)
pipeline_instance=$artifactId-$version_with_timestamp.ser
java -jar pipeline.jar houses.json prices.json $pipeline_instance

echo task 5: uploading trained pipeline
echo curl -X PUT --data-binary "@$pipeline_instance" "${groupId//.//}/${artifactId//.//}/$version_with_timestamp/$trained"

Note that most shops use a platform like GitLab CI, Travis CI, CircleCI, Jenkins, or GoCD for CI/CD. All of these tools use a custom DSL (domain-specific language) to define CI/CD tasks. To keep the code examples simple, I’ve used bash scripts instead of tool-specific CI/CD task definitions for the code in Listing 4. When using a CI/CD platform, you would typically embed a stripped-down version of the example code within the CI/CD tasks.

After performing the ingest step shown in Listing 4 (task 3), the raw dataset files are used by the executable pipeline.jar to produce a trained pipeline instance. Internally, the pipeline’s HousePricePipelineBuilder main class creates a new instance of the estimation pipeline. The newly created instance will be trained and serialized into an output file like pipeline-estimate-houseprice-1.0.3-1568611516.ser. This file contains the serialized state of the pipeline instance as a byte sequence and the names of the used Java classes.

To support reproducibility, the output filename includes the component version ID and a training timestamp. A new timestamp is generated for each training run. As a last step, the serialized trained pipeline file will be uploaded into a model repository.

Listing 5. Helper class to train a house price prediction pipeline

public class HousePricePipelineBuilder {

   public static void main(String[] args) throws IOException {
      new HousePricePipelineBuilder().train(args[0], args[1], args[2]);
   }

   public void train(String housesFilename, String pricesFilename, String instanceFilename) throws IOException {
      var houses = List.of(new ObjectMapper().readValue(new File(housesFilename), Map[].class));
      var prices = List.of(new ObjectMapper().readValue(new File(pricesFilename), Double[].class));

      var pipeline = newPipeline();
      pipeline.fit(houses, prices);
      // serialize the trained pipeline; save() is an assumed method name, the counterpart of Pipeline.load() in Listing 6
      pipeline.save(new File(instanceFilename));
   }

   public Pipeline<Object, Double> newPipeline() {
      return Pipeline.add(new DropNumericOutliners("LotArea", 10))
                     .add(new AddMissingValuesTransformer())
                     .add(new CategoryToNumberTransformer())
                     .add(new AddComputedFeatureTransformer())
                     .add(new DropUnnecessaryFeatureTransformer("YrSold", "YearRemodAdd"))
                     .add(new HousePriceModel());
   }
}
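The naming and serialization step described above can be sketched with standard Java serialization. This is an illustration under assumptions, not the Pipeline class's own persistence code: the class name TrainedInstanceStore is hypothetical, and a small map stands in for the trained pipeline state.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

final class TrainedInstanceStore {

    // File name pattern from the text: <artifactId>-<version>-<training timestamp>.ser
    static String instanceFilename(String artifactId, String version, long epochSeconds) {
        return artifactId + "-" + version + "-" + epochSeconds + ".ser";
    }

    // Writes the trained state as a byte sequence, including the names of the classes used.
    static void save(Serializable trainedState, File file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(trainedState);
        }
    }

    // Reading the file back requires the same classes on the classpath.
    static Object load(File file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        // stand-in for a trained pipeline: a learned category-to-number entry
        HashMap<String, Double> state = new HashMap<>(Map.of("RL", 8.0));
        String name = instanceFilename("pipeline-estimate-houseprice", "1.0.3",
                                       Instant.now().getEpochSecond());
        File file = new File(System.getProperty("java.io.tmpdir"), name);
        save(state, file);
        System.out.println(name + " -> " + load(file));
        file.delete();
    }
}
```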

Deployment: REST and Docker in the machine learning data pipeline

In order to make your newly trained pipeline instance available to end users and other systems, you will have to make it available in a production environment. How you integrate the trained pipeline into the production environment will strongly depend on your target infrastructure, which could be a datacenter, an IoT device, a mobile device, etc.

As one example, integrating the pipeline into a classic batch-oriented big data production environment requires providing a batch interface to train machine learning models and perform predictions. In a batch-oriented approach you would process your data in bulk using shared databases or filesystems like Hadoop.

In most cases, a pipeline can be trained offline, so a batch-oriented approach is often used for this purpose. For example, I used the batch-oriented approach for the HousePricePipelineBuilder, where input files are read from the filesystem. The downside of this approach is the time delay. In batch processing, data records are collected over a period of time and then processed together, all at once.

In contrast to training, processing a trained pipeline in production often requires a more real-time approach. Processing incoming data as it arrives means that predictions will be available immediately, without delay. To support real-time requirements, you could extend a big data infrastructure like Hadoop with a messaging or streaming platform like Apache Kafka. In this case, the pipeline would have to be connected to the streaming system and listen for incoming records.

Machine learning with REST

An alternative to streaming or messaging would be to use an RPC-based infrastructure. Instead of consuming incoming records from a stream, in this case the pipeline listens for incoming remote calls such as HTTP requests. The machine learning pipeline will be accessed via a REST interface, as shown in the example below. Here, a minimal REST service handles incoming HTTP requests and uses the trained pipeline instance to perform predictions and send the HTTP response message. The trained pipeline instance will be loaded during the REST service’s initialization procedure. To be able to deserialize the pipeline, its classes have to be available in the REST service’s classpath.

Listing 6. A REST interface for the machine learning pipeline

@SpringBootApplication
@RestController
public class RestfulEstimator {
   private final Estimator estimator;

   // load the trained, serialized pipeline from the classpath on startup
   RestfulEstimator(@Value("${filename}") String pipelineInstanceFilename) throws IOException {
      this.estimator = Pipeline.load(new ClassPathResource(pipelineInstanceFilename).getInputStream());
   }

   @RequestMapping(value = "/predictions", method = RequestMethod.POST)
   public List<Object> batchPredict(@RequestBody ArrayList<HashMap<String, Object>> records) {
      return estimator.predict(records);
   }

   public static void main(String[] args) {
      SpringApplication.run(RestfulEstimator.class, args);
   }
}
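A caller of the /predictions endpoint in Listing 6 could build its request with the JDK's built-in HTTP client (Java 11 or later). The sketch below only constructs the request. The class name PredictionRequestExample, the base URI, and the JSON payload are hypothetical; the feature names are borrowed from Listing 1.

```java
import java.net.URI;
import java.net.http.HttpRequest;

final class PredictionRequestExample {

    // Builds a POST request carrying a JSON array of house records.
    static HttpRequest predictionRequest(String baseUri, String jsonRecords) {
        return HttpRequest.newBuilder(URI.create(baseUri + "/predictions"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(jsonRecords))
                .build();
    }

    public static void main(String[] args) {
        // hypothetical host and payload
        HttpRequest request = predictionRequest("http://localhost:8080",
                "[{\"MSSubClass\": 60.0, \"LotFrontage\": 65.0}]");
        System.out.println(request.method() + " " + request.uri());
        // prints POST http://localhost:8080/predictions
        // To send it: HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())
    }
}
```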

Typically, all artifacts required to run the server are packaged within a server JAR file. A server JAR file such as server-pipeline-estimate-houseprice-1.0.3-1568611516.jar could include the pipeline-estimate-houseprice-1.0.3.jar, the serialized pipeline pipeline-estimate-houseprice-1.0.3-1568611516.ser, and all required third-party libraries.

To build such an executable server JAR file, you could use a CI/CD pipeline as shown in Listing 7. The simplified bash script clones the source code of the generic REST service and adds the dependencies of the Houseprice pipeline, as well as the serialized, trained pipeline file. In this case, the Maven build tool is used to compile and package the code. Maven resolves and merges the third-party library dependencies of the generic REST server and the Houseprice pipeline during the build, making it easier to detect and avoid version conflicts between the generic REST server code and the pipeline code.

Note that the bash script below includes an additional step after producing the executable server JAR file: a Docker container image is built in task 6. The script thus provides both an executable server JAR file and a Docker container image.

Listing 7. Bash script to build a RESTful machine learning data pipeline


mkdir build
cd build

echo task 1: copying framework-rest source to local dir
git clone --quiet -b
cd ml_deploy/module-pipeline-rest

echo task 2: download trained pipeline to pipeline-rest/src/main/resources dir
mkdir src/main/resources
curl -s -L $pipeline_instance_uri --output src/main/resources/$pipeline_instance
echo "filename: $pipeline_instance" > src/main/resources/application.yml

echo task 3: adding the pipeline artefact id to framework-rest pom.xml file
pom=$(<pom.xml)
new_pom=${pom/"<!-- PLACEHOLDER -->"/$additional_dependency}
echo "$new_pom" > pom.xml

echo task 4: build rest server jar including the specific pipeline artifacts
mvn -q clean install package

echo task 5: copying the newly created jar file into the root of the build dir
cp target/pipeline-rest-1.0.3.jar ../../$server_jar
cd ../../..

echo task 6: build docker image
docker build --build-arg arg_server_jar=$server_jar -t $groupId"/"$artifactId":"$version"-"$timestamp .

rm -rf build

Machine learning with Docker containers

Although the newly created executable server JAR is a deployable and runnable artifact, DevOps engineers and system administrators often prefer Docker containers over executable JARs. Essentially, a Docker container can be seen as a customized software stack including a virtual operating system running on top of a host operating system. This allows you to package up an application with all of its required parts, system components, and configurations. In contrast to traditional virtual machine solutions, a Docker container uses the same kernel as the host system that it runs on, which reduces the overhead of virtualization.

Figure 9. Running the Docker container image

For instance, you could create a Docker container image including a slim Debian Linux distribution, the newest OpenJDK runtime, as well as your executable server JAR. In contrast to a JAR-based deployment, Docker makes it easy to implement a customized configuration such as specific JVM garbage collector settings or to install custom certificates as part of your deployment unit. Instead of delivering an executable JAR with a longer or shorter list of installation prerequisites, you would provide a self-contained Docker container, with nothing else to install.

To assemble a new Docker container image, you have to define a DOCKERFILE containing a collection of Docker commands instructing Docker as to how it should build your image. In the example below, a new Docker image will be built based on an OpenJDK/buster base image, including the Debian Linux distribution and OpenJDK 13. With the exception of the last command, all commands will be executed at the Docker image build time. Essentially, the DOCKERFILE copies the server JAR file located in the build directory into the container’s file system. Assuming the Docker container has been started, the last command will run the Java-based REST service.

Listing 8. DOCKERFILE to build the machine learning data pipeline

FROM openjdk:13-jdk-slim-buster

# build time params (provided by 'docker build --build-arg arg_server_jar=server-pipeline-estimate-houseprice-1.0.3-1568611516.jar')
ARG arg_server_jar

# copy the executable server jar file into the docker container
COPY build/$arg_server_jar /etc/restserver/$arg_server_jar

# copy the build time param to a runtime param (required for runtime command CMD below)
ENV server_jar=$arg_server_jar

# default command, which will be executed on runtime by performing 'docker run'
CMD java -jar /etc/restserver/$server_jar

When you execute the image build process, Docker reads the DOCKERFILE in the local directory. In the example below, a Docker container image will be created and tagged with a unique Docker identifier. By default, the newly created Docker image is stored in your local Docker environment.

docker build --build-arg arg_server_jar=server-pipeline-estimate-houseprice-1.0.3-1568611516.jar -t 

The docker run command will be used as shown below, to load the image and start the container.

docker run -p 9090:8080

In most cases, additional environment parameters such as the -p parameter will be set. The -p parameter is used to make the 8080 port of the Java server inside the Docker container available to services outside of the container. In this example, port 9090 of the host system will be mapped to Docker’s internal Java server port 8080.

Additional parameters limit the resource consumption of the Docker container. For instance, the -m parameter limits the container’s access to memory. Typically, such resource limiting parameters will be used to implement a Bulkhead stability pattern. The Bulkhead pattern helps to protect systems against cascading errors. For instance, a buggy Java server inside the container may start to consume more and more memory and CPU power. If the consumption is limited by using Docker’s resource parameters, other containers running on the same host will not be negatively affected by running out of memory or waiting for CPU time.


This tutorial has introduced a generalized process for training, deploying, and processing a machine learning model in a production environment. In practice, numerous requirements and conditions will weigh on the approach you use to put a machine learning model into production. Depending on your business requirements, machine learning models may have to be executed using a real-time solution such as a streams-based architecture, or a batch-oriented architecture that prioritizes throughput for heavy data loads. Additional factors to consider are the communication patterns, which may favor a database/filesystem-based pipeline API, a streams-based pipeline API, or a REST-based pipeline API. The whole pipeline may be packaged as a single deployment unit, or parts of the preprocessing components may be packaged as dedicated deployment units. Furthermore, the pipeline may be deployed as a self-contained Docker container, or you could use a central model repository, serving nodes to load and process models in a dynamic way like TensorFlow Serving does.

In contrast to traditional software development, all of these approaches require that you handle an additional dimension of complexity. In traditional programming, you hardcode the behavior of the program. In a machine learning pipeline, you also write code, but the code you write will be trained and adjusted based on production data, which adapts the behavior of the program. In contrast to traditional programming, the unit of deployment is a trained, frozen instance, which makes deployment and software maintenance more complex. The key to handling this additional dimension of complexity is to make things reproducible. With comprehensive version and release management, you will be able to re-train and re-deploy a pipeline instance, such that given the same raw data as input it will return the exact same output. This gives you the ability to deploy and run your machine learning pipelines in production environments in a controllable, transparent, and maintainable manner.

This story, “Machine learning for Java developers, Part 2: Deploying your machine learning model” was originally published by

