From H2O to POJO Models

Getting started with a minimal example

A while ago, I was experimenting with h2o and wanted to generate a Plain Old Java Object (POJO) model. I found the documentation useful but I decided to write a post with a simple example for future reference.

In this post, we will see how to:

  • build a simple h2o model in R.
  • convert the model to POJO.
  • create a main program in Java to use the POJO model.
  • compile and run the program.

Initialize h2o

First of all, we need to load the h2o library and connect to an h2o instance.

## load packages
library(h2o)

## initialize h2o
h2o.init()

Import data

Once connected to h2o, we import the iris dataset, which we are going to use in our example, with h2o.importFile().

## import data
iris_path = system.file("extdata", "iris.csv", package = "h2o")

iris_hex = h2o.importFile(path = iris_path, destination_frame = "iris_hex")

We can check the returned data frame, which has five columns: the first four hold the flower features and the last one holds the class/label.

## display dataframe
iris_hex
   C1  C2  C3  C4          C5
1 5.1 3.5 1.4 0.2 Iris-setosa
2 4.9 3.0 1.4 0.2 Iris-setosa
3 4.7 3.2 1.3 0.2 Iris-setosa
4 4.6 3.1 1.5 0.2 Iris-setosa
5 5.0 3.6 1.4 0.2 Iris-setosa
6 5.4 3.9 1.7 0.4 Iris-setosa

[150 rows x 5 columns] 

Build h2o model

For demo purposes, we will create a simple k-means model to cluster the entries in the iris dataset.

## create kmeans model with 3 clusters
iris_km_model = h2o.kmeans(training_frame = iris_hex,
                           x = c("C1", "C2", "C3", "C4"),
                           k = 3,
                           model_id = "iris_km_model",
                           seed = 1000)

You can print some info about the model as follows:

## print model summary
iris_km_model
Model Details:
==============

H2OClusteringModel: kmeans
Model ID:  iris_km_model 
Model Summary: 
  number_of_rows number_of_clusters number_of_categorical_columns
1            150                  3                             0
  number_of_iterations within_cluster_sum_of_squares total_sum_of_squares
1                    6                     140.02859            596.00000
  between_cluster_sum_of_squares
1                      455.97141


H2OClusteringMetrics: kmeans
** Reported on training data. **


Total Within SS:  140.0286
Between SS:  455.9714
Total SS:  596 
Centroid Statistics: 
  centroid     size within_cluster_sum_of_squares
1        1 52.00000                      42.99326
2        2 48.00000                      48.87702
3        3 50.00000                      48.15831
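
The sums of squares in this summary can be cross-checked by hand. The total should equal the within-cluster plus the between-cluster sum of squares. In addition, assuming the four columns were standardized before clustering (the default of h2o.kmeans, as far as I know), each column contributes (rows - 1) to the total. The small Java sketch below only redoes this arithmetic with the numbers copied from the output above; the class and method names are made up for illustration and nothing here calls h2o.

```java
// Cross-check of the sums of squares printed in the model summary.
// The numbers are copied from the output above; nothing here calls h2o.
class SumOfSquaresCheck {
  static final double WITHIN_SS = 140.02859;
  static final double BETWEEN_SS = 455.97141;
  static final double TOTAL_SS = 596.0;

  // Total SS expected for standardized data (an assumption of this sketch):
  // each column has unit sample variance, so its sum of squared deviations
  // from the column mean is (rows - 1).
  static double expectedTotalSs(int rows, int columns) {
    return (rows - 1) * (double) columns;
  }

  public static void main(String[] args) {
    System.out.println("within + between = " + (WITHIN_SS + BETWEEN_SS));
    System.out.println("(150 - 1) * 4    = " + expectedTotalSs(150, 4));
    System.out.println("reported total   = " + TOTAL_SS);
  }
}
```

All three printed values should agree up to rounding, which is a quick sanity check that the model was trained on all 150 rows and 4 numeric columns.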

Convert h2o model to POJO

To convert your model to a POJO, you can simply call h2o.download_pojo() with the path of the directory where the Java file should be saved.

## download pojo model (replace "saved_models" with your path)
h2o.download_pojo(iris_km_model, path = here::here("saved_models"))

A Java file, named iris_km_model.java, will be created in the given directory.

The code of the POJO model in this file is shown below.

/*
  Licensed under the Apache License, Version 2.0
    http://www.apache.org/licenses/LICENSE-2.0.html

  AUTOGENERATED BY H2O at 2019-07-13T17:22:02.421+02:00
  3.22.1.1
  
  Standalone prediction code with sample test data for KMeansModel named iris_km_model

  How to download, compile and execute:
      mkdir tmpdir
      cd tmpdir
      curl http:/localhost/127.0.0.1:54321/3/h2o-genmodel.jar > h2o-genmodel.jar
      curl http:/localhost/127.0.0.1:54321/3/Models.java/iris_km_model > iris_km_model.java
      javac -cp h2o-genmodel.jar -J-Xmx2g -J-XX:MaxPermSize=128m iris_km_model.java

     (Note:  Try java argument -XX:+PrintCompilation to show runtime JIT compiler behavior.)
*/
import java.util.Map;
import hex.genmodel.GenModel;
import hex.genmodel.annotations.ModelPojo;
import hex.genmodel.IClusteringModel;

@ModelPojo(name="iris_km_model", algorithm="kmeans")
public class iris_km_model extends GenModel implements IClusteringModel {
  public hex.ModelCategory getModelCategory() { return hex.ModelCategory.Clustering; }

  // Names of columns used by model.
  public static final String[] NAMES = NamesHolder_iris_km_model.VALUES;

  // Column domains. The last array contains domain of response column.
  public static final String[][] DOMAINS = new String[][] {
    /* C1 */ null,
    /* C2 */ null,
    /* C3 */ null,
    /* C4 */ null
  };

  public iris_km_model() { super(NAMES,DOMAINS,null); }
  public String getUUID() { return Long.toString(2946031392675382139L); }

  // Pass in data in a double[], pre-aligned to the Model's requirements.
  // Jam predictions into the preds[] array; preds[0] is reserved for the
  // main prediction (class for classifiers or value for regression),
  // and remaining columns hold a probability distribution for classifiers.
  public final double[] score0( double[] data, double[] preds ) {
    Kmeans_preprocessData(data,iris_km_model_MEANS.VALUES,iris_km_model_MULTS.VALUES,iris_km_model_MODES.VALUES);
    preds[0] = KMeans_closest(iris_km_model_CENTERS.VALUES, data, DOMAINS); 
    return preds;
  }

  // Pass in data in a double[], in a same way as to the score0 function.
  // Cluster distances will be stored into the distances[] array. Function
  // will return the closest cluster. This way the caller can avoid to call
  // score0(..) to retrieve the cluster where the data point belongs.
  public final int distances( double[] data, double[] distances ) {
    Kmeans_preprocessData(data,iris_km_model_MEANS.VALUES,iris_km_model_MULTS.VALUES,iris_km_model_MODES.VALUES);
    int cluster = KMeans_distances(iris_km_model_CENTERS.VALUES, data, DOMAINS, distances); 
    return cluster;
  }

  // Returns number of cluster used by this model.
  public final int getNumClusters() {
    int nclusters = iris_km_model_CENTERS.VALUES.length;
    return nclusters;
  }
}
// The class representing training column names
class NamesHolder_iris_km_model implements java.io.Serializable {
  public static final String[] VALUES = new String[4];
  static {
    NamesHolder_iris_km_model_0.fill(VALUES);
  }
  static final class NamesHolder_iris_km_model_0 implements java.io.Serializable {
    static final void fill(String[] sa) {
      sa[0] = "C1";
      sa[1] = "C2";
      sa[2] = "C3";
      sa[3] = "C4";
    }
  }
}
// Column means of training data
class iris_km_model_MEANS implements java.io.Serializable {
  public static final double[] VALUES = new double[4];
  static {
    iris_km_model_MEANS_0.fill(VALUES);
  }
  static final class iris_km_model_MEANS_0 implements java.io.Serializable {
    static final void fill(double[] sa) {
      sa[0] = 5.843333333333333;
      sa[1] = 3.053999999999999;
      sa[2] = 3.758666666666667;
      sa[3] = 1.1986666666666665;
    }
  }
}
// Reciprocal of column standard deviations of training data
class iris_km_model_MULTS implements java.io.Serializable {
  public static final double[] VALUES = new double[4];
  static {
    iris_km_model_MULTS_0.fill(VALUES);
  }
  static final class iris_km_model_MULTS_0 implements java.io.Serializable {
    static final void fill(double[] sa) {
      sa[0] = 1.2076330213409388;
      sa[1] = 2.3063033203973875;
      sa[2] = 0.5667583466456685;
      sa[3] = 1.3103399393571;
    }
  }
}
// Mode for categorical columns
class iris_km_model_MODES implements java.io.Serializable {
  public static final int[] VALUES = new int[4];
  static {
    iris_km_model_MODES_0.fill(VALUES);
  }
  static final class iris_km_model_MODES_0 implements java.io.Serializable {
    static final void fill(int[] sa) {
      sa[0] = -1;
      sa[1] = -1;
      sa[2] = -1;
      sa[3] = -1;
    }
  }
}
// Normalized cluster centers[K][features]
class iris_km_model_CENTERS implements java.io.Serializable {
  public static final double[][] VALUES = new double[3][];
  static {
    iris_km_model_CENTERS_0.fill(VALUES);
  }
  static class iris_km_model_CENTERS_0_0 implements java.io.Serializable {
    public static final double[] VALUES = new double[4];
    static {
      iris_km_model_CENTERS_0_0_0.fill(VALUES);
    }
    static final class iris_km_model_CENTERS_0_0_0 implements java.io.Serializable {
      static final void fill(double[] sa) {
        sa[0] = -0.0685873626223117;
        sa[1] = -0.8873945545098232;
        sa[2] = 0.3438624614956358;
        sa[3] = 0.2839741837806723;
      }
    }
}
  static class iris_km_model_CENTERS_0_1 implements java.io.Serializable {
    public static final double[] VALUES = new double[4];
    static {
      iris_km_model_CENTERS_0_1_0.fill(VALUES);
    }
    static final class iris_km_model_CENTERS_0_1_0 implements java.io.Serializable {
      static final void fill(double[] sa) {
        sa[0] = 1.1276273336771014;
        sa[1] = 0.08687075840163734;
        sa[2] = 0.9821922147369432;
        sa[3] = 0.9954215739316106;
      }
    }
}
  static class iris_km_model_CENTERS_0_2 implements java.io.Serializable {
    public static final double[] VALUES = new double[4];
    static {
      iris_km_model_CENTERS_0_2_0.fill(VALUES);
    }
    static final class iris_km_model_CENTERS_0_2_0 implements java.io.Serializable {
      static final void fill(double[] sa) {
        sa[0] = -1.0111913832028125;
        sa[1] = 0.8394944086246519;
        sa[2] = -1.3005214861029282;
        sa[3] = -1.2509378621062437;
      }
    }
}
  static final class iris_km_model_CENTERS_0 implements java.io.Serializable {
    static final void fill(double[][] sa) {
      sa[0] = iris_km_model_CENTERS_0_0.VALUES;
      sa[1] = iris_km_model_CENTERS_0_1.VALUES;
      sa[2] = iris_km_model_CENTERS_0_2.VALUES;
    }
  }
}
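
Reading the generated file, score0() appears to do two things: standardize the input with the stored column means and reciprocal standard deviations, then pick the nearest of the three normalized centers. The sketch below reproduces that arithmetic in plain Java, without h2o-genmodel.jar, using the constants copied from the file above. It is an illustration rather than the library's implementation: the class and method names are hypothetical, squared Euclidean distance is my assumption, and missing values and categorical columns (which the real helpers in GenModel also deal with) are ignored.

```java
// Sketch of the arithmetic behind score0(): standardize, then nearest center.
// Constants are copied from the generated iris_km_model.java above.
// Assumptions: numeric columns only, no missing values, squared Euclidean distance.
class KMeansScoringSketch {
  static final double[] MEANS = {
      5.843333333333333, 3.053999999999999, 3.758666666666667, 1.1986666666666665};
  static final double[] MULTS = {
      1.2076330213409388, 2.3063033203973875, 0.5667583466456685, 1.3103399393571};
  static final double[][] CENTERS = {
      {-0.0685873626223117, -0.8873945545098232, 0.3438624614956358, 0.2839741837806723},
      {1.1276273336771014, 0.08687075840163734, 0.9821922147369432, 0.9954215739316106},
      {-1.0111913832028125, 0.8394944086246519, -1.3005214861029282, -1.2509378621062437}};

  // (value - column mean) * (1 / column standard deviation)
  static double[] standardize(double[] row) {
    double[] z = new double[row.length];
    for (int i = 0; i < row.length; i++) {
      z[i] = (row[i] - MEANS[i]) * MULTS[i];
    }
    return z;
  }

  // Zero-based index of the center closest to the standardized row.
  static int closestCenter(double[] row) {
    double[] z = standardize(row);
    int best = -1;
    double bestDist = Double.POSITIVE_INFINITY;
    for (int c = 0; c < CENTERS.length; c++) {
      double dist = 0.0;
      for (int i = 0; i < z.length; i++) {
        double d = z[i] - CENTERS[c][i];
        dist += d * d;
      }
      if (dist < bestDist) {
        bestDist = dist;
        best = c;
      }
    }
    return best;
  }

  public static void main(String[] args) {
    System.out.println("cluster: " + closestCenter(new double[] {6.4, 3.2, 5.3, 2.3}));
  }
}
```

For the input 6.4 3.2 5.3 2.3 the nearest center is the one at index 1, which agrees with the cluster printed by the compiled program at the end of this post.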

Create main java program

Now you need to write the main program main.java and save it in the same directory as iris_km_model.java. The program takes four values and prints the predicted cluster.

Note that you need to:

  • provide the POJO model name, i.e. the class name in the file downloaded in the previous step (iris_km_model).

  • use the right prediction class for your model, which is ClusteringModelPrediction here. You can find further details about the classes on the POJO Model Javadoc page.

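import java.io.*;
import hex.genmodel.easy.RowData;
import hex.genmodel.easy.EasyPredictModelWrapper;
import hex.genmodel.easy.prediction.*;

public class main {
  // Name of the POJO class generated by h2o.download_pojo()
  private static String modelClassName = "iris_km_model";

  public static void main(String[] args) throws Exception {
    // Load the POJO class by name and wrap it for row-by-row prediction.
    // getDeclaredConstructor().newInstance() replaces Class.newInstance(),
    // which is deprecated since Java 9.
    hex.genmodel.GenModel rawModel;
    rawModel = (hex.genmodel.GenModel) Class.forName(modelClassName).getDeclaredConstructor().newInstance();
    EasyPredictModelWrapper model = new EasyPredictModelWrapper(rawModel);

    // Map the four command-line values to the column names used in training
    RowData row = new RowData();
    row.put("C1", args[0]);
    row.put("C2", args[1]);
    row.put("C3", args[2]);
    row.put("C4", args[3]);

    // Score the row with the prediction class that matches a clustering model
    ClusteringModelPrediction p = model.predictClustering(row);

    System.out.printf("Input values: %s %s %s %s \n", args[0], args[1], args[2], args[3]);
    System.out.printf("cluster: %s\n", p.cluster);
  }
}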

Compile java program

Now it is time to compile your program, but first you need to download h2o-genmodel.jar into the same directory as iris_km_model.java and main.java. The jar is served by the h2o instance, so the instance must still be running.

$ cd saved_models
$ curl http://localhost:54321/3/h2o-genmodel.jar > h2o-genmodel.jar

Having the three files, you can compile your program as follows:

$ javac -cp h2o-genmodel.jar -J-Xmx2g -J-XX:MaxPermSize=128m iris_km_model.java main.java

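If things work fine, you will get no errors and you will find new .class files generated in the same directory. The -J-XX:MaxPermSize option is taken from the header of the generated file; the permanent generation was removed in Java 8, so newer JDKs may warn about this option or reject it, in which case it can simply be dropped.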

Make predictions

Now you are ready to use the compiled program and make predictions. To do so, run java -cp ".;h2o-genmodel.jar" main followed by the four expected inputs (corresponding to C1, C2, C3, C4) as follows:

$ java -cp ".;h2o-genmodel.jar" main 6.4 3.2 5.3 2.3

The output follows the format specified in main.java. Here I set it to print the input values and then the predicted cluster.

Input values: 6.4 3.2 5.3 2.3
cluster: 1
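
Since main.java reads args[0] to args[3] directly, running it with fewer than four values fails with an ArrayIndexOutOfBoundsException. One possible way to guard against that is to validate the arguments first. The following is a minimal sketch under my own assumptions; the class name, method name and messages are made up for illustration and are not part of h2o.

```java
import java.util.Arrays;

// Sketch of command-line validation for the four inputs (C1..C4).
// Names and messages are hypothetical; nothing here depends on h2o.
class ArgParsingSketch {
  // Returns the four values as doubles, or throws IllegalArgumentException
  // with a short hint when the count is wrong or a value is not numeric.
  static double[] parseFeatures(String[] args) {
    if (args.length != 4) {
      throw new IllegalArgumentException("Usage: java main <C1> <C2> <C3> <C4>");
    }
    double[] values = new double[4];
    for (int i = 0; i < 4; i++) {
      try {
        values[i] = Double.parseDouble(args[i]);
      } catch (NumberFormatException e) {
        throw new IllegalArgumentException("C" + (i + 1) + " is not a number: " + args[i], e);
      }
    }
    return values;
  }

  public static void main(String[] args) {
    double[] features = parseFeatures(new String[] {"6.4", "3.2", "5.3", "2.3"});
    System.out.println(Arrays.toString(features));
  }
}
```

Calling such a check at the top of main() turns a stack trace into a readable usage message.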

NOTE:

  • I used ; as the classpath separator in java -cp ".;h2o-genmodel.jar" because I am on Windows. On Linux and macOS the separator is :, so the option becomes -cp ".:h2o-genmodel.jar".

  • There could be better ways to do the same thing, but this was a minimal example with some basics for demo purposes.
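
Regarding the separator in the first note: Java exposes the platform's classpath separator as File.pathSeparator, so a command line can be assembled without hard-coding ; or :. A small sketch, with a hypothetical class and method name:

```java
import java.io.File;

// Builds the -cp value with the separator of the current platform:
// ";" on Windows, ":" on Linux and macOS.
class ClasspathSketch {
  static String classpath(String... entries) {
    return String.join(File.pathSeparator, entries);
  }

  public static void main(String[] args) {
    String cp = classpath(".", "h2o-genmodel.jar");
    System.out.println("java -cp \"" + cp + "\" main 6.4 3.2 5.3 2.3");
  }
}
```

Running it prints the prediction command in the form that suits the machine it runs on.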

Session info

> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] h2o_3.22.1.1

loaded via a namespace (and not attached):
 [1] compiler_3.5.0  bookdown_0.9    htmltools_0.3.6 tools_3.5.0     RCurl_1.95-4.12 yaml_2.2.0     
 [7] Rcpp_1.0.1      rmarkdown_1.12  blogdown_0.11   knitr_1.22      jsonlite_1.6    xfun_0.5       
[13] digest_0.6.18   bitops_1.0-6    evaluate_0.13 