TVM User Tutorial -- Quick Start Tutorial for Compiling Deep Learning Models

Posted by dolcezza on Sun, 06 Mar 2022 13:26:32 +0100

Author: Yao Wang, Truman Tian

This example shows how to use the Relay Python frontend to build a neural network and generate a runtime library for Nvidia GPU with TVM. Note that you need to build TVM with CUDA and LLVM enabled.

Overview of hardware backends supported by TVM

The following figure shows the hardware backends currently supported by TVM:

[Figure not included: hardware backends supported by TVM]

In this tutorial, we will select cuda and llvm as the target backends. First, let's import Relay and TVM.

import numpy as np

from tvm import relay
from tvm.relay import testing
import tvm
from tvm import te
from tvm.contrib import graph_executor
import tvm.testing

Defining neural networks in Relay

First, let's define a neural network with the Relay Python frontend. For simplicity, we will use the predefined resnet-18 network in Relay. Parameters are initialized with the Xavier initializer. Relay also supports other model formats, such as MXNet, CoreML, ONNX, and TensorFlow.
In this tutorial, we assume that inference will run on our device and set the batch size to 1. The input images are RGB color images of size 224 * 224. We can call tvm.relay.expr.TupleWrapper.astext() to display the network structure.

batch_size = 1
num_class = 1000
image_shape = (3, 224, 224)
data_shape = (batch_size,) + image_shape
out_shape = (batch_size, num_class)

mod, params = relay.testing.resnet.get_workload(
    num_layers=18, batch_size=batch_size, image_shape=image_shape
)

# set show_meta_data=True if you want to show meta data
print(mod.astext(show_meta_data=False))
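
As a quick sanity check on the shapes defined above, the sketch below works out how many values a single input holds. It uses only the Python standard library and assumes float32 data (4 bytes per element); the names num_elements and num_bytes are introduced here for illustration only.

```python
import math

batch_size = 1
image_shape = (3, 224, 224)  # (channels, height, width)
data_shape = (batch_size,) + image_shape  # (batch, channels, height, width)

# Total number of values in one input batch.
num_elements = math.prod(data_shape)

# Assumption: float32 input, i.e. 4 bytes per element.
num_bytes = num_elements * 4

print(data_shape)    # (1, 3, 224, 224)
print(num_elements)  # 150528
print(num_bytes)     # 602112
```

So one batch of input is roughly 0.6 MB, which is the amount of data copied to the device for each inference call.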


The next step is to compile the model using the Relay/TVM pipeline. You can specify the optimization level of the compilation; at present, this value can be 0 to 3. The optimization passes include operator fusion, pre-computation, layout transformation, and so on. relay.build() returns three components: the execution graph in JSON format, the TVM module library of functions compiled specifically for this graph on the target hardware, and the parameter blobs of the model. During compilation, Relay performs the graph-level optimization while TVM performs the tensor-level optimization, resulting in an optimized runtime module for model serving.
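
To give a feel for what operator fusion means, here is a toy sketch in plain Python. It is not how Relay implements fusion; the functions scale, add_bias, relu, unfused and fused are hypothetical names made up for this illustration. The point is only that three element-wise passes with two intermediate results can be collapsed into a single pass that computes the same values.

```python
def scale(xs, factor):
    return [x * factor for x in xs]


def add_bias(xs, bias):
    return [x + bias for x in xs]


def relu(xs):
    return [max(x, 0.0) for x in xs]


def unfused(xs, factor, bias):
    # Three separate passes over the data, two intermediate lists.
    return relu(add_bias(scale(xs, factor), bias))


def fused(xs, factor, bias):
    # One pass over the data, no intermediate lists.
    return [max(x * factor + bias, 0.0) for x in xs]


xs = [-2.0, -0.5, 0.0, 1.5]
assert unfused(xs, 2.0, 1.0) == fused(xs, 2.0, 1.0)
print(fused(xs, 2.0, 1.0))  # [0.0, 0.0, 1.0, 4.0]
```

On a GPU, the practical benefit of the fused form is fewer kernel launches and less traffic to and from memory.
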
We will first compile for Nvidia GPU. Behind the scenes, relay.build() first performs a number of graph-level optimizations, such as pruning and fusion, and then registers the operators (i.e. the nodes of the optimized graph) to TVM implementations to generate a TVM module. To generate the module library, TVM first lowers the high-level IR into the low-level intrinsic IR of the specified target backend, which is CUDA in this example. The machine code is then generated as the module library.

opt_level = 3
target = tvm.target.cuda()
with tvm.transform.PassContext(opt_level=opt_level):
    lib = relay.build(mod, target, params=params)
/workspace/python/tvm/target/ UserWarning: Try specifying cuda arch by adding 'arch=sm_xx' to your target.
  warnings.warn("Try specifying cuda arch by adding 'arch=sm_xx' to your target.")
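
The warning above suggests adding 'arch=sm_xx' to the target. A minimal sketch of building such a target string follows. The helper cuda_target_string is a hypothetical name introduced here, and the compute capability (7, 5) is only an example value; check which one your own GPU reports.

```python
def cuda_target_string(compute_capability):
    """Build a target string such as 'cuda -arch=sm_75' (hypothetical helper)."""
    major, minor = compute_capability
    return f"cuda -arch=sm_{major}{minor}"


# Example value only: replace (7, 5) with your GPU's compute capability.
target_str = cuda_target_string((7, 5))
print(target_str)  # cuda -arch=sm_75

# With TVM installed, the string could then be turned into a target, e.g.:
#   target = tvm.target.Target(target_str)
```

Passing a target built this way to relay.build() in place of the default CUDA target should silence the warning.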

Run the generated library

Now we can create a graph executor and run the module on the Nvidia GPU.

# create random input
dev = tvm.cuda()
data = np.random.uniform(-1, 1, size=data_shape).astype("float32")
# create module
module = graph_executor.GraphModule(lib["default"](dev))
# set input and parameters
module.set_input("data", data)
# run
module.run()
# get output
out = module.get_output(0, tvm.nd.empty(out_shape)).numpy()

# Print first 10 elements of output
print(out.flatten()[0:10])
[0.00089283 0.00103331 0.0009094  0.00102275 0.00108751 0.00106737
 0.00106262 0.00095838 0.00110792 0.00113151]
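
In a real classification setting you would usually look for the highest-scoring classes rather than print raw values. The sketch below shows one way to do that in plain Python. The helper top_k is a hypothetical name, and the ten values printed above are used as stand-in data; with the real output you could pass out.flatten().tolist() instead.

```python
def top_k(scores, k):
    """Return (index, score) pairs for the k largest scores, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order[:k]]


# Stand-in data: the first 10 output values printed above.
scores = [0.00089283, 0.00103331, 0.0009094, 0.00102275, 0.00108751,
          0.00106737, 0.00106262, 0.00095838, 0.00110792, 0.00113151]

best = top_k(scores, 3)
print([i for i, _ in best])  # [9, 8, 4]
```

Because the network here uses randomly initialized parameters and random input, the ranking carries no meaning; the same code becomes useful once trained weights are loaded.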

Save and load the compiled module

We can also save the graph, library, and parameters to files and then load them back in the deployment environment.

# save the graph, lib and params into separate files
from tvm.contrib import utils

temp = utils.tempdir()
path_lib = temp.relpath("deploy_lib.tar")
lib.export_library(path_lib)

# load the module back.
loaded_lib = tvm.runtime.load_module(path_lib)
input_data = tvm.nd.array(data)

module = graph_executor.GraphModule(loaded_lib["default"](dev))
module.run(data=input_data)
out_deploy = module.get_output(0).numpy()

# Print first 10 elements of output
print(out_deploy.flatten()[0:10])
# check whether the output from deployed module is consistent with original one
tvm.testing.assert_allclose(out_deploy, out, atol=1e-5)
[0.00089283 0.00103331 0.0009094  0.00102275 0.00108751 0.00106737
 0.00106262 0.00095838 0.00110792 0.00113151]
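
The consistency check above passes when every pair of values agrees within a tolerance. The sketch below spells out that kind of test in plain Python, using the criterion NumPy documents for its allclose-style checks, |actual - desired| <= atol + rtol * |desired|. It is an assumption that TVM's helper applies the same rule, and all_close is a hypothetical name for this illustration.

```python
def all_close(actual, desired, rtol=1e-7, atol=1e-5):
    """True if every pair satisfies |a - d| <= atol + rtol * |d|."""
    return all(abs(a - d) <= atol + rtol * abs(d)
               for a, d in zip(actual, desired))


original = [0.00089283, 0.00103331]
deployed = [0.00089284, 0.00103330]   # differs by about 1e-8
perturbed = [0.00099283, 0.00103331]  # first value differs by about 1e-4

print(all_close(deployed, original))   # True
print(all_close(perturbed, original))  # False
```

A tolerance such as atol=1e-5 allows for small floating-point differences between two runs while still catching a module that was saved or loaded incorrectly.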
