Introduction

This repository shows how to implement a REST server for low-latency image classification (inference) using NVIDIA GPUs. This is an initial demonstration of the GRE (GPU REST Engine) software that will allow you to build your own accelerated microservices.

This demonstration makes use of several technologies with which you may be familiar:

Docker: for bundling all the dependencies of our program and for easier deployment.
Go: for its efficient built-in HTTP server.
Caffe: because it has good performance and a simple C++ API.
cuDNN: for accelerating common deep learning primitives on the GPU.
OpenCV: to have a simple C++ API for GPU image processing.

Building

Prerequisites

A Kepler or Maxwell NVIDIA GPU with at least 2 GB of memory.
A Linux system with recent NVIDIA drivers (recommended: 352.79).
The latest version of Docker.
nvidia-docker. If you are on Ubuntu, prefer the deb package.

Build command

The command might take a while to execute:

$ docker build -t inference_server -f Dockerfile.inference_server .

To speed up the build, you can edit the Dockerfile so that it builds only for the GPU architecture that you need.

Testing

Starting the server

Execute the following command and wait a few seconds for the initialization of the Caffe classifiers:

$ nvidia-docker run --name=server --net=host --rm inference_server

If you encounter an error when executing the command above, you might need to initialize nvidia-docker with the following command:

$ sudo nvidia-docker volume setup

You can use the environment variable NV_GPU to isolate GPUs for this container.

Single image

Since we used --net=host, we can access our inference server from a terminal on the host using curl:

$ curl -XPOST --data-binary @images/1.jpg http://127.0.0.1:8000/api/classify
[{"confidence":0.9998,"label":"n02328150 Angora, Angora rabbit"},{"confidence":0.0001,"label":"n02325366 wood rabbit, cottontail, cottontail rabbit"},{"confidence":0.0001,"label":"n02326432 hare"},{"confidence":0.0000,"label":"n02085936 Maltese dog, Maltese terrier, Maltese"},{"confidence":0.0000,"label":"n02342885 hamster"}]
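The same request can be issued from Go. The following is a minimal sketch, not code from this repository: the `Prediction` type and the `parsePredictions` and `classify` helpers are hypothetical names, the JSON field names are taken from the sample response above, and the `application/octet-stream` content type is an assumption (the curl command does not set one explicitly).

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"os"
)

// Prediction mirrors one element of the JSON array in the sample response.
// The type name is hypothetical; the field names come from that response.
type Prediction struct {
	Confidence float64 `json:"confidence"`
	Label      string  `json:"label"`
}

// parsePredictions decodes a response body returned by /api/classify.
func parsePredictions(body []byte) ([]Prediction, error) {
	var preds []Prediction
	if err := json.Unmarshal(body, &preds); err != nil {
		return nil, err
	}
	return preds, nil
}

// classify posts raw image bytes, similar to `curl --data-binary`.
// The content type is an assumption, not something the README specifies.
func classify(url string, image []byte) ([]Prediction, error) {
	resp, err := http.Post(url, "application/octet-stream", bytes.NewReader(image))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status: %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	return parsePredictions(body)
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: classify <image.jpg>")
		return
	}
	image, err := os.ReadFile(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	preds, err := classify("http://127.0.0.1:8000/api/classify", image)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, p := range preds {
		fmt.Printf("%.4f  %s\n", p.Confidence, p.Label)
	}
}
```

With the server running, `go run classify.go images/1.jpg` should print one line per returned label.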

Benchmarking performance

We can benchmark the performance of our classification server using any tool that can generate HTTP load. We included a Dockerfile for a benchmarking client using a modified version of rakyll/boom:

$ docker build -t inference_client -f Dockerfile.inference_client .
$ docker run -e CONCURRENCY=8 -e REQUESTS=20000 --net=host inference_client

If you have Go installed on your host, you can also benchmark the server with a client outside of a Docker container:

$ go get github.com/flx42/boom
$ boom -n 200000 -m POST -d @images/2.jpg http://127.0.0.1:8000/api/classify

Performance on an NVIDIA DIGITS DevBox

This machine has 4 GeForce GTX Titan X GPUs:

$ boom -c 8 -n 200000 -m POST -d @images/2.jpg http://127.0.0.1:8000/api/classify
Summary:
  Total:        126.5910 secs.
  Slowest:      0.0384 secs.
  Fastest:      0.0033 secs.
  Average:      0.0051 secs.
  Requests/sec: 1579.8914
  Total Data Received:  68800000 bytes.
  Response Size per Request:    344 bytes.
[...]

As a comparison, Caffe in standalone mode achieves 405 images/second on a single Titan X for inference (batch=1). With perfect scaling, four GPUs would therefore deliver about 1620 images/second, and the roughly 1580 requests/second measured above is within about 3% of that figure. This indicates near-optimal GPU utilization and good multi-GPU scaling, even with a REST API on top. A discussion of GPU performance for inference at different batch sizes can be found in our GPU-Based Deep Learning Inference whitepaper.

This inference server is aimed at low-latency applications. To achieve higher throughput, we would need to batch multiple incoming client requests or have clients send multiple images to classify. Batching can be added easily when using the C++ API of Caffe.

Benchmarking overhead of CUDA kernel calls

As with the inference server, simple server code is provided for estimating the overhead of using CUDA kernels in your code. The server simply calls an empty CUDA kernel before responding 200 to the client. It can be built and started in the same way as above:

$ docker build -t benchmark_server -f Dockerfile.benchmark_server .
$ nvidia-docker run --name=server --net=host --rm benchmark_server

And for the client:

$ docker build -t benchmark_client -f Dockerfile.benchmark_client .
$ docker run -e CONCURRENCY=8 -e REQUESTS=200000 --net=host benchmark_client
[...]
Summary:
  Total:        5.8071 secs
  Slowest:      0.0127 secs
  Fastest:      0.0001 secs
  Average:      0.0002 secs
  Requests/sec: 34440.3083

Contributing

Feel free to report issues during build or execution. We also welcome suggestions to improve the performance of this application.
