This repository shows how to implement a REST server for low-latency image classification (inference) using NVIDIA GPUs. This is an initial demonstration of the GRE (GPU REST Engine) software that will allow you to build your own accelerated microservices.
This demonstration makes use of several technologies with which you may be familiar:
- Docker: for bundling all the dependencies of our program and for easier deployment.
- Go: for its efficient builtin HTTP server.
- Caffe: because it has good performance and a simple C++ API.
- cuDNN: for accelerating common deep learning primitives on the GPU.
- OpenCV: to have a simple C++ API for GPU image processing.
- A Kepler or Maxwell NVIDIA GPU with at least 2 GB of memory.
- A Linux system with recent NVIDIA drivers (recommended: 352.79).
- Install the latest version of Docker.
- Install nvidia-docker, prefer using the deb package if you are on Ubuntu.
The command might take a while to execute:
$ docker build -t inference_server -f Dockerfile.inference_server .
To speedup the build you can modify this line to only build for the GPU architecture that you need.
Execute the following command and wait a few seconds for the initialization of the Caffe classifiers:
$ nvidia-docker run --name=server --net=host --rm inference_server
If you encounter an error when executing the command above, you might need to initialize nvidia-docker
with the following command:
$ sudo nvidia-docker volume setup
You can use the environment variable NV_GPU
to isolate GPUs for this container.
Since we used --net=host
, we can access our inference server from a terminal on the host using curl
:
$ curl -XPOST --data-binary @images/1.jpg http://127.0.0.1:8000/api/classify
[{"confidence":0.9998,"label":"n02328150 Angora, Angora rabbit"},{"confidence":0.0001,"label":"n02325366 wood rabbit, cottontail, cottontail rabbit"},{"confidence":0.0001,"label":"n02326432 hare"},{"confidence":0.0000,"label":"n02085936 Maltese dog, Maltese terrier, Maltese"},{"confidence":0.0000,"label":"n02342885 hamster"}]
We can benchmark the performance of our classification server using any tool that can generate HTTP load. We included a Dockerfile for a benchmarking client using a modified version of rakyll/boom:
$ docker build -t inference_client -f Dockerfile.inference_client .
$ docker run -e CONCURRENCY=8 -e REQUESTS=20000 --net=host inference_client
If you have Go
installed on your host, you can also benchmark the server with a client outside of a Docker container:
$ go get github.com/flx42/boom
$ boom -n 200000 -m POST -d @images/2.jpg http://127.0.0.1:8000/api/classify
This machine has 4 GeForce GTX Titan X GPUs:
$ boom -c 8 -n 200000 -m POST -d @images/2.jpg http://127.0.0.1:8000/api/classify
Summary:
Total: 126.5910 secs.
Slowest: 0.0384 secs.
Fastest: 0.0033 secs.
Average: 0.0051 secs.
Requests/sec: 1579.8914
Total Data Received: 68800000 bytes.
Response Size per Request: 344 bytes.
[...]
As a comparison, Caffe in standalone mode achieves 405 images / second on a single Titan X for inference (batch=1
). This shows that our code achieves optimal GPU utilization and good multi-GPU scaling, even when adding a REST API on top. A discussion of GPU performance for inference at different batch sizes can be found in our GPU-Based Deep Learning Inference whitepaper.
This inference server is aimed for low-latency applications, to achieve higher throughput we would need to batch multiple incoming client requests, or have clients send multiple images to classify. Batching can be added easily when using the C++ API of Caffe.
Similarly to the inference server, a simple server code is provided for estimating the overhead of using CUDA kernels in your code. The server will simply call an empty CUDA kernel before responding 200
to the client. The server can be built using the same commands as above:
$ docker build -t benchmark_server -f Dockerfile.benchmark_server .
$ nvidia-docker run --name=server --net=host --rm benchmark_server
And for the client:
$ docker build -t benchmark_client -f Dockerfile.benchmark_client .
$ docker run -e CONCURRENCY=8 -e REQUESTS=200000 --net=host benchmark_client
[...]
Summary:
Total: 5.8071 secs
Slowest: 0.0127 secs
Fastest: 0.0001 secs
Average: 0.0002 secs
Requests/sec: 34440.3083
Feel free to report issues during build or execution. We also welcome suggestions to improve the performance of this application.