
NVIDIA Triton Inference Server Enters Public Beta on Alibaba Cloud's Machine Learning Platform (PAI)
2021-08-06 14:15
What is NVIDIA Triton?
NVIDIA Triton Inference Server is an open-source inference serving framework from NVIDIA that provides a solution for deploying models for inference both in the cloud and at the edge. The figure below shows Triton's architecture:

What is PAI-EAS?

PAI-EAS (Elastic Algorithm Service) is the online model serving component of Alibaba Cloud's Machine Learning Platform for AI (PAI), used to deploy models as scalable online services.
Why deploy Triton on EAS?
How to use Triton on EAS

First, prepare a Triton model repository. Below is the config.pbtxt for the example inception_graphdef model:
name: "inception_graphdef"
platform: "tensorflow_graphdef"
max_batch_size: 128
input [
  {
    name: "input"
    data_type: TYPE_FP32
    format: FORMAT_NHWC
    dims: [ 299, 299, 3 ]
  }
]
output [
  {
    name: "InceptionV3/Predictions/Softmax"
    data_type: TYPE_FP32
    dims: [ 1001 ]
    label_filename: "inception_labels.txt"
  }
]
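Read together, these fields fix the tensor shapes for every request: the input is an NHWC batch of at most max_batch_size images, and the output is a 1001-way score vector per image. A small sketch of those shapes in Python (the request_shapes helper is hypothetical, not part of Triton):

```python
# Shapes implied by the config above: requests are batched along a leading
# dimension of at most max_batch_size.
MAX_BATCH = 128                 # max_batch_size
INPUT_DIMS = (299, 299, 3)      # NHWC dims of "input"
OUTPUT_DIMS = (1001,)           # InceptionV3 softmax over 1001 classes

def request_shapes(batch_size):
    """Full input/output tensor shapes for one inference request."""
    assert 1 <= batch_size <= MAX_BATCH, "batch exceeds max_batch_size"
    return (batch_size,) + INPUT_DIMS, (batch_size,) + OUTPUT_DIMS

print(request_shapes(1))  # ((1, 299, 299, 3), (1, 1001))
```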
Then upload the model directory to OSS with ossutil (the -r flag copies the directory recursively):

./ossutil cp -r inception_graphdef/ oss://triton-model-repo/models
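Before uploading, the local directory has to follow Triton's repository layout: config.pbtxt next to one numeric version subdirectory per model version. A sketch of that layout (file names mirror the inception_graphdef example; the temp dir stands in for the local directory that ossutil uploads):

```python
# Sketch of the repository layout Triton expects:
#   <repo>/<model_name>/config.pbtxt
#   <repo>/<model_name>/<version>/<model file>
import tempfile
from pathlib import Path

repo = Path(tempfile.mkdtemp()) / "models"
model_dir = repo / "inception_graphdef"
version_dir = model_dir / "1"          # numeric version subdirectory
version_dir.mkdir(parents=True)
(model_dir / "config.pbtxt").write_text('name: "inception_graphdef"\n')
(version_dir / "model.graphdef").touch()  # the TensorFlow GraphDef itself

for path in sorted(p.relative_to(repo) for p in model_dir.rglob("*")):
    print(path)
```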
Next, write the EAS service description (triton.config). Note the "resource" field was empty in the example:

{
  "name": "triton_test",
  "processor": "triton",
  "processor_params": [
    "--model-repository=oss://triton-model-repo/models",
    "--allow-http=true"
  ],
  "metadata": {
    "instance": 1,
    "cpu": 4,
    "gpu": 1,
    "memory": 10000,
    "resource": ""
  }
}
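Since the service description is plain JSON, it can be generated and sanity-checked programmatically before calling eascmd. A minimal sketch (field values mirror the example above; the output path is arbitrary):

```python
import json
import os
import tempfile

# Same fields as the triton.config example above.
service = {
    "name": "triton_test",
    "processor": "triton",
    "processor_params": [
        "--model-repository=oss://triton-model-repo/models",
        "--allow-http=true",
    ],
    "metadata": {"instance": 1, "cpu": 4, "gpu": 1, "memory": 10000},
}

path = os.path.join(tempfile.mkdtemp(), "triton.config")
with open(path, "w") as f:
    json.dump(service, f, indent=2)

with open(path) as f:
    loaded = json.load(f)
print(loaded["processor"])  # triton
```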
Deploy the service with eascmd:

./eascmd create triton.config
[RequestId]: AECDB6A4-CB69-4688-AA35-BA1E020C39E6
+-------------------+------------------------------------------------------------------------------------------------+
| Internet Endpoint | http://****************.cn-shanghai.pai-eas.aliyuncs.com/api/predict/test_triton_processor |
| Intranet Endpoint | http://****************.vpc.cn-shanghai.pai-eas.aliyuncs.com/api/predict/test_triton_processor |
| Token | MmY3M2ExZG***********************WM1ZTBiOWQ3MGYxZGNkZQ== |
+-------------------+------------------------------------------------------------------------------------------------+
[OK] Service is now deploying
[OK] Successfully synchronized resources
[OK] Waiting [Total: 1, Pending: 1, Running: 0]
[OK] Waiting [Total: 1, Pending: 1, Running: 0]
[OK] Running [Total: 1, Pending: 0, Running: 1]
[OK] Service is running
Once the service is running, install the Triton client packages (the brackets are quoted so the shell does not expand them):

pip3 install nvidia-pyindex
pip3 install "tritonclient[all]"

A simple Python client follows; fill in the endpoint and token returned by eascmd (the placeholder strings below are not real values):

import numpy as np
import time
import tritonclient.http as httpclient
from tritonclient.utils import InferenceServerException

# Service endpoint from the eascmd output, without the "http://" prefix.
URL = "<service endpoint>"
# Service token from the eascmd output.
HEADERS = {"Authorization": "<service token>"}

input_img = httpclient.InferInput("input", [1, 299, 299, 3], "FP32")
rand_img = np.random.rand(1, 299, 299, 3).astype(np.float32)
input_img.set_data_from_numpy(rand_img, binary_data=True)
output = httpclient.InferRequestedOutput(
    "InceptionV3/Predictions/Softmax", binary_data=True
)
triton_client = httpclient.InferenceServerClient(url=URL, verbose=False)

start = time.time()
for i in range(10):
    results = triton_client.infer(
        "inception_graphdef", inputs=[input_img], outputs=[output], headers=HEADERS
    )
    res_body = results.get_response()
    elapsed_ms = (time.time() - start) * 1000
    if i == 0:
        print("model name: ", res_body["model_name"])
        print("model version: ", res_body["model_version"])
        print("output name: ", res_body["outputs"][0]["name"])
        print("output shape: ", res_body["outputs"][0]["shape"])
    print("[{}] Avg rt(ms): {:.2f}".format(i, elapsed_ms))
    start = time.time()
The client prints the model metadata once, then the latency of each request:

model name: inception_graphdef
model version: 1
output name: InceptionV3/Predictions/Softmax
output shape: [1, 1001]
[0] Avg rt(ms): 76.29
[1] Avg rt(ms): 43.45
[2] Avg rt(ms): 37.30
[3] Avg rt(ms): 34.17
[4] Avg rt(ms): 32.26
[5] Avg rt(ms): 30.52
[6] Avg rt(ms): 30.34
[7] Avg rt(ms): 29.13
[8] Avg rt(ms): 23.56
[9] Avg rt(ms): 23.42
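To turn the raw scores into predictions, the output tensor can be decoded with results.as_numpy(...) and ranked. The sketch below substitutes a random array for a real response so it runs without a deployed service; the post-processing itself is illustrative, not part of the original script:

```python
# results.as_numpy("InceptionV3/Predictions/Softmax") would return a
# (1, 1001) float32 array; a random array stands in for it here.
import numpy as np

probs = np.random.rand(1, 1001).astype(np.float32)  # stand-in for as_numpy(...)
probs /= probs.sum(axis=1, keepdims=True)           # normalize like a softmax
top5 = np.argsort(probs[0])[-5:][::-1]              # class ids of the 5 best scores
print(top5.shape)  # (5,)
```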
Conclusion