Commit 8500f5d

Merge pull request #2515 from kevincheng2/develop
[LLM] Support deploy LLM model
2 parents cd0ee79 + 10c6bde commit 8500f5d

37 files changed

Lines changed: 4459 additions & 3 deletions

.gitignore

Lines changed: 17 additions & 1 deletion
@@ -50,4 +50,20 @@ python/fastdeploy/code_version.py
 log.txt
 serving/build
 serving/build.encrypt
-serving/build.encrypt.auth
+serving/build.encrypt.auth
+output
+res
+tmp
+log
+nohup.out
+llm/server/__pycache__
+llm/server/data/__pycache__
+llm/server/engine/__pycache__
+llm/server/http_server/__pycache__
+llm/server/log/
+llm/client/build/
+llm/client/dist/
+llm/client/fastdeploy_client.egg-info/
+llm/client/fastdeploy_client/tests/log/
+*.pyc
+*.log

.pre-commit-config.yaml

Lines changed: 2 additions & 2 deletions
@@ -1,6 +1,6 @@
 repos:
 - repo: https://github.com/pre-commit/pre-commit-hooks
-  rev: a11d9314b22d8f8c7556443875b731ef05965464
+  rev: ed714747d7acbc5790b171702bb012af3b0fe145
   hooks:
   - id: check-merge-conflict
   - id: check-symlinks
@@ -9,8 +9,8 @@ repos:
   - id: detect-private-key
   - id: check-symlinks
   - id: check-added-large-files
-- repo: local
 
+- repo: local
   hooks:
   - id: copyright_checker
     name: copyright_checker

llm/.dockerignore

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+README.md
+requirements-dev.txt
+pyproject.toml
+Makefile
+
+dockerfiles/
+docs/
+server/__pycache__
+server/http_server
+server/engine
+server/data

llm/README.md

Lines changed: 39 additions & 0 deletions
<h1 align="center"><b><em>FastDeploy: PaddlePaddle's High-Performance Large Model Deployment Tool</em></b></h1>

*FastDeploy is a solution built on NVIDIA's Triton framework and designed for serving large models in server scenarios. It provides service interfaces over both gRPC and HTTP, with streaming token output. The underlying inference engine supports acceleration strategies such as continuous batching, weight-only int8, and post-training quantization (PTQ), delivering an easy-to-use, high-performance deployment experience.*

# Quick Start

This section deploys from the prebuilt image, using Meta-Llama-3-8B-Instruct-A8W8C8 as an example. For more models, see [LLaMA](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/docs/predict/llama.md), [Qwen](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/docs/predict/qwen.md), and [Mixtral](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/docs/predict/mixtral.md); for more detailed model inference and quantization tutorials, see the [large model inference tutorial](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/docs/predict/inference.md).

```
# Download the model
wget https://paddle-qa.bj.bcebos.com/inference_model/Meta-Llama-3-8B-Instruct-A8W8C8.tar
mkdir Llama-3-8B-A8W8C8 && tar -xf Meta-Llama-3-8B-Instruct-A8W8C8.tar -C Llama-3-8B-A8W8C8

# Mount the model files
export MODEL_PATH=${PWD}/Llama-3-8B-A8W8C8

docker run --gpus all --shm-size 5G --network=host \
    -v ${MODEL_PATH}:/models/ \
    -dit registry.baidubce.com/paddlepaddle/fastdeploy:llm-serving-cuda123-cudnn9-v1.0 \
    bash -c 'export USE_CACHE_KV_INT8=1 && cd /opt/output/Serving && bash start_server.sh; exec bash'
```
Wait for the service to start (the first startup takes about 40 seconds), then test it with:

```
curl 127.0.0.1:9965/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d '{"text": "hello, llm"}'
```
Note:
1. Make sure shm-size is at least 5 GB, otherwise the service may fail to start.
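For a quick check without curl, the same request can be sent from Python's standard library. This is a minimal sketch assuming the quick-start server above is listening on 127.0.0.1:9965:

```python
# Minimal sketch of the curl test above, using only the standard library.
# Assumes the quick-start container is serving on 127.0.0.1:9965.
import json
import urllib.request

req = urllib.request.Request(
    "http://127.0.0.1:9965/v1/chat/completions",
    data=json.dumps({"text": "hello, llm"}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req, timeout=300) as resp:
    print(resp.read().decode("utf-8"))
```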
For more on how to use FastDeploy, see the [serving deployment tutorial](https://github.com/PaddlePaddle/FastDeploy/blob/develop/llm/docs/FastDeploy_usage_tutorial.md).

# License

FastDeploy is licensed under the [Apache-2.0 open-source license](https://github.com/PaddlePaddle/FastDeploy/blob/develop/LICENSE).

llm/client/README.md

Lines changed: 110 additions & 0 deletions
# Client Usage

## Introduction

The FastDeploy client provides a command-line interface and a Python interface for quickly calling LLM services deployed with the FastDeploy backend.

## Installation

Install from source:

```
pip install .
```

## Command-Line Interface

First set the model service mode, service URL, and model ID via environment variables, then call the model service from the command line.

| Parameter | Description | Required | Default |
| --- | --- | --- | --- |
| FASTDEPLOY_MODEL_URL | IP address and port of the deployed model service, in the form `x.x.x.x:xxx` | | |

```
export FASTDEPLOY_MODEL_URL="x.x.x.x:xxx"

# Streaming interface
fdclient stream_generate "你好?"

# Non-streaming interface
fdclient generate "你好,你是谁?"
```
## Python Interface

First set the model service URL (hostname and port) in Python code, then call the model service through the Python interface.

| Parameter | Description | Required | Default |
| --- | --- | --- | --- |
| hostname + port | IP address and port of the deployed model service; the hostname takes the form `x.x.x.x` | | |

```
from fastdeploy_client.chatbot import ChatBot

hostname = "x.x.x.x"
port = xxx

# Streaming interface; see below for the stream_generate API parameters
chatbot = ChatBot(hostname=hostname, port=port)
stream_result = chatbot.stream_generate("你好", topp=0.8)
for res in stream_result:
    print(res)

# Non-streaming interface; see below for the generate API parameters
chatbot = ChatBot(hostname=hostname, port=port)
result = chatbot.generate("你好", topp=0.8)
print(result)
```
### API Description

```
ChatBot.stream_generate(message,
                        max_dec_len=1024,
                        min_dec_len=2,
                        topp=0.0,
                        temperature=1.0,
                        frequency_score=0.0,
                        penalty_score=1.0,
                        presence_score=0.0,
                        eos_token_ids=254186)

# This function returns an iterator whose elements are dicts, e.g. {"token": "好的", "is_end": 0},
# where token is the generated text piece and is_end indicates whether it is the last one (0 = no, 1 = yes).
# Note: if generation fails, an error message is returned; eos_token_ids differs between models.
```

```
ChatBot.generate(message,
                 max_dec_len=1024,
                 min_dec_len=2,
                 topp=0.0,
                 temperature=1.0,
                 frequency_score=0.0,
                 penalty_score=1.0,
                 presence_score=0.0,
                 eos_token_ids=254186)

# This function returns a dict, e.g. {"results": "好的,我知道了。"}, where results is the generated text.
# Note: if generation fails, an error message is returned; eos_token_ids differs between models.
```
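To make the iterator contract concrete, here is a hypothetical sketch that assembles a complete reply from stream_generate, assuming each element is a dict shaped like {"token": ..., "is_end": ...} as described above (the hostname and port are placeholders):

```python
# Sketch only: collect streamed tokens into one reply.
# Assumes each element is {"token": str, "is_end": 0 or 1} as documented above;
# the hostname and port below are placeholders for a running service.
from fastdeploy_client.chatbot import ChatBot

chatbot = ChatBot(hostname="x.x.x.x", port=8000)  # placeholder address
pieces = []
for res in chatbot.stream_generate("你好", topp=0.8):
    if not isinstance(res, dict) or "token" not in res:
        print("generation failed:", res)  # error payloads carry a message instead
        break
    pieces.append(res["token"])
    if res["is_end"] == 1:  # 1 marks the final token
        break
print("".join(pieces))
```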
### Parameter Description

| Field | Type | Description | Required | Default | Notes |
| :---: | :-----: | :---: | :---: | :-----: | :----: |
| req_id | str | Request ID used to identify a request; setting a unique req_id is recommended | | random id | If the inference service holds two requests with the same req_id at once, a duplicate-req_id error is returned |
| text | str | Request text | | | |
| max_dec_len | int | Maximum number of tokens to generate; if the request's text token length plus max_dec_len exceeds the model's max_seq_len, a length-exceeded error is returned | | max_seq_len minus the text token length | |
| min_dec_len | int | Minimum number of tokens to generate; the minimum is 1 | | 1 | |
| topp | float | Randomness control; larger values mean more randomness; range 0~1 | | 0.7 | |
| temperature | float | Randomness control; smaller values mean more randomness; must be greater than 0 | | 0.95 | |
| frequency_score | float | Frequency score | | 0 | |
| penalty_score | float | Penalty score | | 1 | |
| presence_score | float | Presence score | | 0 | |
| stream | bool | Whether to return results as a stream | | False | |
| return_all_tokens | bool | Whether to return all tokens at once | | False | See the notes after the table for how this differs from stream |
| timeout | int | Request timeout in seconds | | 300 | |

* With the PUSH_MODE_HTTP_PORT field correctly configured, the service accepts both GRPC and HTTP requests
* The stream parameter only takes effect for HTTP requests
* The return_all_tokens parameter applies to both GRPC and HTTP requests
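As an illustration of the request fields above, here is a hypothetical HTTP sketch that reuses the quick-start endpoint (127.0.0.1:9965/v1/chat/completions); the endpoint and values are assumptions, and field handling follows the table above rather than a verified API contract:

```python
# Hypothetical sketch: submitting the table's request fields as JSON over HTTP.
# The endpoint is borrowed from the quick-start curl example.
import json
import urllib.request

payload = {
    "req_id": "demo-0001",      # recommended: unique per request
    "text": "hello, llm",
    "max_dec_len": 128,
    "topp": 0.7,
    "temperature": 0.95,
    "return_all_tokens": True,  # return the whole result at once
}
req = urllib.request.Request(
    "http://127.0.0.1:9965/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req, timeout=300) as resp:
    print(resp.read().decode("utf-8"))
```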
Lines changed: 20 additions & 0 deletions
# Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import logging
import sys

__version__ = "4.4.0"

logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
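Worth noting: because this module calls logging.basicConfig at import time, any logger created afterwards in the same process emits DEBUG-level records to stdout. A minimal standard-library illustration:

```python
import logging
import sys

# Mirrors the package's import-time configuration.
logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)

log = logging.getLogger("demo")
log.debug("visible on stdout, because the root logger level is DEBUG")
```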
