---
sidebar_position: 8
---

# Distributed Inference of the DeepSeek Model on Raspberry Pi

## Introduction

This wiki explains how to deploy the [DeepSeek](https://github.com/deepseek-ai/DeepSeek-LLM) model across multiple Raspberry Pi AI Boxes with [distributed-llama](https://github.com/b4rtaz/distributed-llama). In this wiki, I used a **Raspberry Pi with 8GB of RAM** as the **root node** and **three Raspberry Pis with 4GB of RAM** as **worker nodes** to run the **DeepSeek 8B model**. The inference speed reached **6.06 tokens per second**.

## Prepare Hardware

<div class="table-center">
  <table align="center">
    <tr>
      <th>reComputer AI R2130</th>
    </tr>
    <tr>
      <td><div style={{textAlign:'center'}}><img src="https://media-cdn.seeedstudio.com/media/catalog/product/cache/bb49d3ec4ee05b6f018e93f896b8a25d/1/_/1_24_1.jpg" style={{width:600, height:'auto'}}/></div></td>
    </tr>
    <tr>
      <td><div class="get_one_now_container" style={{textAlign: 'center'}}>
        <a class="get_one_now_item" href="https://www.seeedstudio.com/reComputer-AI-R2130-12-p-6368.html">
        <strong><span><font color={'FFFFFF'} size={"4"}> Get One Now 🖱️</font></span></strong>
        </a>
      </div></td>
    </tr>
  </table>
</div>

## Prepare Software

### Update the system

Open a terminal with `Ctrl+Alt+T` and run the commands below. The first command sets the system clock from the `Date` header of an HTTP response, which helps `apt` work correctly on boards without a real-time clock:

```
sudo date -s "$(wget -qSO- --max-redirect=0 google.com 2>&1 | grep Date: | cut -d' ' -f5-8)Z"
sudo apt update
sudo apt full-upgrade
```

### Install distributed-llama on the root node and worker nodes

Open a terminal with `Ctrl+Alt+T` and run the commands below to install [distributed-llama](https://github.com/b4rtaz/distributed-llama.git):

```
git clone https://github.com/b4rtaz/distributed-llama.git
cd distributed-llama
make dllama
make dllama-api
```

### Run on your worker nodes

Then run the commands below on each worker node to start it:

```
cd distributed-llama
sudo nice -n -20 ./dllama worker --port 9998 --nthreads 4
```
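`nice -n -20` asks the scheduler to run the worker at the highest priority, which is why `sudo` is required: only root may set a negative niceness. As a quick illustration of how niceness is inherited by a child process, here is a sketch using a positive (unprivileged) value:

```shell
# Run a child shell at niceness 10 and print the niceness the kernel
# actually assigned it. Positive values need no root; -20 as used above does.
nice -n 10 sh -c 'ps -o ni= -p $$' | tr -d ' '
```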

### Run on your root node

#### Create and activate a Python virtual environment

```
cd distributed-llama
python -m venv .env
source .env/bin/activate
```
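If activation succeeded, `python` should now resolve inside the virtual environment rather than to the system interpreter. A self-contained check (using a throwaway `/tmp` path purely for illustration):

```shell
# Create and activate a scratch virtual environment, then confirm
# that `python` resolves to the interpreter inside it.
python3 -m venv /tmp/scratch-env
. /tmp/scratch-env/bin/activate
command -v python    # should print /tmp/scratch-env/bin/python
```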

#### Install the necessary libraries

```
pip install numpy==1.23.5
pip install torch==2.0.1
pip install safetensors==0.4.2
pip install sentencepiece==0.1.99
pip install transformers
```

#### Download the DeepSeek 8B Q40 model

```
mkdir model && cd model
git lfs install
git clone https://huggingface.co/b4rtaz/Llama-3_1-8B-Q40-Instruct-Distributed-Llama
```

#### Run distributed inference on the root node

> **Note:** `--workers 10.0.0.139:9998 10.0.0.175:9998 10.0.0.124:9998` lists the IP addresses and ports of the worker nodes; replace them with your own.

```
cd ..
./dllama chat --model ./model/dllama_model_deepseek-r1-distill-llama-8b_q40.m --tokenizer ./model/dllama_tokenizer_deepseek-r1-distill-llama-8b.t --buffer-float-type q80 --prompt "What is 5 plus 9 minus 3?" --nthreads 4 --max-seq-len 2048 --workers 10.0.0.139:9998 10.0.0.175:9998 10.0.0.124:9998 --steps 256
```

> **Note:** If you want to measure the inference speed, use the following command instead.

```
cd ..
./dllama inference --model ./model/dllama_model_deepseek-r1-distill-llama-8b_q40.m --tokenizer ./model/dllama_tokenizer_deepseek-r1-distill-llama-8b.t --buffer-float-type q80 --prompt "What is 5 plus 9 minus 3?" --nthreads 4 --max-seq-len 2048 --workers 10.0.0.139:9998 10.0.0.175:9998 10.0.0.124:9998 --steps 256
```
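The reported speed is simply the number of generated tokens divided by the wall-clock time. As a sketch with hypothetical numbers consistent with the result in this wiki (256 tokens in 42.25 seconds):

```shell
# tokens per second = generated tokens / elapsed seconds
# (the numbers below are illustrative, not measured output)
TOKENS=256
ELAPSED=42.25
awk -v t="$TOKENS" -v s="$ELAPSED" 'BEGIN { printf "%.2f tokens/s\n", t / s }'
# prints: 6.06 tokens/s
```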

## Result

The following shows inference with the [DeepSeek Llama 8B](https://huggingface.co/b4rtaz/Llama-3_1-8B-Q40-Instruct-Distributed-Llama) model running across four Raspberry Pis.

<div align="center">
  <img width={900}
   src="https://files.seeedstudio.com/wiki/distributed-inference/distributed_llama.gif" />
</div>