Host a Hermes Model
Run a self-hosted Hermes LLM on a chat.dev GPU agent and expose an OpenAI-compatible API endpoint that anyone can query.
Prerequisites#
- Expert subscription (required for the A100 80GB GPU tier)
- SOL deposited in Settings > Funds for compute runway
Create a GPU agent#
- Click + New Agent and name it hermes-server.
- On the create form:
  - choose the GPU machine tier
  - give it extra disk (the model weights are large: 50 GB+ for bigger variants)
  - optionally add wallet funding and turn on Agent pays for itself
Build and serve#
Give the agent a deployment prompt:
Set up a Hermes model server on this GPU machine.
Requirements:
- install vLLM (or llama.cpp with CUDA support)
- download NousResearch/Hermes-3-Llama-3.1-8B from HuggingFace
(or Hermes-3-Llama-3.1-70B if you want the larger model)
- serve it on port 8000 with an OpenAI-compatible API
- enable streaming responses
- leave the server running
Use the GPU for inference. Print the model name, context length, and
endpoint URL to confirm everything is working.
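For orientation, the command the agent ends up running will probably resemble a `vllm serve` invocation. The sketch below only assembles such a command line. The model and port come from the prompt above, while the helper name `build_vllm_command` and the `8192` context cap are illustrative assumptions. The agent may equally choose llama.cpp or different flags.

```python
import shlex


def build_vllm_command(model, port=8000, max_model_len=None, quantization=None):
    """Assemble a `vllm serve` command line (hypothetical helper, for illustration)."""
    cmd = ["vllm", "serve", model, "--host", "0.0.0.0", "--port", str(port)]
    if max_model_len is not None:
        # Caps the maximum context length the server will accept.
        cmd += ["--max-model-len", str(max_model_len)]
    if quantization is not None:
        # Assumes the checkpoint is already quantized in this format ("awq", "gptq").
        cmd += ["--quantization", quantization]
    return cmd


if __name__ == "__main__":
    # 8192 is an arbitrary example value, not a recommendation.
    cmd = build_vllm_command("NousResearch/Hermes-3-Llama-3.1-8B", max_model_len=8192)
    print(shlex.join(cmd))
```

vLLM's OpenAI-compatible server honors `"stream": true` on a per-request basis, so the streaming requirement in the prompt should not need an extra launch flag.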
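For the 70B model, quantization is effectively required on a single A100: at 16-bit precision the weights alone need roughly 140 GB, well over the 80 GB available. Add this to the prompt:
Use AWQ or GPTQ quantization so the 70B model fits in the A100's 80GB VRAM.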
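The arithmetic behind that instruction can be sketched as follows. It assumes roughly 70 billion parameters and counts only the weights. KV cache, activations, and framework overhead come on top, so the figures are lower bounds, not a guarantee that a configuration will run.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


PARAMS_70B = 70e9   # assumption: ~70 billion parameters
A100_VRAM_GB = 80

for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit (typical AWQ/GPTQ)", 4)]:
    gb = weight_memory_gb(PARAMS_70B, bits)
    verdict = "weights fit" if gb < A100_VRAM_GB else "weights exceed VRAM"
    print(f"{label}: ~{gb:.0f} GB, {verdict} ({A100_VRAM_GB} GB available)")
```

At 8 bits the weights nominally fit but leave little headroom for the KV cache, which is why 4-bit AWQ or GPTQ checkpoints are the more comfortable choice on one 80 GB card.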
Expose the endpoint#
- Open Settings > Exposed Ports and expose port 8000.
- Test the API from anywhere:
curl https://hermes-server-abc12345-8000.ports.chat.dev/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "NousResearch/Hermes-3-Llama-3.1-8B",
    "messages": [{"role": "user", "content": "Explain zero-knowledge proofs simply."}],
    "stream": true
  }'
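Because the request sets `"stream": true`, the reply arrives as server-sent events instead of a single JSON body. Here is a minimal Python sketch of a client that reassembles the text, assuming the server follows the usual OpenAI-style chunk format (`data: {...}` lines terminated by `data: [DONE]`). The helper names are hypothetical.

```python
import json
import urllib.request


def iter_text_deltas(lines):
    """Yield content fragments from OpenAI-style streaming lines (str or bytes)."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not a data event
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


def stream_chat(base_url, model, prompt):
    """POST a streaming chat request and print text as it arrives."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # Iterating an HTTP response yields it line by line as bytes.
        for text in iter_text_deltas(resp):
            print(text, end="", flush=True)
    print()
```

To try it against the example endpoint above, call `stream_chat("https://hermes-server-abc12345-8000.ports.chat.dev", "NousResearch/Hermes-3-Llama-3.1-8B", "Explain zero-knowledge proofs simply.")`, substituting your own agent's URL.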
- If you want a cleaner URL, attach a custom domain like hermes.yourdomain.com.
Add a frontend#
Give the agent a follow-up prompt:
Build a simple chat UI on port 3000 that talks to the Hermes API on port 8000.
Include a system prompt editor, temperature slider, and conversation history.
Keep both servers running.
Expose port 3000 for the chat UI alongside the raw API on port 8000.
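The three UI controls map directly onto the chat completions request. As a rough sketch (the function name and the 0.7 default are hypothetical, and the UI the agent builds may structure this differently), the system prompt becomes the first message, the conversation history follows, and the temperature is a top-level field:

```python
def build_chat_request(model, system_prompt, history, user_message, temperature=0.7):
    """Build an OpenAI-style chat completions payload (illustrative helper)."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.extend(history)  # prior {"role": ..., "content": ...} turns, oldest first
    messages.append({"role": "user", "content": user_message})
    return {
        "model": model,
        "messages": messages,
        "temperature": temperature,
        "stream": True,
    }


history = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello! How can I help?"},
]
payload = build_chat_request(
    "NousResearch/Hermes-3-Llama-3.1-8B",
    "You are a concise assistant.",
    history,
    "Explain zero-knowledge proofs simply.",
    temperature=0.2,
)
print([m["role"] for m in payload["messages"]])
# -> ['system', 'user', 'assistant', 'user']
```

Since the API is stateless, the full history has to be resent with every request. That is why the UI needs to keep the conversation on its side.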
Keep it running#
The agent's VM persists between tasks. The model stays loaded in GPU memory as long as the agent is running. If the agent stops and restarts, tell it to relaunch the server:
Start the Hermes vLLM server again on port 8000 with the same configuration.
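After a restart it helps to confirm the server is back before sending traffic. One way to do that, sketched here under the assumption that the server implements the standard OpenAI-style `GET /v1/models` listing, is to ask which models it is serving. `served_models` and its injectable `fetch` argument are hypothetical conveniences.

```python
import json
import urllib.request


def _http_get(url, timeout=5):
    """Fetch a URL and return the raw response body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def served_models(base_url, fetch=_http_get):
    """Return the model IDs the server reports, or [] if it is unreachable."""
    try:
        body = fetch(f"{base_url}/v1/models")
    except OSError:
        # urllib.error.URLError subclasses OSError, so this also covers
        # refused connections, timeouts, and HTTP error statuses.
        return []
    data = json.loads(body)
    return [m["id"] for m in data.get("data", [])]
```

An empty list from `served_models("https://hermes-server-abc12345-8000.ports.chat.dev")` suggests the server is not up yet, which is the cue to send the relaunch prompt above.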
When to use this#
This is the right approach when you want a private, self-hosted LLM with full control over the model, system prompt, and inference parameters — without managing GPU infrastructure yourself.