Models
Real names. Real GPUs. No surprises.
GPUBox tells you exactly which model serves your request. No opaque endpoints, no silent swaps, no "mystery model" pricing. You name the model in your code; we serve that model.
Chat / LLM
liveApache 2.0AWQ-int4qwen2.5-32b-instruct
Qwen2.5-32B-Instruct from Alibaba — strong general-purpose LLM at the 32B parameter class. Reliable function calling, decent reasoning, fast on consumer-grade hardware via 4-bit quantisation.
Context
8,192 tokens
Hardware
RTX 5090
Endpoint
/v1/chat/completions
Capabilities
- Chat completions (OpenAI-compatible)
- Streaming SSE
- Tool / function calling
- JSON mode (response_format)
- Multilingual: English, Chinese, Spanish, French, German, etc.
Reasoning LLM
liveApache 2.0fp16qwq-32b
Qwen QwQ-32B-Preview — a reasoning model that thinks out loud before answering. Replies INCLUDE its working-out as part of the content (no separate reasoning channel in this Preview release), so expect verbose, transparent answers. Pick this when you want to audit how the model reached its conclusion; pick Qwen2.5-32B-Instruct when you want a tight final answer only.
Context
32,768 tokens
Hardware
RTX PRO 6000 (Blackwell, 96GB)
Endpoint
/v1/chat/completions
Capabilities
- Chat completions (OpenAI-compatible)
- Streaming SSE
- Step-by-step reasoning visible inline in the response
- Strong on maths, code review, multi-step analysis
- Multilingual: English, Chinese
Vision / Multimodal LLM
liveApache 2.0AWQ (awq_marlin)qwen2.5-vl-7b-instruct
Qwen2.5-VL-7B-Instruct from Alibaba, a vision-language model. Send images inline as image_url (a base64 data-URI or an https URL) alongside your text prompt, on the same OpenAI-compatible /v1/chat/completions endpoint. Strong at screenshot/UI analysis, OCR, reading charts and document images, and general visual Q&A. Select model qwen2.5-vl-7b-instruct and add image_url parts to the message content.
Context
8,192 tokens
Hardware
RTX 5090
Endpoint
/v1/chat/completions
Capabilities
- Image understanding and description
- OCR (reading text in images)
- Visual question answering
- Screenshot / UI analysis
- Chart and document-image reading
- Chat completions (OpenAI-compatible)
- Streaming SSE
- Multimodal content (text + image_url)
- JSON mode (response_format)
Speech-to-text
liveMITfp16whisper-large-v3-turbo
OpenAI's Whisper large-v3-turbo via faster-whisper. Real-time-factor ~0.3 on the 5090: a 60-second clip transcribes in roughly 18 seconds.
Context
30 second windows
Hardware
RTX 5090
Endpoint
/v1/audio/transcriptions
Capabilities
- OpenAI-compatible /v1/audio/transcriptions
- Multipart upload (file + model + optional language/prompt/temperature)
- 100+ languages with auto-detection
- verbose_json response with segment-level timestamps and confidence
- Voice-activity detection (VAD) filter
Embeddings
liveMITfp16bge-m3
BAAI BGE-M3 — strong multilingual embeddings, 8k context. Returns L2-normalised 1024-d dense vectors via the OpenAI embeddings shape.
Context
8,192 tokens
Hardware
Ryzen 9 9950X (CPU)
Endpoint
/v1/embeddings
Capabilities
- OpenAI-compatible /v1/embeddings
- Multilingual (100+ languages)
- 1,024-dimensional dense vectors
- L2-normalised by default
Want a model we don't serve yet?
We add open-weight models on customer demand. Tell us what you need.
hello@gpubox.ai