Models

Real names. Real GPUs. No surprises.

GPUBox tells you exactly which model serves your request. No opaque endpoints, no silent swaps, no "mystery model" pricing. You name the model in your code; we serve that model.

Chat / LLM

liveApache 2.0AWQ-int4

qwen2.5-32b-instruct

Qwen2.5-32B-Instruct from Alibaba — strong general-purpose LLM at the 32B parameter class. Reliable function calling, decent reasoning, fast on consumer-grade hardware via 4-bit quantisation.

Context

8,192 tokens

Hardware

RTX 5090

Endpoint

/v1/chat/completions

Capabilities

Chat completions (OpenAI-compatible)
Streaming SSE
Tool / function calling
JSON mode (response_format)
Multilingual: English, Chinese, Spanish, French, German, etc.

Speech-to-text

liveMITfp16

whisper-large-v3-turbo

OpenAI's Whisper large-v3-turbo via faster-whisper. Real-time-factor ~0.3 on the 5090: a 60-second clip transcribes in roughly 18 seconds.

Context

30 second windows

Hardware

RTX 5090

Endpoint

/v1/audio/transcriptions

Capabilities

OpenAI-compatible /v1/audio/transcriptions
Multipart upload (file + model + optional language/prompt/temperature)
100+ languages with auto-detection
verbose_json response with segment-level timestamps and confidence
Voice-activity detection (VAD) filter

Embeddings

liveMITfp16

bge-m3

BAAI BGE-M3 — strong multilingual embeddings, 8k context. Returns L2-normalised 1024-d dense vectors via the OpenAI embeddings shape.

Context

8,192 tokens

Hardware

Ryzen 9 9950X (CPU)

Endpoint

/v1/embeddings

Capabilities

OpenAI-compatible /v1/embeddings
Multilingual (100+ languages)
1,024-dimensional dense vectors
L2-normalised by default

Want a model we don't serve yet?

We add open-weight models on customer demand. Tell us what you need.

hello@gpubox.ai

Ready to use one? Read the quickstart →