gpubox.ai

Models

Real names. Real GPUs. No surprises.

GPUBox tells you exactly which model serves your request. No opaque endpoints, no silent swaps, no "mystery model" pricing. You name the model in your code; we serve that model.

Chat / LLM

live · Apache 2.0 · AWQ-int4

qwen2.5-32b-instruct

Qwen2.5-32B-Instruct from Alibaba — a strong general-purpose LLM in the 32B parameter class. Reliable function calling, decent reasoning, fast on consumer-grade hardware via 4-bit quantisation.

Context

8,192 tokens

Hardware

RTX 5090

Endpoint

/v1/chat/completions

Capabilities

  • Chat completions (OpenAI-compatible)
  • Streaming SSE
  • Tool / function calling
  • JSON mode (response_format)
  • Multilingual: English, Chinese, Spanish, French, German, etc.
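Because the endpoint speaks the OpenAI wire format, any OpenAI-compatible client works by pointing it at GPUBox. A minimal stdlib sketch of the request (the base URL and API key below are placeholders, not documented values):

```python
import json
import urllib.request

BASE_URL = "https://api.gpubox.ai"  # placeholder -- use your real endpoint
API_KEY = "sk-..."                  # placeholder

def build_chat_request(messages, stream=False, json_mode=False, tools=None):
    """Assemble an OpenAI-compatible chat.completions request body."""
    payload = {"model": "qwen2.5-32b-instruct", "messages": messages}
    if stream:
        payload["stream"] = True          # server responds with SSE chunks
    if json_mode:
        payload["response_format"] = {"type": "json_object"}
    if tools:
        payload["tools"] = tools          # OpenAI-style tool/function specs
    return payload

def chat(messages, **opts):
    """POST the request and return the parsed JSON response."""
    req = urllib.request.Request(
        BASE_URL + "/v1/chat/completions",
        data=json.dumps(build_chat_request(messages, **opts)).encode(),
        headers={"Authorization": "Bearer " + API_KEY,
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

The same request body works with the official `openai` SDK: set `base_url` to the GPUBox endpoint and pass `model="qwen2.5-32b-instruct"`.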

Speech-to-text

live · MIT · fp16

whisper-large-v3-turbo

OpenAI's Whisper large-v3-turbo via faster-whisper. Real-time factor ~0.3 on the RTX 5090: a 60-second clip transcribes in roughly 18 seconds.

Context

30-second windows

Hardware

RTX 5090

Endpoint

/v1/audio/transcriptions

Capabilities

  • OpenAI-compatible /v1/audio/transcriptions
  • Multipart upload (file + model + optional language/prompt/temperature)
  • 100+ languages with auto-detection
  • verbose_json response with segment-level timestamps and confidence
  • Voice-activity detection (VAD) filter
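The transcription endpoint takes a multipart form: the audio file plus a handful of optional text fields. A sketch, assuming the third-party `requests` library for the multipart upload (base URL and key are placeholders):

```python
BASE_URL = "https://api.gpubox.ai"  # placeholder -- use your real endpoint
API_KEY = "sk-..."                  # placeholder

def transcription_fields(language=None, prompt=None, temperature=None):
    """Optional form fields sent alongside the audio file."""
    fields = {"model": "whisper-large-v3-turbo",
              "response_format": "verbose_json"}  # segment-level timestamps
    if language is not None:
        fields["language"] = language  # omit to let the server auto-detect
    if prompt is not None:
        fields["prompt"] = prompt      # bias decoding toward domain terms
    if temperature is not None:
        fields["temperature"] = str(temperature)
    return fields

def transcribe(audio_path, **opts):
    """Multipart POST to /v1/audio/transcriptions."""
    import requests  # third-party; pip install requests
    with open(audio_path, "rb") as f:
        r = requests.post(
            BASE_URL + "/v1/audio/transcriptions",
            headers={"Authorization": "Bearer " + API_KEY},
            files={"file": f},
            data=transcription_fields(**opts),
        )
    r.raise_for_status()
    return r.json()
```

With `response_format="verbose_json"`, the reply includes per-segment start/end timestamps; pass `"text"` instead if you only want the plain transcript.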

Embeddings

live · MIT · fp16

bge-m3

BAAI BGE-M3 — strong multilingual embeddings with 8k context. Returns L2-normalised 1,024-dimensional dense vectors in the OpenAI embeddings response format.

Context

8,192 tokens

Hardware

Ryzen 9 9950X (CPU)

Endpoint

/v1/embeddings

Capabilities

  • OpenAI-compatible /v1/embeddings
  • Multilingual (100+ languages)
  • 1,024-dimensional dense vectors
  • L2-normalised by default
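Because the vectors come back unit-length, ranking by cosine similarity reduces to a plain dot product — no norms to compute at query time. A stdlib sketch against the OpenAI-shaped endpoint (base URL and key are placeholders):

```python
import json
import urllib.request

BASE_URL = "https://api.gpubox.ai"  # placeholder -- use your real endpoint
API_KEY = "sk-..."                  # placeholder

def embed(texts):
    """POST /v1/embeddings; returns one 1,024-d vector per input string."""
    body = json.dumps({"model": "bge-m3", "input": texts}).encode()
    req = urllib.request.Request(
        BASE_URL + "/v1/embeddings",
        data=body,
        headers={"Authorization": "Bearer " + API_KEY,
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return [item["embedding"] for item in data["data"]]

def cosine(a, b):
    """For L2-normalised vectors, cosine similarity is just the dot product."""
    return sum(x * y for x, y in zip(a, b))
```

For a retrieval use case: embed your corpus once, embed each query, and sort documents by `cosine(query_vec, doc_vec)` descending.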

Want a model we don't serve yet?

We add open-weight models on customer demand. Tell us what you need.

hello@gpubox.ai