Show HN: Slashing LLM Costs for Overnight Batch Inference

2 points by Blue_Cosma 4 hours ago

Hey HN,

If you've tried running open-source models like Llama 3.1 70B or 405B yourself, you might have noticed that it gets very expensive. The reasons look obvious enough that you might have stopped before even trying:

- GPUs are very expensive to buy or rent
- Running the best-performing LLMs takes 4, 8, or even 16 top-of-the-line Nvidia GPUs
- Even then, you won't have anywhere near the VRAM needed to batch enough requests for decent throughput and efficiency

Some have even questioned whether open-source LLM providers are pulling some shenanigans to offer the prices they do. VC-funded bait-and-switch? Undisclosed quantization? Even the best-funded LLM inference startups, with the best inference optimization teams in the world, have gotten into controversy over this.

At EXXA, we wanted to make the best open-source LLMs affordable, in all their FP16 glory. And I don't know about others, but we're a bootstrapped team of 3, so subsidizing prices isn't an option :D

I won't tell you we found the magical solution for all use cases… but we found one for overnight batch jobs! Think things like:

- Synthetic data generation
- Data pre-processing (e.g. contextual retrieval for RAG improvements, knowledge graph creation)
- LLM-as-a-judge evaluation

Why overnight? Because it gives us time to:

- Get GPUs at a high discount (30-90%), since they would otherwise sit idle in cloud providers' data centers
- Heavily optimize inference for maximum throughput instead of minimum latency

Today, our batch inference API is live for Llama 3.1 8B & 70B in FP16, with results delivered within 24 hours.
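To make the workflow concrete, here is a minimal sketch of what an overnight batch submission could look like, in Python. Everything specific in it is an assumption for illustration: the base URL, endpoint paths, field names, and model identifier are hypothetical, not taken from our docs. The point is just the shape of the job: write requests as JSONL, upload, and come back for the results within the 24h window.

    # Hypothetical sketch only: base URL, paths, and field names are assumptions,
    # not the documented API. Shown: JSONL requests -> upload -> create batch.
    import json
    import requests

    API_BASE = "https://api.withexxa.com/v1"  # assumed base URL
    HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

    # 1. One JSON request per line (JSONL), e.g. an LLM-as-a-judge evaluation.
    qa_pairs = [("What is 2+2?", "4"), ("Capital of France?", "Lyon")]
    with open("judge_batch.jsonl", "w") as f:
        for i, (question, answer) in enumerate(qa_pairs):
            request_body = {
                "model": "llama-3.1-70b-instruct-fp16",  # hypothetical model id
                "messages": [
                    {"role": "system", "content": "Rate the answer from 1 to 5."},
                    {"role": "user", "content": f"Q: {question}\nA: {answer}"},
                ],
            }
            f.write(json.dumps({"custom_id": f"req-{i}", "body": request_body}) + "\n")

    # 2. Upload the file and create a batch with a 24h completion window.
    with open("judge_batch.jsonl", "rb") as f:
        upload = requests.post(f"{API_BASE}/files", headers=HEADERS, files={"file": f})
    batch = requests.post(
        f"{API_BASE}/batches",
        headers=HEADERS,
        json={"input_file_id": upload.json()["id"], "completion_window": "24h"},
    )

    # 3. Check the batch id in the morning and download results once completed.
    print(batch.json()["id"])

Latency stops mattering in this setup: you queue everything in the evening and consume the results file the next morning.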

We offer the lowest price per token on the market!

60% cheaper than Fireworks, 40% cheaper than DeepInfra, with no hard rate limits and prompt caching available.
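Prompt caching matters a lot for batch workloads, since many requests tend to share a long, identical context. A minimal sketch, assuming prefix-based caching (the common implementation), of how to structure prompts so the cache actually gets hit:

    # Assumption: caching is prefix-based, so the long shared context must be
    # byte-identical and come first; only the short tail varies per request.
    system_prompt = "You are a careful annotator. Label each chunk's topic."
    shared_context = "...long guidelines or document shared by every request..."
    chunks = ["first text chunk", "second text chunk"]

    prompts = [f"{system_prompt}\n\n{shared_context}\n\nChunk: {c}" for c in chunks]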

Try it now at https://withexxa.com

------

If you have any specific questions, you can email us at founders@withexxa.com. If you want to generate a large number of tokens with custom LLM models, let us know: we can host them and offer the same price range as for Llama 3.1 8B & 70B. If you want to reduce inference costs for image or video generation, we are actively looking into this with potential users.

What do you think of our approach? Are you willing to wait overnight for super cheap tokens?