Marigold: Privately Hosted AI Inference on AWS

The major AI providers bundle two things that should be separate: the model and the infrastructure. When you call the OpenAI or Anthropic API, your data travels to their servers, runs against their model, and returns a result. The provider sits between you and the model. Whatever their policy says about logging, retention, or training, the architecture makes them a party to every exchange.

Open-weight models remove that party. Llama, Mistral, Qwen, and others publish their weights publicly. The model can run anywhere – on our infrastructure, on yours, or both. The provider is no longer in the room.

Marigold hosts these models on private AWS infrastructure in London. We consider ourselves a communication provider: the post office cannot read your mail, and neither should we. The inference runs, the result returns, nothing is retained. We do not train models. We collect usage metadata – model selection, request volumes, error rates – to improve the service. We do not collect content.

When your compliance requirements or scale demand it, the same Marigold interface runs on your own hardware. The application does not change. The model does not change. We simply step further out of the room.

A drop-in replacement for the major AI APIs

Marigold exposes the same interface as OpenAI and Anthropic. Existing application code that calls those APIs can point at Marigold instead without changes. The underlying models are open-weight equivalents running on private AWS infrastructure in London – not on shared cloud servers, not routed through a third-party API, not subject to a provider’s data retention policy.

(For a detailed account of the architecture and the case for private inference in regulated environments, see Private Inference: Running AI Inside Your Own Infrastructure.)

Infrastructure

The infrastructure runs on AWS in London within a private network boundary. GPU capacity handles larger models and high-throughput workloads. No request leaves that boundary.

If you are looking to establish your own boundaries contact us to discuss deployment on your own infrastructure.

Response encryption

Marigold accepts a public key with any inference request. When public_key is provided, the response is encrypted at the point of generation – before it leaves the inference boundary. Marigold never holds the corresponding private key and cannot read the response it has produced. The input travels in plaintext; the output exists only as ciphertext outside the inference boundary.

This creates a useful asymmetry for certain threat models. A future breach of server logs would expose inputs but not outputs. It also enables a specific multi-party pattern: Party A submits inputs to a prompt; the response is encrypted with a public key held by Party B. Party B receives output they can decrypt; Party A cannot read it. Marigold sees the input and produces the output but cannot reconstruct the exchange without both the input record and the private key – and the private key is never transmitted to Marigold. Party A can now signal to Party B “The output is message_id: xxx” available from the API.

This is not homomorphic encryption: the inference runs on plaintext. It is the closest currently practical approximation for response confidentiality at the infrastructure level. (The full context, including why fully homomorphic encryption is not yet viable for LLM inference, is in Why Private Inference Is Not Fully Private (Yet).)

Workflows and pipelines

Single model calls handle straightforward tasks. More complex automation requires composing multiple steps: embed a document, classify its content, generate a structured summary, evaluate the output. Marigold supports this through a typed workflow layer – each step conditions on the outputs of the previous one, and the whole pipeline is declared rather than hand-coded.

This is the same pattern described in the runfox workflow engine, which integrates directly with Marigold as its execution substrate.

Getting access

Marigold is available at marigold.run. API keys, documentation, and access requests are handled there. For organisations evaluating private inference infrastructure for a specific use case, get in touch.


Questions about this? Get in touch.