# LLMOps Operations Competence Center Switzerland

> LLM operations consulting in Switzerland. VSHN deploys and operates vLLM, LiteLLM, and llm-d on Kubernetes. Swiss data residency, ISO 27001 certified.


VSHN has been the DevOps company since 2014 - now bringing that same operational expertise to LLM and AI workloads. We offer consulting and solution design for Swiss enterprises deploying Large Language Models: architecture review, inference stack selection (vLLM, LiteLLM, llm-d), and production operations on Kubernetes, OpenShift, or sovereign cloud infrastructure. No off-the-shelf product - every engagement starts with your requirements and results in a solution built for your use case.


## Pages

- [Homepage](https://www.llmops.ch/): LLMOps Consulting Switzerland – LLM Infrastructure | VSHN
- [Partner with VSHN on LLMOps | VSHN](https://www.llmops.ch/partners.md)
- [LLMOps Sovereignty — CLOUD Act-Free AI | VSHN](https://www.llmops.ch/sovereignty.md)

## Features

- **Architecture Review and Stack Selection**: VSHN engineers assess your inference requirements, model sizes, and compliance constraints, then recommend the right serving stack. We work with [vLLM](https://www.vllm.ch) for high-throughput single-node inference, [LiteLLM](https://www.litellm.ch) for multi-provider API gateways, and [llm-d](https://www.llm-d.ch) for distributed disaggregated serving. Each project starts with a written architecture document covering GPU sizing, cluster layout, and cost projections.

- **Inference Deployment on Kubernetes**: VSHN deploys and operates your chosen inference stack on production Kubernetes clusters. We configure GPU scheduling with NVIDIA device plugins, resource quotas, horizontal pod autoscaling based on queue depth, and rolling model updates. Deployments run on APPUiO, Red Hat OpenShift, or your existing Kubernetes platform - on-premises or on Swiss cloud providers like Exoscale and Cloudscale.

- **Multi-Model Routing and API Management**: Route requests across multiple models and providers through a single API endpoint. VSHN deploys LiteLLM-based gateways with per-team budget controls, rate limiting, and failover routing. Your applications get one stable endpoint while VSHN handles provider configuration, API key rotation, and cost tracking across commercial and self-hosted models.

- **Sovereign Cloud and Swiss Data Residency**: Run LLM workloads on sovereign cloud partners that guarantee full data sovereignty and regulatory compliance. VSHN operates across Swiss and European sovereign cloud providers, so your models, prompts, and training data never leave trusted jurisdictions, which is critical for financial services, healthcare, and government use cases. Open-source models with full training transparency, such as Apertus, additionally help meet the EU AI Act Art. 53 transparency obligations for general-purpose AI models. Learn more in our [sovereignty assessment](/sovereignty/).

- **Observability and Cost Control**: Monitor latency, throughput, token usage, and infrastructure costs across your entire LLM fleet. VSHN integrates Prometheus, Grafana, and custom dashboards into your platform so you always know what your models cost to run, where bottlenecks are, and when to scale up or down. Alerting rules notify your team and our 24/7 operations center when metrics breach thresholds.

- **Vector Database and RAG Pipelines**: Build retrieval-augmented generation pipelines with managed vector stores running alongside VSHN's Application Catalog databases. PostgreSQL with pgvector, dedicated search indices, and automated backups - all operated on Swiss infrastructure with the same SLA guarantees as your other VSHN-operated databases. VSHN handles the infrastructure so your data team can focus on retrieval quality.

- **vLLM: High-Throughput Inference**: [vLLM](https://www.vllm.ch/) uses PagedAttention to serve open-source models like Llama, Mistral, and Apertus with up to 23x higher throughput than naive implementations. VSHN deploys and operates vLLM on GPU-equipped Kubernetes clusters in Swiss data centers, with autoscaling and monitoring built in.

- **LiteLLM: Unified AI Gateway**: [LiteLLM](https://www.litellm.ch/) routes requests to 100+ LLM providers through one API endpoint. VSHN configures per-team budget controls, rate limiting, failover routing, and audit logging - so your teams get a single stable API while you maintain cost visibility and Swiss data residency.

- **llm-d: Distributed Inference**: [llm-d](https://www.llm-d.ch/) disaggregates prefill and decode phases across multiple GPUs for large-scale serving. VSHN architects and operates llm-d deployments on Kubernetes, with scheduling optimised for latency-sensitive and throughput-heavy workloads.


## LLMOps Consulting FAQ

### What LLM platforms and technologies does VSHN work with?

VSHN deploys and operates LLM workloads on APPUiO (our managed Kubernetes platform), Red Hat OpenShift, enterprise private cloud infrastructure, and sovereign cloud partners. We work with [vLLM](https://www.vllm.ch/) for high-throughput inference, [LiteLLM](https://www.litellm.ch/) for multi-provider gateways, and [llm-d](https://www.llm-d.ch/) for distributed serving. All platforms run in Swiss or European data centers with an uptime SLA of up to 99.99%.


### Which cloud providers does VSHN operate on?

VSHN operates on multiple Swiss cloud providers including Exoscale and Cloudscale, as well as European sovereign cloud partners. For organizations that need GPU-accelerated workloads, we work with providers offering GPU instances in Swiss data centers on public and private cloud. All infrastructure is managed under a single SLA with 24/7 support from our operations team.


### How does VSHN scope and quote LLMOps consulting engagements?

Every engagement starts with a free architecture consultation where we assess your model serving needs, GPU requirements, and compliance constraints. VSHN then delivers a written scope document with a fixed-price or time-and-materials quote in CHF. Typical engagements cover cluster design, inference stack deployment, observability setup with Prometheus and Grafana, and backup automation for model artefacts and configuration data. There is no commitment at the scoping stage.


### How does VSHN handle GPU scheduling and scaling?

We configure Kubernetes GPU scheduling with NVIDIA device plugins, resource quotas, and pod priority classes so your inference workloads get the GPU time they need. Horizontal pod autoscaling adjusts replica counts based on request queue depth or latency targets. For batch training jobs, we set up preemptible scheduling to optimise cost without blocking interactive inference.
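
As a rough illustration of the autoscaling step, the sketch below applies the replica formula documented for the Kubernetes Horizontal Pod Autoscaler, `ceil(currentReplicas * currentMetric / targetMetric)`, to a queue-depth metric. The function name, the replica bounds, and the choice of "average queued requests per pod" as the metric are assumptions made for this example, not a description of a specific VSHN configuration.

```python
import math


def desired_replicas(current_replicas, current_queue_depth, target_queue_depth,
                     min_replicas=1, max_replicas=8, tolerance=0.1):
    """Return the replica count an HPA-style controller would aim for.

    current_queue_depth and target_queue_depth are average queued requests
    per pod (a hypothetical custom metric chosen for this sketch).
    """
    ratio = current_queue_depth / target_queue_depth
    # Within the tolerance band the controller leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds so GPU spend stays predictable.
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # Two pods, each with 30 queued requests against a target of 10.
    print(desired_replicas(2, 30, 10))
```

The upper bound matters more for GPU workloads than for CPU services, because each extra replica typically reserves a whole accelerator.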


### Can VSHN set up vector databases for RAG pipelines?

Yes. VSHN runs PostgreSQL with the pgvector extension as a fully managed database through our Application Catalog. You get automated daily backups, point-in-time recovery, high-availability replicas, and the same SLA of up to 99.99% as all our other managed database services. We also support dedicated search indices for hybrid retrieval workflows.
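
To make the retrieval step of a RAG pipeline concrete, here is a minimal plain-Python sketch of the ranking that a pgvector query ordered by cosine distance (the `<=>` operator) performs. The document IDs, the two-dimensional embeddings, and the function names are invented for illustration; a real pipeline would store model-generated embeddings in PostgreSQL and let the database do the ranking.

```python
import math


def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity) between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_k(query, documents, k=2):
    """Return the IDs of the k documents closest to the query embedding."""
    ranked = sorted(documents.items(),
                    key=lambda item: cosine_distance(query, item[1]))
    return [doc_id for doc_id, _ in ranked[:k]]


if __name__ == "__main__":
    # Toy embeddings; real ones have hundreds or thousands of dimensions.
    docs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.9, 0.1]}
    print(top_k([1.0, 0.0], docs, k=2))
```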


### How does VSHN ensure data sovereignty for LLM workloads?

All infrastructure runs in Swiss data centers operated by Swiss or European sovereign cloud providers. Training data, model weights, vector embeddings, and inference logs never leave the chosen jurisdiction. All operational access is from Switzerland-based engineers, and we provide audit trails for compliance reporting. See our [sovereignty assessment](/sovereignty/) for details on how VSHN scores against the EU Cloud Sovereignty Framework.


### Does VSHN support open-source and commercial LLMs?

We support both. For open-source models such as Llama, Mistral, Falcon, and Apertus (the Swiss AI foundation model, Apache 2.0 licensed and EU AI Act compliant), we deploy Kubernetes-native serving infrastructure with [vLLM](https://www.vllm.ch/) or [llm-d](https://www.llm-d.ch/). For commercial APIs like Anthropic Claude or OpenAI, we set up [LiteLLM](https://www.litellm.ch/) gateways that route requests through Swiss infrastructure while enforcing budget controls and audit logging. VSHN handles the infrastructure layer so your data science team can focus on model quality.
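
The gateway pattern described here, one endpoint in front of several backends with budget enforcement and failover, can be sketched in a few lines of plain Python. This is not LiteLLM's API: the `Gateway` class, the backend functions, and the budget unit (whole centimes per request) are all hypothetical and only illustrate the control flow.

```python
from dataclasses import dataclass
from typing import Callable


class BudgetExceeded(Exception):
    """Raised when a team has no budget left for another request."""


@dataclass
class Gateway:
    # Ordered (name, callable) backends; earlier entries are preferred.
    backends: list[tuple[str, Callable[[str], str]]]
    # Remaining budget per team in centimes (hypothetical unit).
    budgets: dict[str, int]
    cost_per_request: int = 1

    def complete(self, team: str, prompt: str) -> tuple[str, str]:
        if self.budgets.get(team, 0) < self.cost_per_request:
            raise BudgetExceeded(team)
        errors = []
        for name, call in self.backends:
            try:
                answer = call(prompt)
            except Exception as exc:
                # Backend failed: record the error and try the next one.
                errors.append(f"{name}: {exc}")
                continue
            # Charge the team only for a request that succeeded.
            self.budgets[team] -= self.cost_per_request
            return name, answer
        raise RuntimeError("all backends failed: " + "; ".join(errors))


def flaky_backend(prompt: str) -> str:
    # Stand-in for a commercial API that is currently unreachable.
    raise TimeoutError("upstream timed out")


def local_backend(prompt: str) -> str:
    # Stand-in for a self-hosted model server.
    return f"echo: {prompt}"
```

With `flaky_backend` listed first and `local_backend` second, a call to `complete` falls through to the self-hosted backend and deducts one unit from the calling team's budget.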


### What monitoring and observability does VSHN provide?

VSHN integrates Prometheus and Grafana into every managed platform, with custom dashboards for LLM-specific metrics: inference latency (p50, p95, p99), tokens per second, GPU utilisation, queue depth, and estimated cost per request. Alerting rules notify your team and our 24/7 operations center when metrics breach thresholds, so performance issues are caught before they affect users.
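
For readers unfamiliar with these metrics, the following sketch shows one way the numbers could be derived from raw samples: nearest-rank percentiles for latency, tokens per second over a time window, and a simple cost-per-request estimate from an hourly GPU price. The function names and the cost model are assumptions for this example; in production these values would normally come from Prometheus queries rather than hand-written code.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile of a non-empty list, p in (0, 100]."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_ms, tokens_generated, window_seconds,
              gpu_cost_per_hour, requests):
    """Summarise one observation window of an inference service."""
    return {
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": percentile(latencies_ms, 95),
        "p99_ms": percentile(latencies_ms, 99),
        "tokens_per_second": tokens_generated / window_seconds,
        # Assumed cost model: GPU time in the window spread evenly over requests.
        "cost_per_request": gpu_cost_per_hour * window_seconds / 3600 / requests,
    }


if __name__ == "__main__":
    stats = summarize(list(range(1, 101)), tokens_generated=6000,
                      window_seconds=60, gpu_cost_per_hour=4.0, requests=100)
    print(stats)
```

Tail percentiles (p95, p99) are usually the better alerting signal for interactive inference, since averages hide the slow requests that users actually notice.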


### Can VSHN help if we don't have our own GPU infrastructure?

Yes. Many of our LLMOps engagements start with organisations that have a model or use case but no GPU infrastructure. VSHN provisions GPU-equipped Kubernetes clusters on Swiss cloud providers, deploys your inference stack with vLLM or llm-d, and operates it 24/7. You bring your model requirements; we handle the infrastructure from provisioning through day-two operations. For smaller workloads, shared GPU options let you start without committing to dedicated hardware. Contact us for a free architecture consultation.


### How do I get started with VSHN's LLMOps consulting?

Contact us through the form below for a free initial consultation. VSHN has operated production Kubernetes platforms since 2014 for banks, insurers, and SaaS companies across Switzerland - LLM operations builds on that same infrastructure and team. We assess your workloads, platform requirements, and compliance constraints, then propose an architecture with a clear scope and pricing. Most customers go from consultation to a running production platform in four to six weeks, with ongoing 24/7 operations support.


### Can consulting firms partner with VSHN for client LLM projects?

Yes. Consulting firms and AI agencies regularly partner with VSHN to deliver LLM infrastructure for their clients. You bring the model selection and application design; VSHN provisions and operates the GPU infrastructure, inference stack, and observability on Swiss cloud. Each client engagement runs on isolated infrastructure with dedicated resources. This lets your team focus on the AI solution while VSHN handles the operational complexity of running LLM workloads in production.


## Book an LLMOps consultation

Tell us about your LLM workloads and infrastructure requirements. VSHN provides a free initial consultation covering architecture review, stack selection, and a scoped proposal for your deployment.


## Architecture

- **Image**: llmops_ch/llmops-architecture.svg
- **Alt**: LLMOps architecture diagram showing LiteLLM gateway routing to vLLM, llm-d, and commercial APIs on Kubernetes