Senior AI DevOps / LLMOps

TechBiz Global GmbH · Baden-Baden

On-site · Posted today

Skills

Phoenix · Kubernetes · AWS · Azure · Terraform · Ansible · GitLab CI · GitHub Actions · CI/CD · DevOps · SRE · GitHub · GitLab · LLM

Job description

At TechBiz Global, we provide recruitment services to the top clients in our portfolio. We are currently seeking a Senior AI DevOps / LLMOps specialist to join one of our clients' teams. If you're looking for an exciting opportunity to grow in an innovative environment, this could be the perfect fit for you.

Key Responsibilities

1. Automation of Build-to-Production
- Design and implement robust CI/CD pipelines tailored for AI, covering model weights, dataset versioning, and application code.
- Develop specialized workflows for PromptOps, ensuring that system prompts are version-controlled, tested for regressions, and deployed with the same rigor as traditional code.
- Automate the deployment of agentic workflows, managing the complexities of stateful AI interactions and multi-agent handoffs.

2. AI Infrastructure as Code (IaC)
- Provision and manage high-performance compute environments (GPU clusters, TPU pods) using Terraform, Pulumi, or Ansible.
- Define and enforce Policy-as-Code for AI endpoints to ensure compliance with security, cost and usage limits, and data residency requirements.
- Maintain a consistent environment across hybrid infrastructure, ensuring parity between on-premises development and cloud production.

3. Safe Experimentation & Controlled Releases
- Architect Progressive Delivery strategies for AI, including canary releases, blue-green deployments, and shadowing (where new models run in parallel with production to compare outputs).
- Build “Evaluation-in-the-Loop” gates within the pipeline to automatically test for bias, hallucination, and performance degradation before a release.
- Implement A/B testing frameworks specifically designed for LLM outputs and agentic behavior.

4. Monitoring & Observability
- Establish deep observability into inference endpoints, tracking metrics such as tokens per second, latency, and drift in model accuracy.
- Integrate feedback loops that capture production “edge cases” and feed them back into the training and fine-tuning pipelines.

Must-Have Technical Skills:
- Orchestration: Advanced Kubernetes (K8s) skills, specifically with Kubeflow, Ray, or NVIDIA Triton.
- CI/CD & IaC: Expertise in GitHub Actions/GitLab CI, and Terraform or Pulumi.
- AI Tooling: Experience with Weights & Biases, MLflow, LangSmith, or Arize Phoenix.
- Hardware: Understanding of GPU virtualization, CUDA drivers, and on-premises hardware management.
- Security: Familiarity with Open Policy Agent (OPA) and secret management (Vault).

Experience:
- 10+ years in DevOps, SRE, or Cloud Engineering.
- 2+ years of hands-on experience in MLOps or LLMOps, specifically moving LLMs from notebook to production.
- Proven experience managing hybrid cloud environments (e.g., AWS/Azure + private data center).