Role Summary
We are seeking a highly skilled and adaptable Principal Software Engineer to lead platform engineering initiatives focused on Kubernetes-based infrastructure and AI coding tool integration. This role will drive the design and development of a custom Kubernetes development platform, build critical tooling for scalable deployment, and contribute to the evolution of AI agent capabilities. You'll collaborate closely with development, operations, and AI teams, providing technical leadership and hands-on support across the platform lifecycle.
This is a 100% remote role, primarily supporting Eastern Time working hours.
Key Responsibilities
- Lead design and architecture efforts in collaboration with client leads.
- Develop a custom-built Kubernetes development platform, including writing and reviewing technical design documents.
- Build and maintain a Kubernetes Operator to:
  - Orchestrate upgrades and workload clusters.
  - Automate deployment of new workload clusters.
- Support internal development teams by resolving platform-related inquiries and guiding best practices.
- Champion platform engineering best practices including CI/CD, observability, and infrastructure as code.
- Read and debug unfamiliar codebases and systems with minimal guidance.
- Mentor junior engineers and foster a culture of technical excellence and continuous learning.
AI-Specific Experience
- Integrate and extend AI coding tools (e.g., GitHub Copilot, Claude Code) into development workflows.
- Design and implement agent tools (e.g., LangChain Tools, Claude Code Agent Skills) to enhance AI capabilities.
- Understand Model Context Protocol (MCP) server design, including dual support for MCP and REST API formats.
- Address security concerns such as prompt injection and other vulnerabilities in AI-integrated systems.
Required Skills & Experience
- Proficiency in Python, Go, and Node.js.
- Deep experience with Kubernetes or similar technologies (e.g., Docker Compose, Docker Swarm, AWS ECS).
- Strong understanding of container image development using Docker or Podman.
- Experience with GitHub Actions for CI/CD workflows.
- Familiarity with observability tools and concepts, especially Prometheus and OpenTelemetry.
- Experience implementing feature flags in distributed systems.
- Hands-on experience with AI coding tools and agent tool development.
- Knowledge of MCP server architecture and security best practices.
- Exceptional reading comprehension and ability to work outside your comfort zone:
  - Navigating unfamiliar codebases.
  - Debugging complex systems without prior exposure.