TL;DR: We’re looking for an AI-native intern or working student excited to experiment with new prompt/context-engineering methods, burn tokens, build benchmarks, and set up evals, all to push the performance of our GenAI and agentic capabilities.
Your Responsibilities
- Build and Run the End-to-End Evaluation Workbench: Design, implement, and manage the full prompt-evaluation lifecycle, including prompt development, test-batch definition, model execution, and rigorous performance scoring against established metrics.
- Develop and Refine Robust Evaluation Metrics and Frameworks: Select, implement, and, where needed, create custom quantitative and qualitative metrics (evals) that objectively measure the quality, accuracy, and desired behavior of model outputs against defined benchmarks.
- Systematically Drive Prompt Optimization (Closed-Loop): Conduct in-depth analysis of evaluation results, identify performance gaps, and iteratively refine and update prompts (e.g., system prompts, few-shot examples, chain-of-thought) to maximize model performance and maintain alignment with goals.
- Pioneer Automation for Continuous Improvement: Contribute to the design and initial implementation of a closed-loop feedback system, with the long-term goal of autonomously updating prompts and/or model parameters based on evaluation outcomes to steadily improve accuracy.
- Document and Communicate Key Performance Insights: Clearly document the evaluation methodology, results, and prompt versions, and provide actionable recommendations and conclusions to stakeholders, focusing on trade-offs and the path to optimal performance.
Your Profile
- Academic Background: Currently pursuing a Bachelor’s or Master’s degree in Data Science, Computational Linguistics, Artificial Intelligence, Computer Science, or a closely related quantitative field.
- Technical Proficiency: Strong proficiency in Python is mandatory, including experience with data-manipulation libraries such as Pandas and NumPy; experience with modern LLM frameworks/libraries (e.g., Hugging Face, OpenAI API/SDKs, LangChain, LlamaIndex) is a plus.
- Prompt Engineering & LLM Knowledge: Demonstrated foundational understanding of Large Language Models (LLMs), including the principles of prompt engineering (e.g., few-shot, chain-of-thought) and different LLM evaluation techniques (e.g., human-in-the-loop vs. automated evals).
- Analytical & Problem-Solving Aptitude: Proven ability to approach complex, unstructured problems (like performance comparison and optimization) with a systematic, data-driven methodology; capable of translating evaluation results into concrete prompt refinements.
- Logistical Requirement: Must be able to work on-site in Zurich, Switzerland for the duration of the internship.
What We Offer
- Direct impact, working with clients, and shipping fast
- Significant equity on top of salary compensation
- Working closely with founders who have built and scaled fintech and analytics products before
- A development budget, free to spend on whatever gives you a level-up
- Regular team activities and dinners
- Office in the heart of Zurich, a new MacBook Pro, and more