Community trains Gemma to reason with Tunix and TPUs

TL;DR

Developers used Kaggle TPUs to transform Gemma models into reasoning-capable systems via structured training pipelines.

Key points

1
G-RaR: Rubric-Based Reinforcement Learning: The winning solution, G-RaR, trains Gemma models to emit a structured reasoning trace inside <reasoning> tags before the answer. A Gemma-3-12B judge model scores each response against task-specific rubrics, and those discrete scores are converted into continuous rewards. Because the reward does not depend on exact correctness alone, the method also works for open-ended tasks. A medical model, for example, can be rewarded for a step-by-step clinical reasoning trace rather than for a guess. To reproduce it, first fine-tune Gemma-2-2B on 33k samples to establish the <reasoning>/<answer> structure, then run GRPO with a composite reward function (format + exact answer + rubric score) on a single Kaggle TPU v5e-8 for 9 hours.
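The composite reward described above (format + exact answer + rubric score) can be sketched roughly as follows. This is a minimal illustration, not the G-RaR implementation: the weights, the 0 to 5 rubric scale, and every function name are assumptions, and the judge call is replaced by a plain list of scores.

```python
import re

# Hypothetical weights and rubric scale; the source does not state them.
FORMAT_WEIGHT = 0.2
ANSWER_WEIGHT = 0.3
RUBRIC_WEIGHT = 0.5
RUBRIC_MAX = 5  # assumed judge scale: an integer from 0 to 5 per rubric item

# A response must be exactly one <reasoning> block followed by one <answer> block.
_PATTERN = re.compile(
    r"^\s*<reasoning>(.+?)</reasoning>\s*<answer>(.+?)</answer>\s*$",
    re.DOTALL,
)


def parse_response(text):
    """Return (reasoning, answer) if the tag structure is present, else None."""
    match = _PATTERN.match(text)
    if match is None:
        return None
    return match.group(1).strip(), match.group(2).strip()


def rubric_to_reward(scores, max_score=RUBRIC_MAX):
    """Map discrete per-item judge scores to one continuous value in [0, 1]."""
    if not scores:
        return 0.0
    return sum(scores) / (max_score * len(scores))


def composite_reward(text, reference, rubric_scores):
    """Weighted sum of format, exact-answer, and rubric rewards."""
    parsed = parse_response(text)
    if parsed is None:
        return 0.0  # malformed output earns nothing in this sketch
    _, answer = parsed
    answer_reward = 1.0 if answer.lower() == reference.strip().lower() else 0.0
    rubric_reward = rubric_to_reward(rubric_scores)
    return (
        FORMAT_WEIGHT * 1.0
        + ANSWER_WEIGHT * answer_reward
        + RUBRIC_WEIGHT * rubric_reward
    )
```

In this sketch a well-formed response with a wrong answer still earns partial credit from the rubric term, which is the property that lets the reward signal work on open-ended tasks.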
2
Pinocchio-1B: Three-Stage Reasoning Pipeline: Pinocchio-1B is trained in 9 hours through three stages: SFT (distillation from an OSS-120B teacher), SimPO (enforcing XML formatting), and GRPO (refinement with a Gemini 2.0 Flash judge). The pipeline moves the model from basic pattern matching toward logical deduction by first teaching Chain-of-Thought, then locking in strict formatting to curb verbosity, and finally refining the logic with real-time accuracy checks. To replicate it, train on 70k prompts with a task-router, use SimPO in place of the more memory-heavy DPO to get well-formed XML output, then apply GRPO with asynchronous evaluation to penalize hallucinations. In robotics, this could let a model plan multi-step tasks without overcomplicating its response.
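SimPO is lighter on memory than DPO because its objective uses the policy's own length-normalized log-probabilities and needs no frozen reference model. A minimal sketch of the per-pair loss is shown below, assuming the per-token log-probabilities of the chosen and rejected responses are already available. The `beta` and `gamma` values are illustrative defaults, not the settings used for Pinocchio-1B.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def simpo_loss(chosen_token_logps, rejected_token_logps, beta=2.0, gamma=1.0):
    """SimPO loss for one preference pair.

    Each argument is a non-empty list of per-token log-probabilities under
    the policy being trained. beta scales the length-normalized reward and
    gamma is the target margin between chosen and rejected responses.
    """
    avg_chosen = sum(chosen_token_logps) / len(chosen_token_logps)
    avg_rejected = sum(rejected_token_logps) / len(rejected_token_logps)
    margin = beta * (avg_chosen - avg_rejected) - gamma
    # -log(sigmoid(margin)) is the same quantity as softplus(-margin)
    return softplus(-margin)
```

For the formatting stage, the chosen response in each pair would be the well-formed XML output and the rejected one a malformed or overly verbose output. The loss shrinks as the policy assigns a higher average per-token log-probability to the chosen response.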
3
IDEA-E: Curriculum-Guided GRPO for Ethics: IDEA-E distills ethical reasoning into a 2B model using curriculum-guided GRPO and TF-IDF rewards. The structured format forces step-by-step deduction, while the TF-IDF reward discourages verbose responses by rewarding context-relevant vocabulary. To apply it, fine-tune on teacher data to establish the IDEA-E format, then run GRPO with curriculum guidance so that logical steps are prioritized over guessing. The approach is especially useful in medical applications where a model must avoid premature conclusions, for example by generating a step-by-step diagnosis trace instead of a single answer.
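The source does not say how IDEA-E computes its TF-IDF reward. One plausible reading, sketched here as an assumption, is the cosine similarity between TF-IDF vectors of the response and of the prompt context: off-topic padding adds terms that are absent from the context, which lowers the score. All function names are hypothetical, and the smoothed IDF formula is a common convention rather than anything confirmed by the source.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_idf(corpus):
    """Smoothed IDF per term, plus the value to use for unseen terms."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(tokenize(doc)))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return idf, math.log(1 + n_docs) + 1.0


def tfidf_vector(text, idf, default_idf):
    """Sparse TF-IDF vector as a dict mapping term to weight."""
    counts = Counter(tokenize(text))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {t: (c / total) * idf.get(t, default_idf) for t, c in counts.items()}


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    return dot / (norm_a * norm_b)


def tfidf_reward(response, context, corpus):
    """Reward in [0, 1]: higher when the response stays on the context's vocabulary."""
    idf, default_idf = build_idf(corpus)
    return cosine(
        tfidf_vector(response, idf, default_idf),
        tfidf_vector(context, idf, default_idf),
    )
```

A term like this would normally be one component of a larger reward, since vocabulary overlap alone says nothing about whether the reasoning is sound.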
4
Domain-Specific Reasoning Applications: Winning submissions demonstrated practical reasoning in the medical, chemistry, legal, and robotics fields. In healthcare, GRPO-trained models generate interpretable clinical reasoning traces for complex cases. In chemistry, step-by-step traces help small models solve molecular problems. Legal applications use structured reasoning to analyze case law without hallucinations. Robotics projects apply multi-step planning within single-session training. To get started, pick a domain-specific dataset (e.g., medical symptoms) and use GRPO to refine outputs into actionable steps, such as breaking a legal query down into evidence gathering, analysis, and conclusion phases.
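Whatever the domain, the GRPO step works the same way: several responses are sampled for one prompt, each is scored, and each response's advantage is its reward relative to the rest of its group. A minimal sketch of that normalization follows. The function name and the `eps` value are illustrative, and using the population standard deviation is an assumption about a detail that implementations vary on.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize the rewards of one prompt's sampled responses.

    Each advantage is (reward - group mean) / (group std + eps), so responses
    that beat their siblings get positive values and the rest get negative ones.
    """
    if not rewards:
        return []
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; assumed choice
    return [(r - mean) / (std + eps) for r in rewards]
```

When every response in a group receives the same reward, all advantages are zero and that prompt contributes no learning signal. This is one reason a graded reward (for example a rubric or judge score) can be more useful than a purely binary one.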

This is a summary of an official post from the Google Search Central Blog, provided for quick reading. Google and the Google logo are trademarks of Google LLC; My Tool Studio is not affiliated with Google. Always refer to the original announcement for authoritative guidance.