Conversational serialization for student simulation
Aalto University · Espoo, Finland
Educational Data Mining · 2026
Models that act like learners let us test tutoring strategies and feedback policies at scale, without repeated classroom deployments.
Knowledge tracing predicts a student's first submission, to update estimates of what they know.
We want the other dimension: every submission within one assignment.
Even this view is naive: we still lack artificial students that solve problems like real learners.
Beyond that illustration: students must also interact with feedback.
From an instructor, an AI agent, or a grader.
Every submission already runs through a grader.
Unit tests in, summative feedback out, already sitting in the logs.
def compute_average(nums):
total = 0
for i in nums:
total += i
return total / len(nums)
Both submissions and feedback sit in the same logs.
Serialize each trajectory as a conversation: student code = assistant turns, grader feedback = user turns.
compute_average(nums) returning the mean of a list with at least one number.def compute_average(nums):
total = 0
for i in nums:
total += i
return average / len(nums) return total / len(nums)The alternating structure lets the model learn temporal dependencies of iterative debugging.
So we finetune small, open weight models directly on the serialized logs.
Supervised finetuning on the serialized dialogs. Loss only on student (assistant) turns. Learns problem solving patterns fast.
DPO (offline): contrast a student's next submission against their nearest submission with a different grade.
Contrasts drawn from the student's own trajectory teach the model why learners submit what they do.
DAPO/GRPO (online): sample candidates, reward AST and grade match; explored here.
FalconCode: CS1 Python, US Air Force Academy. Lab assignments (10 to 50 lines).
Train on Spring 2021 · test on Spring 2022.
Qwen3-4B & Qwen3-8B, QLoRA on 32 GB V100s.
PARA code only, no feedback (prior work)
SFT DPO DAPO our framework
BASE untrained ICL GPT-5-mini
Starting from a real prefix, the model generates the next submission, autoregressively.
The grader evaluates it, the outcome is appended to the context, and we repeat for up to 5 steps.
↻ repeat, up to 5 steps
We stop early if the generated program is fully correct: it passes all unit tests.
Coverage: does the model generate a fully correct solution too early?
competency paradoxGrade Proximity: do the generated programs behave the same as the student's?
functional behaviorCodeBLEU: do the generated programs look the same as the student's?
stylistic similarity| Model / Method | Cov | GP | CB |
|---|---|---|---|
| Qwen3-4B | |||
| BASE | 0.515 | 0.684 | 0.487 |
| PARA | 0.914 | 0.767 | 0.698 |
| SFT | 0.918 | 0.783 | 0.710 |
| DPO | 0.983 | 0.801 | 0.702 |
| DAPO | 0.859 | 0.787 | 0.705 |
| Qwen3-8B | |||
| SFT | 0.921 | 0.795 | 0.722 |
| DPO | 0.954 | 0.801 | 0.718 |
| DAPO | 0.920 | 0.801 | 0.712 |
| GPT-5-mini | 0.322 | 0.570 | 0.449 |
Five variants, two model sizes, evaluated via rollout.
SFT (with feedback) beats PARA (code only) on both metrics, at 4B and 8B.
4B 0.767 → 0.783
8B 0.791 → 0.795
4B 0.698 → 0.710
8B 0.715 → 0.722
Current results: single dataset and model family, data skewed toward successful students, greedy decoding only.
Logs → dialogs, then SFT + preference optimization, grounded in authentic student trajectories.
Code released for reproducibility.
Thank you · questions welcome