EDM 2026 · Short Paper

Teaching Language Models to Code Like Learners

Conversational serialization for student simulation

Charles Koutcheme
Charles Koutcheme
Presenter
Juho Leinonen
Juho Leinonen
Advisor
Arto Hellas
Arto Hellas
Advisor
Aalto University

Aalto University · Espoo, Finland
Educational Data Mining · 2026

Motivation

Why build artificial students?

Models that act like learners let us test tutoring strategies and feedback policies at scale, without repeated classroom deployments.

Motivation

Prior work models what a student knows

Knowledge tracing predicts a student's first submission, to update estimates of what they know.

PRIOR WORK: FIRST SUBMISSION / PROBLEMA1A2A3A4A5first submission (KT)
Motivation

We model how a student behaves

We want the other dimension: every submission within one assignment.

PRIOR WORK: FIRST SUBMISSION / PROBLEMTHIS WORK: SUBMISSIONS PER ASSIGNMENTA1A2A3A4A5first submission (KT)full trajectory (ours)

Even this view is naive: we still lack artificial students that solve problems like real learners.

One more gap

Realistic students need feedback

Beyond that illustration: students must also interact with feedback.

From an instructor, an AI agent, or a grader.

Infrastructure

Towards modeling student interaction with feedback

Every submission already runs through a grader.

Unit tests in, summative feedback out, already sitting in the logs.

STUDENT PROGRAM
def compute_average(nums): total = 0 for i in nums: total += i return total / len(nums)
GRADER
runs unit tests
SUMMATIVE FEEDBACK
2/3 tests passed
runtime error, line 4

Both submissions and feedback sit in the same logs.

Core idea

Logs → Dialogs

Serialize each trajectory as a conversation: student code = assistant turns, grader feedback = user turns.

You are a first year novice learning Python. Solve the assignment; a learning environment returns summative feedback.
env
Write compute_average(nums) returning the mean of a list with at least one number.
student
def compute_average(nums): total = 0 for i in nums: total += i return average / len(nums)
env
Runtime error: undefined variable "average".
student
return total / len(nums)
env
Tests passed: 8/8 ✓

The alternating structure lets the model learn temporal dependencies of iterative debugging.

Approach

We finetune, not just prompt

  • Prompting proprietary LLMs means cost privacy deployment concerns for real courses.
  • Competency paradox: strong coders struggle to make mistakes like students.

So we finetune small, open weight models directly on the serialized logs.

Method

Training artificial students

1 · SFT

Supervised finetuning on the serialized dialogs. Loss only on student (assistant) turns. Learns problem solving patterns fast.

2 · Preference optimization

DPO (offline): contrast a student's next submission against their nearest submission with a different grade.

Contrasts drawn from the student's own trajectory teach the model why learners submit what they do.

DAPO/GRPO (online): sample candidates, reward AST and grade match; explored here.

Experiments

Setup

Data

FalconCode: CS1 Python, US Air Force Academy. Lab assignments (10 to 50 lines).
Train on Spring 2021 · test on Spring 2022.

Models

Qwen3-4B & Qwen3-8B, QLoRA on 32 GB V100s.

Variants

PARA code only, no feedback (prior work)

SFT DPO DAPO our framework

BASE untrained ICL   GPT-5-mini

Experiments

How we evaluate: rollout

Starting from a real prefix, the model generates the next submission, autoregressively.

The grader evaluates it, the outcome is appended to the context, and we repeat for up to 5 steps.

history
real submissions + feedback so far
model
generates submission t+1
grader
evaluates & appends outcome

↻ repeat, up to 5 steps

We stop early if the generated program is fully correct: it passes all unit tests.

Metrics

Coverage: does the model generate a fully correct solution too early?

competency paradox

Grade Proximity: do the generated programs behave the same as the student's?

functional behavior

CodeBLEU: do the generated programs look the same as the student's?

stylistic similarity
Results

All variants, both model sizes

Model / MethodCovGPCB
Qwen3-4B
BASE0.5150.6840.487
PARA0.9140.7670.698
SFT0.9180.7830.710
DPO0.9830.8010.702
DAPO0.8590.7870.705
Qwen3-8B
SFT0.9210.7950.722
DPO0.9540.8010.718
DAPO0.9200.8010.712
GPT-5-mini0.3220.5700.449

Five variants, two model sizes, evaluated via rollout.

Result 1

Prompting fails to imitate learners

0.32
GPT-5-mini coverage
0.45
GPT-5-mini CodeBLEU
>0.9
any finetuned model
  • Large Language Models (strong coders) finish too early: low coverage.
  • Code doesn't resemble student submissions.
  • Untrained Qwen beats GPT-5-mini: small models likely make genuine programming mistakes.
Result 2

Environment feedback helps

SFT (with feedback) beats PARA (code only) on both metrics, at 4B and 8B.

Grade proximity

4B  0.767 → 0.783
8B  0.791 → 0.795

CodeBLEU

4B  0.698 → 0.710
8B  0.715 → 0.722

Result 3

Offline preference optimization wins

0.801
DPO grade proximity, best at both sizes
0.98
DPO coverage, 4B model
0.722
SFT still wins CodeBLEU
  • DPO wins coverage and grade proximity, both sizes.
  • SFT's higher CodeBLEU mostly means it stops early; DPO rolls out longer, at a small similarity cost.

In the paper

Lot more interesting results in the paper

What's next

Combine both dimensions

  • Model both axes together: across assignments and within each one, as a single trajectory.
PRIOR WORK: FIRST SUBMISSION / PROBLEMTHIS WORK: SUBMISSIONS PER ASSIGNMENTA1A2A3A4A5first submission (KT)full trajectory (ours)
  • Bring in richer feedback: not just the grader, but instructor or AI generated feedback too.

Current results: single dataset and model family, data skewed toward successful students, greedy decoding only.


Artificial students that debug like learners

Logs → dialogs, then SFT + preference optimization, grounded in authentic student trajectories.

Code released for reproducibility.

Thank you · questions welcome

github.com/KoutchemeCharles/
edm-conv-ser