EDM 2026 · Short Paper

Teaching LMs How to Code Like Learners

Conversational serialization for student simulation

Charles Koutcheme

Presenter

Juho Leinonen

Advisor

Arto Hellas

Advisor

Aalto University · Espoo, Finland
Educational Data Mining · 2026

Motivation

Why build artificial students?

Models that act like learners let us test tutoring strategies and feedback policies at scale, without repeated classroom deployments.

Motivation

Prior work models what a student knows

Knowledge tracing predicts a student's first submission, to update estimates of what they know.

Motivation

We model how a student behaves

We want the other dimension: every submission within one assignment.

Even this view is naive: we still lack artificial students that solve problems like real learners.

One more gap

Realistic students need feedback

Beyond that illustration: students must also interact with feedback.

From an instructor, an AI agent, or a grader.

Infrastructure

Towards modeling student interaction with feedback

Every submission already runs through a grader.

Unit tests in, summative feedback out, already sitting in the logs.

STUDENT PROGRAM

def compute_average(nums):
    total = 0
    for i in nums:
        total += i
    return total / len(nums)

→

GRADER

runs unit tests

✓✓✗

→

SUMMATIVE FEEDBACK

2/3 tests passed
runtime error, line 4

Both submissions and feedback sit in the same logs.

Core idea

Logs → Dialogs

Serialize each trajectory as a conversation: student code = assistant turns, grader feedback = user turns.

You are a first year novice learning Python. Solve the assignment; a learning environment returns summative feedback.

env

Write compute_average(nums) returning the mean of a list with at least one number.

student

def compute_average(nums):
    total = 0
    for i in nums:
        total += i
    return average / len(nums)

env

Runtime error: undefined variable "average".

student

return total / len(nums)

env

Tests passed: 8/8 ✓

The alternating structure lets the model learn temporal dependencies of iterative debugging.

Approach

We finetune, not just prompt

Prompting proprietary LLMs means cost privacy deployment concerns for real courses.
Competency paradox: strong coders struggle to make mistakes like students.

So we finetune small, open weight models directly on the serialized logs.

Method

Training artificial students

1 · SFT

Supervised finetuning on the serialized dialogs. Loss only on student (assistant) turns. Learns problem solving patterns fast.

2 · Preference optimization

DPO (offline): contrast a student's next submission against their nearest submission with a different grade.

Contrasts drawn from the student's own trajectory teach the model why learners submit what they do.

DAPO/GRPO (online): sample candidates, reward AST and grade match; explored here.

Experiments

Setup

Data

FalconCode: CS1 Python, US Air Force Academy. Lab assignments (10 to 50 lines).
Train on Spring 2021 · test on Spring 2022.

Models

Qwen3-4B & Qwen3-8B, QLoRA on 32 GB V100s.

Variants

PARA code only, no feedback (prior work)

SFT DPO DAPO our framework

BASE untrained ICL GPT-5-mini

Experiments

How we evaluate: rollout

Starting from a real prefix, the model generates the next submission, autoregressively.

The grader evaluates it, the outcome is appended to the context, and we repeat for up to 5 steps.

history

real submissions + feedback so far

model

generates submission t+1

grader

evaluates & appends outcome

↻ repeat, up to 5 steps

We stop early if the generated program is fully correct: it passes all unit tests.

Metrics

Coverage: does the model generate a fully correct solution too early?

competency paradox

Grade Proximity: do the generated programs behave the same as the student's?

functional behavior

CodeBLEU: do the generated programs look the same as the student's?

stylistic similarity

Results

All variants, both model sizes

Model / Method	Cov	GP	CB
Qwen3-4B
BASE	0.515	0.684	0.487
PARA	0.914	0.767	0.698
SFT	0.918	0.783	0.710
DPO	0.983	0.801	0.702
DAPO	0.859	0.787	0.705
Qwen3-8B
SFT	0.921	0.795	0.722
DPO	0.954	0.801	0.718
DAPO	0.920	0.801	0.712
GPT-5-mini	0.322	0.570	0.449

Five variants, two model sizes, evaluated via rollout.

Result 1

Prompting fails to imitate learners

0.32

GPT-5-mini coverage

0.45

GPT-5-mini CodeBLEU

>0.9

any finetuned model

Large Language Models (strong coders) finish too early: low coverage.
Code doesn't resemble student submissions.
Untrained Qwen beats GPT-5-mini: small models likely make genuine programming mistakes.

Result 2

Environment feedback helps

SFT (with feedback) beats PARA (code only) on both metrics, at 4B and 8B.

Grade proximity

4B 0.767 → 0.783
8B 0.791 → 0.795

CodeBLEU

4B 0.698 → 0.710
8B 0.715 → 0.722

Result 3

Offline preference optimization wins

0.801

DPO grade proximity, best at both sizes

0.98

DPO coverage, 4B model

0.722

SFT still wins CodeBLEU

DPO wins coverage and grade proximity, both sizes.
SFT's higher CodeBLEU mostly means it stops early; DPO rolls out longer, at a small similarity cost.

In the paper

Lot more interesting results in the paper

What's next

Combine both dimensions

Model both axes together: across assignments and within each one, as a single trajectory.

Bring in richer feedback: not just the grader, but instructor or AI generated feedback too.

Current results: single dataset and model family, data skewed toward successful students, greedy decoding only.

Artificial students that debug like learners

Logs → dialogs, then SFT + preference optimization, grounded in authentic student trajectories.

Code released for reproducibility.

Thank you · questions welcome

github.com/KoutchemeCharles/
edm-conv-ser