Equiteams

2026-03-08

How it started

This started when my lecturer posted thesis topics on a public spreadsheet. Rafif spotted this one and asked if I wanted in, and I said yes with only a vague idea of what AI for Education was supposed to involve. I assumed I was about to do some real AI work, even though at that point I had never even touched PyTorch. In practice, most of my work went into the application around an existing optimizer, Edu2Com: onboarding, forms, assignment setup, dashboards, exports, and the integration layer that made Edu2Com usable in a real class. I built the application. Rafif handled the whole UI/UX side and all of the mascot work, and we spent a lot of days working together in Figma.¹

What goes into the payload

Each student contributes four kinds of input:

Skills: self-assessed per assignment, normalized to [0, 1]
Personality: four scores in [-1, 1] from a 41-item questionnaire
Topic preferences: optional rankings, also normalized to [0, 1]
Gender: optional person attribute

The questionnaire is MBTI-shaped because that is what the Edu2Com contract can accept: four axes, each collapsed to a number between -1 and 1.² In the app, the question order is randomized, there is one attention-check item, and submissions that come in too quickly are rejected instead of being treated as serious input.
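A minimal sketch of that submission guard, to make the two checks concrete. The field names, the expected attention-check answer, and the 60-second minimum are assumptions for illustration, not the app's real values.

```typescript
// Hypothetical shape of one questionnaire submission (field names are illustrative).
interface QuestionnaireSubmission {
  attentionCheckAnswer: number; // answer given to the single attention-check item
  startedAtMs: number;          // timestamp when the form was opened
  submittedAtMs: number;        // timestamp when the form was submitted
}

// Assumed thresholds for this sketch only.
const EXPECTED_ATTENTION_ANSWER = 4;
const MIN_DURATION_MS = 60000;

type GuardResult = { ok: true } | { ok: false; reason: string };

// Rejects submissions that fail the attention check or arrive too quickly.
function checkSubmission(s: QuestionnaireSubmission): GuardResult {
  if (s.attentionCheckAnswer !== EXPECTED_ATTENTION_ANSWER) {
    return { ok: false, reason: "attention check failed" };
  }
  if (s.submittedAtMs - s.startedAtMs < MIN_DURATION_MS) {
    return { ok: false, reason: "submitted too quickly" };
  }
  return { ok: true };
}
```

Returning a reason instead of a bare boolean lets the form tell the student why the submission was rejected.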

Edu2Com exposes four weights:

Weight	What it controls	Default
α	Skill matching	0.4
β	Personality	0.3
γ	Student-to-student preferences	0.2
δ	Task preferences	0.1

In my app, α, β, and δ are active. γ exists in the API and in the database schema, but the submission flow does not collect person-to-person affinity yet, so topic rankings are the preference signal that actually matters. Gender gets sent through as part of the person object, but the public API does not expose a separate gender weight.

From the papers, Edu2Com is an anytime algorithm: it builds an initial team assignment, then keeps improving it with swaps instead of brute-forcing every possible combination. That part is research. The part I got to know very well was its behavior at the boundary.

The personality result I did not expect

β was where things got weird.

I went in assuming personality compatibility would reward similarity. The live API kept pointing the other way.

The cleanest test I ran used a single two-person task. Both students had the same skill profile. All weights were set to 0 except β = 1, and initRandom was disabled so the result stayed deterministic.

Two students with identical personality vectors scored 0.3225
Two students with exact opposite vectors scored 1.32

That second number matters for two reasons. First, it is much higher. Second, it exceeds 1.0, which means the raw team quality coming back from Edu2Com is not reliably normalized even when the weights look tidy on the way in. The app deals with that by normalizing weights before background requests and clamping returned quality values before showing them.

Figure: single two-person task, identical skills, β = 1.0, all other weights 0. Raw response values: 0.3225 (identical vectors) vs 1.32 (opposite vectors).
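The two safeguards mentioned above, normalizing weights on the way out and clamping quality on the way back, can be sketched roughly as follows. The `Weights` shape and function names are hypothetical, and the sketch assumes the displayed quality range is [0, 1].

```typescript
// Hypothetical weight object mirroring the four Edu2Com weights.
interface Weights {
  alpha: number; // skill matching
  beta: number;  // personality
  gamma: number; // student-to-student preferences
  delta: number; // task preferences
}

// Scales the weights so they sum to 1 before a request is sent.
function normalizeWeights(w: Weights): Weights {
  const values = [w.alpha, w.beta, w.gamma, w.delta];
  if (values.some((v) => isNaN(v) || v < 0)) {
    throw new Error("weights must be non-negative numbers");
  }
  const sum = values.reduce((a, b) => a + b, 0);
  if (sum <= 0) {
    throw new Error("at least one weight must be positive");
  }
  return {
    alpha: w.alpha / sum,
    beta: w.beta / sum,
    gamma: w.gamma / sum,
    delta: w.delta / sum,
  };
}

// Clamps a raw quality value into [0, 1] before it is shown to a lecturer.
function clampQuality(raw: number): number {
  if (isNaN(raw)) return 0;
  return Math.min(1, Math.max(0, raw));
}
```

With this, the raw 1.32 from the opposite-vector test would be displayed as 1, while 0.3225 passes through unchanged.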

I do not have Edu2Com’s source code, so I cannot claim to know its exact internal compatibility function. But from the public interface, the personality term looks like it rewards distance between two personality vectors rather than sameness. The behavioral integration tests in the repo already showed that changing β changes team composition. The simple API call above made the direction obvious.

That also matches the way the rest of the service feels: it was built from a research perspective, where “diverse teams may work better” is a reasonable thing to encode directly into the objective function.

Spotlight: the part around the API call

The API call itself is one POST request. The annoying part was everything needed to make that request actually work.

Before team formation starts, the app filters out students who never submitted the assignment form or who are missing personality scores. Skills are collected from the union of submitted student records. If someone has no skill rows at all, the payload gets a fallback zero-level skill so the downstream service does not quietly drop that student.
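Roughly, that filtering step looks like the sketch below. The record shapes and the `"general"` placeholder name are assumptions. So is the choice to reuse the first skill name from the union for the fallback row.

```typescript
interface SkillRow { name: string; level: number } // level normalized to [0, 1]

// Hypothetical student record as stored by the app.
interface StudentRecord {
  id: string;
  submittedForm: boolean;
  personality: number[] | null; // four scores in [-1, 1], or null if missing
  skills: SkillRow[];
}

type EligibleStudent = StudentRecord & { personality: number[] };

interface PayloadPerson { id: string; personality: number[]; skills: SkillRow[] }

// Placeholder used only when no eligible student has any skill row (assumed name).
const FALLBACK_SKILL_NAME = "general";

function buildPeople(students: StudentRecord[]): PayloadPerson[] {
  // Keep only students who submitted the form and have personality scores.
  const eligible = students.filter(
    (s): s is EligibleStudent => s.submittedForm && s.personality !== null
  );

  // Union of skill names across the eligible records.
  const skillNames: string[] = [];
  for (const s of eligible) {
    for (const row of s.skills) {
      if (skillNames.indexOf(row.name) === -1) skillNames.push(row.name);
    }
  }
  const fallbackName = skillNames[0] ?? FALLBACK_SKILL_NAME;

  // Students without any skill rows get one zero-level skill so they stay in the payload.
  return eligible.map((s) => ({
    id: s.id,
    personality: s.personality,
    skills: s.skills.length > 0 ? s.skills : [{ name: fallbackName, level: 0 }],
  }));
}
```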

Group sizing also needed its own guardrails. Lecturers can pick either the number of groups or the number of students per group. Both sound simple until you start distributing remainders and rejecting configurations that produce one-person teams.
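Here is one way those sizing guardrails could work, as a sketch rather than the app's exact rules. It assumes that in "students per group" mode any leftover students are spread across existing groups instead of forming a smaller one. The type and function names are made up for the example.

```typescript
// The lecturer picks one of two modes (hypothetical type).
type SizingMode =
  | { kind: "groupCount"; groups: number }
  | { kind: "groupSize"; size: number };

function isPositiveInteger(n: number): boolean {
  return n >= 1 && Math.floor(n) === n;
}

// Returns the size of each group, largest first, or throws on an invalid configuration.
function planGroupSizes(studentCount: number, mode: SizingMode): number[] {
  const requested = mode.kind === "groupCount" ? mode.groups : mode.size;
  if (!isPositiveInteger(studentCount) || !isPositiveInteger(requested)) {
    throw new Error("student count and group setting must be positive integers");
  }

  // In size mode, leftovers are absorbed by existing groups (assumption).
  const groups =
    mode.kind === "groupCount" ? mode.groups : Math.floor(studentCount / mode.size);
  if (groups < 1 || groups > studentCount) {
    throw new Error("configuration does not fit the number of students");
  }

  // Distribute the remainder one student at a time over the first groups.
  const base = Math.floor(studentCount / groups);
  const remainder = studentCount % groups;
  const sizes: number[] = [];
  for (let i = 0; i < groups; i++) {
    sizes.push(base + (i < remainder ? 1 : 0));
  }

  if (sizes.some((n) => n < 2)) {
    throw new Error("configuration would produce a one-person team");
  }
  return sizes;
}
```

For example, 10 students at 3 per group gives three groups of sizes 4, 3, and 3, while 5 students in 4 groups is rejected.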

Topics were the ugliest translation layer. Edu2Com wants tasks. A real class has assignment topics, and the topic count often does not match the number of final groups. When the numbers line up, each group can map directly to a single topic. When they do not, the app buckets topics round-robin across candidate groups, averages the preferences inside each bucket, and uses the strongest-supported topic as that bucket’s representative task.
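A sketch of the bucketing step, under several assumptions: it only covers the case with more topics than groups, "strongest-supported" is read as the highest summed preference across students, missing rankings count as 0, and ties go to the earlier topic. The types and names are illustrative.

```typescript
// prefs[studentId][topicId] is a normalized preference in [0, 1].
type Preferences = Record<string, Record<string, number>>;

interface TopicBucket {
  topicIds: string[];                        // topics assigned to this bucket
  representative: string;                    // topic sent to the optimizer as the task
  studentPreference: Record<string, number>; // per-student average over the bucket
}

function bucketTopics(
  topicIds: string[],
  groupCount: number,
  prefs: Preferences
): TopicBucket[] {
  if (groupCount < 1 || topicIds.length < groupCount) {
    throw new Error("this sketch needs at least one topic per group");
  }
  const studentIds = Object.keys(prefs);
  const buckets: TopicBucket[] = [];

  for (let g = 0; g < groupCount; g++) {
    // Round-robin: topic i lands in bucket i mod groupCount.
    const ids = topicIds.filter((_, i) => i % groupCount === g);

    // Average each student's preferences over the topics in this bucket.
    const studentPreference: Record<string, number> = {};
    for (const sid of studentIds) {
      const own = prefs[sid] ?? {};
      const total = ids.reduce((sum, t) => sum + (own[t] ?? 0), 0);
      studentPreference[sid] = total / ids.length;
    }

    // Pick the topic with the highest summed preference as the representative.
    let representative = "";
    let bestSupport = -Infinity;
    for (const t of ids) {
      const support = studentIds.reduce(
        (sum, sid) => sum + ((prefs[sid] ?? {})[t] ?? 0),
        0
      );
      if (support > bestSupport) {
        bestSupport = support;
        representative = t;
      }
    }

    buckets.push({ topicIds: ids, representative, studentPreference });
  }
  return buckets;
}
```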

The synchronous endpoint held up for small cohorts and then started falling off a cliff. In my tests:

4 students took about 1.4 seconds
20 students took about 13.8 seconds
40 students took about 51.4 seconds
60 students timed out after about 60.9 seconds

That is why the app uses Edu2Com’s background endpoint. A team-formation request gets stored first, the callback URL is signed with HMAC, retry logic kicks in for 408/429/5xx responses, and stale requests get auto-failed if a new formation attempt starts later. When the webhook comes back, the app validates the payload, clamps out-of-range quality values, saves the teams in one Prisma transaction, and appends any unassigned students to the smallest teams instead of letting them disappear.
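Two of those pieces are small enough to sketch: deciding which responses are worth retrying, and placing leftover students on the smallest teams. The `Team` shape and function names are hypothetical. HMAC signing and the Prisma transaction are left out here.

```typescript
// Retry only on request timeout, rate limiting, and server errors.
function isRetryableStatus(status: number): boolean {
  return status === 408 || status === 429 || (status >= 500 && status <= 599);
}

// Hypothetical team shape as returned by the webhook after validation.
interface Team {
  id: string;
  memberIds: string[];
}

// Appends each unassigned student to whichever team is currently smallest.
// Returns new team objects and leaves the input untouched.
function appendUnassigned(teams: Team[], unassignedIds: string[]): Team[] {
  const result = teams.map((t) => ({ id: t.id, memberIds: t.memberIds.slice() }));
  if (unassignedIds.length === 0) return result;
  if (result.length === 0) {
    throw new Error("cannot place unassigned students without any teams");
  }
  for (const studentId of unassignedIds) {
    // On ties, the earliest team in the list wins.
    const smallest = result.reduce((min, t) =>
      t.memberIds.length < min.memberIds.length ? t : min
    );
    smallest.memberIds.push(studentId);
  }
  return result;
}
```

Recomputing the smallest team after every placement keeps team sizes as even as possible when several students are left over.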

Without that plumbing, the app is just a form wrapped around a timeout.

Limits

The output is only as good as the inputs. Skill levels are self-reported, which means some students oversell themselves, some undersell themselves, and some click through. The optimizer is still working over declared ability, not proven ability.

The whole system also depends on someone else’s server. If Edu2Com is down, Equiteams can still collect data and manage the class, but it cannot form teams. That is acceptable for a thesis project or a pilot deployment. It would need a local implementation before I would trust it as core infrastructure.

Even with those limits, the app works. Lecturers can set up a class, define an assignment, collect responses, and get a team split that follows a visible set of rules instead of whatever happened in the group chat that week.

Lessons learned

Integration is the hard part. The optimizer already existed. The real work was translating between what it expects and what real users provide: normalizing inputs, handling mismatched counts, surfacing errors that actually explain what went wrong.
Background jobs need first-class treatment. My first attempt used the synchronous Edu2Com endpoint, which took almost a minute at 40 students and timed out at 60. Switching to the background endpoint with HMAC-signed webhooks, retry logic, and stale-request cleanup was the single biggest improvement to reliability.
Self-reported data is noisy. Attention checks and timing guards helped, but I’d still want a way to cross-validate skill claims in a future version.
Working closely with a designer changed everything. Having Rafif own the UX direction while I focused on implementation meant we could move fast without either side being a bottleneck. I learned more about translating Figma into real components on this project than on anything before it.
Ship it, then fix it. The codebase has thesis-era seams. Some parts are solid, some parts are held together by deadline energy. But it works, and that matters more than perfect abstractions.


¹ Rafif owned the UI/UX direction, the color palette, layouts, onboarding flow, and all of the mascot work. Most days we were in Figma together, and then I turned that into the frontend and interaction layer.
² MBTI has well-known reliability problems, so I would not treat it as a deep statement about personality. It fit the API contract, and the app was built around that constraint.