To briefly summarize the last three posts in this series on “GroupSort”, in Part 1, we proposed the problem, and came up with a brute-force algorithm to come up with the ideal solution (I’ve previously called it “the naive algorithm”, but I feel this is a little misleading – “brute force” is a better description). We found that this algorithm worked well (duh) for small values, but for real-world class-sizes — as small as about 20 people — it would simply never compute in a reasonable time. So in Part 2, I sketched out the “team-captain algorithm” that I found was a good compromise of speed and a “good enough” solution. Then in Part 3, we performed some analysis, comparing the two approaches, to see if we could get a good numerical understanding of just how “good enough” our “good enough” solutions were. (Answer: good enough)
Today I wanted to pause and restate the problem, and sprinkle in some computer-science theory that I’ve since learned about this particular problem. I feel like my diving into code in the first few entries may have been a little unapproachable to a non-programmer, but there’s really no reason this very real-world problem shouldn’t be understandable by anyone.
The Problem Again
In the form of a “user story”, a teacher has a class of ~30 students. They regularly group the students together in groups of varying sizes, but likely from ~4 to ~8. The teacher spends a lot of time manually making groups, because they want to avoid the same people being grouped together all the time; they want to maximize the diversity of the groups. We want to write a program to do this automatically. It will require us to keep track of the history of groupings, and use this data to inform how best to make new groups.
Hey, here’s a picture!
This is a simple, but not-totally-trivial example of a class of 6 students. Let’s say we want to make 2 groups of 3, and maximize the pairing diversity. Well, if we’re doing this for the first time, that’s easy; there’s no wrong answer. They have no history, so any 2 groups of 3 students will have a historical overlap of zero.
But let’s say we’re part way into the year, and a few groupings have already been made so far. If we draw a line to connect each student, and label it with how many times those two students have been paired together, it might look something like this:
Yikes! This is a lot of data for a small picture. I’m gonna introduce some terminology, then I’ll clean it up a little.
In computer science, this diagram resembles what’s called a “graph”. It’s a particular data type of interest. Each student (or their head) is called a node, and the line that tells us how many times those two students have been in the same group, that’s called an edge. (In many graphs, edges have no value; they are either there or they’re not. Our edges, which have a value, can be called weighted edges.)
To clean this up, let’s get rid of any edges with a value of 0, and we’ll color-code the remaining edges. Let’s say yellow is a 1-weighted edge, and orange is a 2-weighted edge.
A bit better, huh? Now we can glance at the image and get an idea of who has been paired with who more frequently, and who hasn’t been paired with who at all! For those less visually inclined, our code has been representing this data as a hash table – the pair of student IDs is the key, and the historical count of pairings is the value. If we number off the students clockwise from the top, the graph above could be represented as the following hash table:
(1, 2): 1
(1, 3): 2
(1, 4): 2
(2, 3): 2
(2, 4): 1
(2, 5): 2
(2, 6): 2
(3, 4): 1
(3, 6): 1
(4, 5): 1
(5, 6): 2
(Where, again, we just drop any zero-value pairings)
Now, in Part 1 of this series, I glibly refer to our problem as being “NP-Complete”. Without getting too technical in explaining what this means (and certainly not simply because I don’t fully understand it myself), for our purposes we can say that solving this problem for numbers above 10 students becomes exponentially more difficult, to the point where we can’t realistically solve it, even in a number of days, if our values get too high.
Have you heard of cliques? Well, in our problem, the computer-science meaning of the word “clique”, and the common understanding of that word are very similar! The clique problem is an NP-complete problem. It’s the problem of trying to find, from a given graph, the largest interconnected set of nodes. When a set of nodes (or students) is fully interconnected, they’re called a clique.
Well, our problem is a little different. We’re actually looking for something like the opposite. We want to use the edge-weighting, to find the set of groups of students with the smallest overall edge sum. This is a class of edge-weighted clique problem. That’s a term you can look up! Much greater minds than mine have written much on this particular class of problem, so again, I’m fairly satisfied to have a good-enough solution that works fast. Fast enough for a realtime web or mobile app anyway.
In conclusion, that’s me once again basking in the sheer interestingness of this particular problem. I hope it sheds a little light on my approach in previous entries to solving it!
Looking ahead, I want to make a few simple tweaks to my algorithm, then approach turning this abstract & very simple program into a keen web application that a teacher could actually use! Join me then, won’t you?