Introduction
Hanzimatic is a Mandarin learning website. It keeps track of the words you know, and generates flashcards suited to you: new sentences that contain only one unknown word.
I made Hanzimatic primarily for my own Chinese studies, but I’d love for others to use it! To do that, Hanzimatic needs a quick way to estimate which words a new user already knows, so they can jump straight into using the site. I wanted to make a test for folks to get started. Ideally, this test would be:
- fast: it shouldn’t quiz learners for too long, so they can get started learning new words quickly
- adaptive: it should quiz learners with words they have a decent chance of knowing, so they’re not intimidated by complex words or bored by easy ones.
Turns out there’s quite a bit of interesting maths behind doing this well! This blog post documents my explorations developing this test using my own data from 2+ years of Mandarin language learning.
Fitting curves
This study says that the relationship between word frequency and likelihood of knowing the word follows a logistic curve:
Word frequency versus probability of being known
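In other words, if $ x $ is a word’s log-frequency, the probability of knowing it is modelled as $ P(\text{known} \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}} $, the standard logistic form with parameters $ \beta_0 $ and $ \beta_1 $ fitted to the data.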
Can I check whether this is true for the words I know? I collected all of the words that I’ve added to Hanzimatic and plotted them.
import pandas as pd

# All of the words I've marked as known in Hanzimatic.
known_words = Preferences.objects.filter(id=1).first().known_words_and_cards()
# One row per dictionary word: its frequency and whether I know it (1 or 0).
df = pd.DataFrame(
    [(w.frequency, int(w.text in known_words)) for w in Word.objects.all()],
    columns=['Frequency', 'Known'])
df.plot(kind='scatter', x='Frequency', y='Known', s=32, alpha=.8)

Word log-frequency vs familiarity
The scatterplot shows all of the words in the dictionary, with whether I know them or not plotted along the y-axis. The x-axis here is the log-frequency of the word.
What can we tell from looking at this graph?
- I know most of the very-frequent words: there are no unknown words with frequency above 10 or so.
- There’s no such cutoff at the low end: I know quite a few words with low frequency. [1]
What does it look like if we try to fit a logistic curve to this, to try to predict whether I know a word given its frequency?
from sklearn.linear_model import LogisticRegression

# Predict P(known) from the frequency column alone.
model = LogisticRegression()
x_data = df['Frequency'].values.reshape(-1, 1)  # sklearn expects a 2D feature array
y_data = df['Known'].values
model.fit(x_data, y_data)
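To overlay the fitted curve on the scatter plot, I do something like the sketch below (the plotting details here are illustrative, reusing df, x_data and model from above):

import numpy as np
import matplotlib.pyplot as plt

# Evaluate the fitted model on an evenly-spaced grid of log-frequencies.
grid = np.linspace(x_data.min(), x_data.max(), 200).reshape(-1, 1)
probs = model.predict_proba(grid)[:, 1]  # P(known) along the curve

ax = df.plot(kind='scatter', x='Frequency', y='Known', s=32, alpha=.8)
ax.plot(grid.ravel(), probs, color='red')  # fitted logistic curve
plt.show()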
Plotting this, we get:

Fitting a logistic model to the data
This looks roughly the right shape. But is it a good predictor? We can measure the log-loss and compare it against a very simple model that just predicts the fraction of words that I know:
import numpy as np
from sklearn.metrics import log_loss

base_frequency = df['Known'].mean()
l_fit_data = np.full(y_data.shape, base_frequency)  # flat baseline: always predict the mean
m_fit_data = model.predict_proba(x_data)[:, 1]      # logistic model's predicted P(known)
linear_log_loss = log_loss(y_data, l_fit_data)
model_log_loss = log_loss(y_data, m_fit_data)
print(f"y-intercept model Log Loss: {linear_log_loss}")
print(f"Logistic model Log Loss: {model_log_loss}")

Comparing logistic and flat linear models
The log-loss of the logistic model is $ 0.17 $ and the log-loss of the flat y-intercept model is $ 0.24 $. Looks like the logistic method does better – good!
Now that we know how to fit a logistic model to known and unknown words, let’s return to the original question: How do we make a test that estimates the number of words that learners know?
Making it interactive
We make a test that does the following (sketched in code after the list):
- Select a word that we reckon has a 50% chance of being known, and ask the user if they know it.
- Fit a logistic model to the words that we’ve tested so far.
- Repeat until we have a good idea of the words that they know.
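Here’s a rough sketch of that loop in Python. To be clear, this is my own simplification rather than the code running on the site: the word list and log-frequencies, the ask_user callback, the fallback step size before a model can be fitted, and the default question budget are all assumptions.

import numpy as np
from sklearn.linear_model import LogisticRegression

def run_adaptive_test(words, log_freqs, ask_user, n_questions=50):
    """words: list of word strings; log_freqs: parallel list of log-frequencies;
    ask_user(word) -> True if the learner says they know the word.
    Returns the fitted model, or None if every answer was the same."""
    log_freqs = np.asarray(log_freqs, dtype=float)
    untested = set(range(len(words)))
    xs, ys = [], []
    model = None
    target = np.median(log_freqs)  # initial guess before we can fit anything
    for _ in range(n_questions):
        # Ask about the untested word whose log-frequency is closest to the
        # frequency we currently think has a 50% chance of being known.
        idx = min(untested, key=lambda i: abs(log_freqs[i] - target))
        untested.remove(idx)
        xs.append(log_freqs[idx])
        ys.append(int(ask_user(words[idx])))
        if len(set(ys)) == 2:
            # We've seen both a known and an unknown word, so we can refit
            # the logistic model on everything tested so far.
            model = LogisticRegression()
            model.fit(np.array(xs).reshape(-1, 1), np.array(ys))
            # The 50% point of a logistic curve sits at -intercept / coefficient.
            target = -model.intercept_[0] / model.coef_[0][0]
        else:
            # No model yet: step harder (lower frequency) after a correct
            # answer, easier after a miss. The step size of 1 is arbitrary.
            target += -1.0 if ys[-1] else 1.0
    return model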
You can try this test here. Give it a go!
Here are the words that it picked for me. The y-axis represents how many words the test thinks I know before testing me with each word.

Estimated words before each test question
You can see that the oscillations in the estimate are initially large, reducing over time as it gets an idea of the words that I know. When I get a streak of words correct, it “rewards” me with progressively harder words until it finds one that I don’t know. The estimate converges around 2500, which is a pretty good estimate of the words that I can read and recognize on sight. I’ve added ~6000 flashcards to Hanzimatic, but recognizing words in the context of a sentence is very different to reading them in isolation.
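One natural way to turn the fitted model into the single number plotted above is to sum its predicted probabilities over every word in the dictionary, which gives the expected number of known words under the model. A sketch, reusing df from earlier and a fitted model:

# Expected number of known words = sum of P(known) over the dictionary.
all_freqs = df['Frequency'].values.reshape(-1, 1)
estimated_known = model.predict_proba(all_freqs)[:, 1].sum()
print(f"Estimated words known: {estimated_known:.0f}")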
I played around with the number of questions in the test and found that 50 questions seemed to be a good number. At this point, the estimated number of known words only changes by 50-100 with each new question.
Final thoughts
- I was happy with how quickly this test can come up with a good estimate! It takes me only about 5 minutes to run through it. But…
- 50 questions is a long time to hold someone’s attention, especially if they’re using the site for the first time. I might have better luck just asking new folks to pick “beginner” or “intermediate” and marking a preset number of the most frequent words as known – but then I wouldn’t have so much fun with statistics.
- I can imagine using more information than just word frequency to detect words that someone is likely to know. For example, if someone knows the words “木星” (Jupiter) and “天王星” (Uranus), they are likely to know “冥王星” (Pluto). I could use something like word2vec to select better words for testing.
You can try this test here. Contact me at [email protected] if you have thoughts!
[1] This might be because I learn the “fun” words as well as the actually-useful ones. I’m planning a future blog post on this!