Show HN: Mandarin Word Segmenter with Translation
12 comments

·February 4, 2025

I've built mandoBot, a web app that segments and translates Mandarin Chinese text. This is a Django API (using Django-Ninja and PostgreSQL) and a NextJS front-end (with Typescript and Chakra). For a sample of what this app does, head to https://mandobot.netlify.app/?share_id=e8PZ8KFE5Y. This is my presentation of the first chapter of a classic story from the Republican era of Chinese fiction, Diary of a Madman by Lu Xun. Other chapters are located in the "Reading Room" section of the app.

This app exists because reading Mandarin is very hard for learners (like me), since Mandarin text does not separate words using spaces in the same way Western languages do. But extensive reading is the most effective way to learn vocabulary and grammar. Thus, learning Mandarin by reading requires first memorizing hundreds or thousands of words, before you can even know where one word ends and the next word begins.

I'm solving this problem by allowing users to input Mandarin text, which is then computationally segmented and machine translated by my server, which also adds dictionary definitions for each word and character. The hard part is the segmentation: it turns out that "Chinese Word Segmentation"[0] is the central problem in Chinese Natural Language Processing; no current solutions reach 100% accuracy, whether they're from Stanford[1], Academia Sinica[2], or Tsing Hua University[3]. This includes every LLM currently available.

I could talk about this for hours, but the bottom line is that this app is a way to develop my full-stack skills; the backend should be fast, accurate, secure, well-tested, and well-documented, and the front-end should be pretty, secure, well-tested, responsive, and accessible. I am the sole developer, and I'm open to any comments and suggestions: roberto.loja+hn@gmail.com

Thanks HN!

[0] https://en.wikipedia.org/wiki/Chinese_word-segmented_writing

[1] https://nlp.stanford.edu/software/segmenter.shtml

[2] https://ckip.iis.sinica.edu.tw/project/ws

[3] http://thulac.thunlp.org/

rahimnathwani

This is cool. If you haven't already, you might like to take a look at Du Chinese and The Chairman's Bao. They might provide ideas or inspiration.

Also the 'clip reader' feature in Pleco is decent.

Also, supporting simplified as well as traditional might increase your potential audience.

imron

Nice work OP.

I’ve done a fair amount of Chinese language segmentation programming - and yeah it’s not easy, especially as you reach for higher levels of accuracy.

You need to put in significant effort just to gain a few percentage points of accuracy.

For my own tools, which focus on speed (and are used for finding frequently occurring words in large bodies of text), I ended up opting for a first-longest-match algorithm.

It has a relatively high error rate, but it’s acceptable if you’re only looking for the first few hundred frequently used words.
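A first-longest-match (greedy maximum-matching) segmenter can be sketched in a few lines. This is a minimal illustration of the general technique, not the commenter's actual tool; the toy dictionary and `max_len` cutoff are assumptions for the example.

```python
def longest_match_segment(text, dictionary, max_len=4):
    """Greedily take the longest dictionary word at each position;
    fall back to a single character when nothing matches."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words

# Toy dictionary: 中文 "Chinese", 分词 "word segmentation", 很 "very", 难 "hard"
dictionary = {"中文", "分词", "很", "难"}
print(longest_match_segment("中文分词很难", dictionary))
# → ['中文', '分词', '很', '难']
```

The error rate comes from cases where the greedy choice crosses a real word boundary, which is exactly the trade-off the comment describes: fast and simple, but wrong on ambiguous spans.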

What segmenter are you using, or have you developed your own?

routerl

Thanks for the kind words!

I'm using Jieba[0] because it hits a nice balance of fast and accurate. But I'm initializing it with a custom dictionary (~800k entries), and have added several layers of heuristic post-segmentation. For example, Jieba tends to split up chengyu into two words, but I've decided they should be displayed as a single word, since chengyu are typically a single entry in dictionaries.

[0] https://github.com/fxsjy/jieba
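One possible shape for the chengyu merge described above: re-join adjacent segments whose concatenation is itself a dictionary entry. This is a hedged sketch, not mandoBot's actual post-processing code; the function name and toy dictionary are illustrative.

```python
def merge_adjacent(segments, dictionary):
    """Re-join adjacent segments whose concatenation is a dictionary
    entry (e.g. a chengyu that the segmenter split in two)."""
    merged = []
    i = 0
    while i < len(segments):
        if i + 1 < len(segments) and segments[i] + segments[i + 1] in dictionary:
            merged.append(segments[i] + segments[i + 1])
            i += 2  # consume both halves of the merged word
        else:
            merged.append(segments[i])
            i += 1
    return merged

# 画蛇添足 ("to draw a snake and add feet") is one dictionary entry,
# even if a segmenter emits it as 画蛇 + 添足.
print(merge_adjacent(["画蛇", "添足", "了"], {"画蛇添足"}))
# → ['画蛇添足', '了']
```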

rmccrear

[dead]

sarabande

纔 in this case should use the definition of 才 (cai2) not (shan1) which is extremely uncommon. Otherwise, cool app!

routerl

Could you post the text you used? This kind of thing goes straight into my unit tests.

I'm also working on showing all the pronunciations/definitions for a given hanzi, it should be ready later this week.

sarabande

I used the example sentence from your link.

routerl

Got it, thanks!

bnly

Nicely done, this looks quite useful!

maxglute

Very well executed.

thaumasiotes

> Thus, learning Mandarin by reading requires first memorizing hundreds or thousands of words, before you can even know where one word ends and the next word begins.

That's not true at all; you can go a long way just by clicking on characters in Pleco, and Pleco's segmentation algorithm is awful. (Specifically, it's greedy "find the longest substring starting at the selected character for which a dictionary entry exists".)
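The greedy lookup described in that parenthetical can be sketched as follows. This is an assumption about the behavior as the commenter characterizes it, not Pleco's actual implementation; the dictionary and `max_len` are illustrative.

```python
def lookup_at(text, pos, dictionary, max_len=8):
    """Return the longest dictionary entry starting at text[pos]."""
    for length in range(min(max_len, len(text) - pos), 0, -1):
        candidate = text[pos:pos + length]
        if candidate in dictionary:
            return candidate
    return text[pos]  # no entry found: fall back to the single character

# Toy dictionary: 天 "sky", 天气 "weather", 天气预报 "weather forecast"
dictionary = {"天", "天气", "天气预报"}
print(lookup_at("天气预报说晴", 0, dictionary))
# → '天气预报'
```

Note this only segments from the character the user selects, which is why it works well enough for click-to-look-up reading even though it would make a poor whole-sentence segmenter.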

Sometimes I go back through very old conversations in Chinese and notice that I completely misunderstood something. That's an unfortunate but normal part of the language-learning process. You don't need full comprehension to learn. What would babies do?

hassleblad23

Great work OP.