It’s been almost a year since I moved back to China, and I’m still struggling with Chinese.
回国几乎一年了,目前我还在苦读中文。
Unlike English, Chinese has no spaces between words. Figuring out the proper segmentation of the words in a sentence can often be a mind-numbing task for Chinese learners, especially when the sentence contains words and characters they are not familiar with.
不像英文,中文并没有空格。对于还在学习汉字的人来说,很多时候分词是一件令人费解的事,尤其当句中含有学习者不熟悉的词与字。
It has occurred to me that many Chinese learners, myself included, would be able to perform word segmentation more efficiently if we could preview the difficult words in each paragraph (i.e. the Chinese words we are likely not familiar with) and have each word annotated with its pinyin and a rough definition. This would also improve the whole reading experience.
在阅读一篇文章前,如果我能事先预览段落中有哪些难词(也就是很大机率我不熟悉的词语),以及这些词语的拼音与大致意思,我相信我(包括很多中文的学习者)将能更高效率地进行分词。这也会让整个阅读体验变得更棒。
And it would be even nicer if there were a simple procedure for entering everything we need to remember (i.e. the words we are not familiar with, together with their pinyin and definitions) into a system like Anki[1], where we can later use active recall[2] and spaced repetition[3] to commit these words to long-term memory efficiently.
然后更棒的是如果能有一个极其方便的流程来为学习者把文章中不熟悉的词语,连带拼音与意思,输入进一个类似Anki[1]的学习系统里,有助接下来进行「活性回忆」[2]与「间隔重复」[3]的练习,来提高词语进入长期记忆的效率。
[1]: Anki - Powerful, intelligent flash cards
[2]: Retrieval-Based Learning: Active Retrieval Promotes Meaningful Learning (2012)
● ● ●
After many burnouts and failures (which included screwing up my MiraclePlus interview), and after realising that the video editor project I was working on was not going anywhere, I decided to work on a chatbot assistant that can help me be more productive at learning Chinese. And maybe others will find it useful as well =)
经历了多次倦怠与失败(其中包括亲手把我的奇绩创坛面试搞砸),并意识到我正在做的视频编辑器项目也走不了多远,我决定我想做一个chatbot助理,来帮助我更高效率地学习中文。或许其他学习者也会发现它有用 =)
And Archy the Anki Bot 0.0.1 was born.
就这样吖奇说Anki助理0.0.1出世了。
The Use Cases 用例
1: Extract difficult Chinese words from WeChat articles.
一、从微信公众号文章中提取难词。
2: Annotate Chinese words with pinyin and rough definitions (expressed in English).
二、给词语标上拼音与英文大意。
3: Generate a deck of Anki notes from Chinese words.
三、从一组词语生成一组Anki卡片。
Design & Implementation & Demo 设计与履行与演示
Basically we would have an ArticleAnalysor, a TextAnalysor, a Lexicographer, and an AnkiDeckGenerator. And we would integrate everything in main.ts where we handle Wechaty callbacks.
基本来说,我们会有一个ArticleAnalysor、TextAnalysor、Lexicographer、AnkiDeckGenerator。然后,我们会在处理Wechaty回调的main.ts中合并所有东西。
For the current use cases, we would use the ArticleAnalysor to extract text from the WeChat article (using request & cheerio), the TextAnalysor to tokenise the text into words (using jieba with a pretrained model in paddle), the Lexicographer to assign a difficulty score to each word (using an ad hoc formula with Chih-Hao’s Chinese characters meta-data), as well as to give English definitions and pinyin to selected words (using CC-CEDICT). And lastly the AnkiDeckGenerator is for generating a deck of Anki notes (using genanki).
实现当前用例,我们将使用ArticleAnalysor来从微信文章获取文本(request + cheerio)、TextAnalysor来对文本进行分词(jieba + paddle中一个训练好的模型)、Lexicographer为每个词语分配一个难度分数(一个随意的公式 + Chih-Hao的汉字元数据),以及为词语提供英语定义和拼音(CC-CEDICT)。最后,AnkiDeckGenerator将用来生成一组Anki卡片(genanki)。
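To make the Lexicographer step a bit more concrete, here is a minimal sketch of the idea, not the actual code in the repo. It assumes the difficulty score is driven by the rarest character in a word (the real ad hoc formula is not shown here), and the character ranks and the sample dictionary line are made up for illustration. The names difficulty, parseCedictLine and annotate are hypothetical. The line layout "Traditional Simplified [pinyin] /definition/definition/" follows the CC-CEDICT text format.

```typescript
// Hypothetical character metadata: frequency rank, 1 = most common.
// These numbers are invented for illustration only.
const charRank = new Map<string, number>([
  ["的", 1],
  ["我", 10],
  ["倦", 2500],
  ["怠", 3000],
]);

// Rank assumed for characters missing from the table (treated as rare).
const UNKNOWN_RANK = 5000;

// Assumed formula: a word is as hard as its rarest character, scaled to [0, 1].
function difficulty(word: string): number {
  const chars = Array.from(word);
  if (chars.length === 0) return 0;
  const ranks = chars.map((ch) => charRank.get(ch) ?? UNKNOWN_RANK);
  return Math.min(Math.max(...ranks) / UNKNOWN_RANK, 1);
}

interface DictEntry {
  traditional: string;
  simplified: string;
  pinyin: string;
  definitions: string[];
}

// Parse one line in the CC-CEDICT text layout; comment lines start with "#".
function parseCedictLine(line: string): DictEntry | null {
  if (line.startsWith("#")) return null;
  const m = line.match(/^(\S+)\s+(\S+)\s+\[([^\]]*)\]\s+\/(.*)\/\s*$/);
  if (!m) return null;
  return {
    traditional: m[1],
    simplified: m[2],
    pinyin: m[3],
    definitions: m[4].split("/"),
  };
}

// Keep only the words at or above the threshold and attach a dictionary entry if one exists.
function annotate(words: string[], dict: Map<string, DictEntry>, threshold: number) {
  return words
    .filter((w) => difficulty(w) >= threshold)
    .map((w) => ({ word: w, score: difficulty(w), entry: dict.get(w) ?? null }));
}

// Usage with an illustrative (not verbatim) dictionary line.
const dict = new Map<string, DictEntry>();
const sample = parseCedictLine("倦怠 倦怠 [juan4 dai4] /worn out/exhausted/");
if (sample) dict.set(sample.simplified, sample);

const difficultWords = annotate(["我", "的", "倦怠"], dict, 0.5);
console.log(difficultWords.map((d) => d.word));
```

Taking the maximum rank rather than the average is just one possible choice: a single unfamiliar character is usually enough to make a word hard to read.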
Gluing everything together functionally, this is what we get:
函数式地把所有东西粘起来,即可得出:
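As a rough illustration of what gluing the four components together functionally could look like, here is a self-contained sketch. The function types, the stub implementations and the name makePipeline are all assumptions made for this example. The stubs stand in for request & cheerio, jieba and CC-CEDICT, and the deck is written as tab-separated text (a format Anki can import) instead of using genanki.

```typescript
interface Note {
  word: string;
  pinyin: string;
  definition: string;
}

// Assumed shapes for the four components, each modelled as a plain function.
type ArticleAnalysor = (url: string) => Promise<string>;
type TextAnalysor = (text: string) => Promise<string[]>;
type Lexicographer = (words: string[]) => Note[];
type AnkiDeckGenerator = (notes: Note[]) => string;

// Compose the stages: URL -> text -> words -> annotated notes -> deck.
const makePipeline =
  (
    extract: ArticleAnalysor,
    tokenise: TextAnalysor,
    annotate: Lexicographer,
    generate: AnkiDeckGenerator,
  ) =>
  async (url: string): Promise<string> => {
    const text = await extract(url);
    const words = await tokenise(text);
    const unique = Array.from(new Set(words)); // one note per distinct word
    return generate(annotate(unique));
  };

// Stub: returns fixed text instead of fetching and parsing a WeChat article.
const fakeExtract: ArticleAnalysor = async () => "我 感到 倦怠 倦怠";

// Stub: splits on spaces as a stand-in for a real segmenter.
const fakeTokenise: TextAnalysor = async (text) =>
  text.split(" ").filter((w) => w.length > 0);

// Stub: a one-entry glossary with an illustrative definition.
const glossary = new Map<string, { pinyin: string; definition: string }>([
  ["倦怠", { pinyin: "juan4 dai4", definition: "worn out" }],
]);

const fakeAnnotate: Lexicographer = (words) => {
  const notes: Note[] = [];
  for (const word of words) {
    const hit = glossary.get(word);
    if (hit) notes.push({ word, pinyin: hit.pinyin, definition: hit.definition });
  }
  return notes;
};

// One note per line, fields separated by tabs.
const toTsv: AnkiDeckGenerator = (notes) =>
  notes.map((n) => [n.word, n.pinyin, n.definition].join("\t")).join("\n");

const archy = makePipeline(fakeExtract, fakeTokenise, fakeAnnotate, toTsv);
archy("https://example.com/article").then((deck) => console.log(deck));
```

Because every stage is just a function, each one can be swapped out or tested on its own, and a message callback in main.ts would only need to call the composed function with the article link.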
What’s Next? 接下来呢?
Still in the midst of planning, but here are some rough ideas:
- Refinements 功能改良
  - As we can see, the ad hoc word difficulty scoring formula isn’t performing very well at the moment. I need to experiment with it, perhaps scraping some text and combining BERT with a self-trained model, to arrive at a more accurate scoring system.
  - The pretrained paddle model in jieba works well in general, but it can still give unsatisfying results (e.g. when a sentence contains a person’s name). Besides trying out different models, my plan is to engineer around the problem (i.e. to have results that always make sense to the users) using tools like Stanford NLP’s Stanza, or to approach the problem differently.
  - I’m also thinking about extending the Lexicographer to include definitions from different dictionaries, as well as online search results that are useful to language learners.
- MiniApp & Premium Version & The Future 小程序与会员版与未来打算
  - Anki is an amazing and very powerful tool, but I feel it is too exam-oriented, in the sense that it is best utilised by people (e.g. medical students) who aim to do well in an upcoming exam. And from a UI/UX perspective it has a steep learning curve. I’m currently working on a WeChat and TikTok MiniApp inspired by Anki but with a more laid-back take on it. The end product will be a nicely designed tool for people who want to improve their Chinese in order to read and speak better rather than to score well in exams. The premium version would come with a chatbot assistant like Archy the Anki bot.
  - Archy the Anki bot will always remain free and open-source on GitHub. I will continue to improve it as I work on the commercial aspect of the project described above, so that I can continue doing this full-time and maybe it can become ramen profitable. 🍜 🍜 🍜
  - If things go well, I would like to scale it up to cover other languages (e.g. English, Japanese, German), as well as to go beyond language learning and become a full-fledged note-taking productivity tool for autodidacts. It will be like Notion but more for remembering stuff and visualising knowledge representation. And at the core of it would be a cross-platform chatbot assistant* =) At the moment I’m reading up on how to train a model to do handwritten diagram recognition (e.g. mind maps, UML, flow charts) as well as looking into visual languages like Chalktalk. ⚗️ ⚗️ ⚗️

*: In general, from a product perspective, I believe a chatbot is a great I/O into the world, especially as social media apps become the new browsers.
Lastly 最后
Huge thanks to
- contributors of the jieba library for making jieba such an amazing tool!
- the CC-CEDICT community for doing such a great job and licensing it under CC BY-SA 3.0!
- contributors of the genanki library for writing such an easy-to-use tool!
- the Wechaty community and everyone involved in making Wechaty such a wonderful lib! And the Juzi.bot team for opening up their padplus protocol ecosystem to outsiders like me!
If you are interested in the development of this project feel free to follow Archy.sh on WeChat and TikTok or join our mailing list =)
Also please feel free to fork my repo, deploy your own bot, or just do anything with the code, or open issues if there are any! Thanks!
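p.s. I got a bit tired and lazy partway through writing the Chinese 🥴 Once the 「吖奇说记忆卡片」 MiniApp is live, more about where things are headed (in Chinese and English) will be posted on my WeChat official account~ If you are interested, feel free to follow my official account and Douyin @吖奇说~
Author: Archy Will He 何魏奇, who has been founding startups on and off for eight years without making it big, and is currently working full-time on the 吖奇说 (Archy.sh) project.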
Github Repo: Archy the Anki bot (吖奇说Anki助理)