The most common Hanzi in the Chinese language

The Work

I’m going to introduce a work that is related to two of my fields of interest: the Chinese language and programming.
It is a study of how frequently the most common Chinese Hanzi (characters) appear in the modern language. I’m interested in this topic because I’m currently studying Chinese and I was wondering which words and characters I should become confident with as soon as possible.
When we study a language, we often aim to pass certain exams (HSK in my case), and each language exam usually has a set of words and grammar points to know. This is especially true for Chinese, because of its lexical characteristics, and vocabulary lists exist for each HSK level (I suggest HSK Academy).

Similar works already exist, but my plans are wider: I’m going to gather a lot of data, and I want to do it properly so that I’ll be able to use it for other purposes.

So, how did I decide to estimate the modern Hanzi frequency?
Well… it’s quite simple: I chose a big online Chinese news site and started downloading its news articles by parsing its RSS data. I do this with a Python script that is launched automatically every 20 minutes, 24 hours a day, 7 days a week.
Up to now (7 days since the start) I have gathered almost 2,000 articles, for a total of about 6 MB of parsed and “clean” data (just plain Chinese text), that is, about 1.7 million characters (UTF-8 encoding).
One last note: the official written standard in mainland China is simplified Chinese, so I expect to work with simplified characters.

Programming details

As I said, I’m using a Python script; the language is Python 3 (version 3.4).
First I define the sources, that is, the RSS links.

The first two variables are used to “sanitize” the news links (we’ll see later); the third is a list of pairs of values. In each pair, the first value is the actual link and the second is a key I assign to the source.
The first version of my code had a “manual” RSS parsing algorithm, for which I used a simple custom HTMLParser object. However, I later switched to the feedparser library, which is easier and more compact because it is specialized for feeds.

Here I call feedparser’s parse function: it simply downloads the specified RSS page and parses it, producing a dictionary-like structure representing what has been downloaded, not much different from a JSON object.
So, for each RSS page, I just extract the data I need, building a dictionary object that is then appended to a list.
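The extraction step can be sketched roughly like this. The `collect_entries` name, the fields kept and the sample data are assumptions made for illustration; the function only relies on the parsed feed behaving like a dictionary with an `entries` list.

```python
def collect_entries(parsed_feed, source_key):
    """Turn one parsed RSS feed into a list of plain dictionaries."""
    articles = []
    for entry in parsed_feed.get("entries", []):
        link = entry.get("link")
        if not link:
            continue  # skip entries without a usable link
        articles.append({
            "link": link,
            "title": entry.get("title", ""),
            "published": entry.get("published", ""),
            "source": source_key,  # the key assigned to this RSS source
        })
    return articles


# Hand-written stand-in for a parsed feed, so the sketch runs offline.
sample_feed = {
    "entries": [
        {
            "link": "http://example.com/news/a.html",
            "title": "标题一",
            "published": "Thu, 01 Jan 2015 08:00:00 GMT",
        },
        {"title": "entry without a link"},
    ]
}
articles = collect_entries(sample_feed, "world")
```

With feedparser installed, the result of `feedparser.parse(url)` should be usable in place of `sample_feed`, since it is a dictionary-like object exposing the same `entries` list.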

Up to this point it was straightforward, wasn’t it?
But now I have to consider one fact: since the RSS sources give only the latest 15 news items/articles per category, I have to consistently record them somewhere…
I do that by saving the entries in a binary file that is updated each time the script runs. The binary RSS file is loaded first if it exists; otherwise a new empty binary file is created. With the pickle library I load the article list, update it, and “dump” it back into the binary file.
The update phase is simple: since the structure I use is a dictionary with the article links as IDs, I just insert the retrieved RSS entries with no worries about duplicates. I also record the time of the latest news item and the time of the last update (script run).
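A minimal sketch of this load, update and dump cycle, assuming a store shaped like `{"articles": {...}, "last_update": ...}`. The function names, the file name and the store layout are hypothetical, and this version writes the file only when saving:

```python
import os
import pickle
import tempfile
import time


def load_store(path):
    """Load the saved store, or start an empty one if the file is missing."""
    if os.path.exists(path):
        with open(path, "rb") as handle:
            return pickle.load(handle)
    return {"articles": {}, "last_update": None}


def update_store(store, entries):
    """Insert entries keyed by link; a known link is simply overwritten."""
    for entry in entries:
        store["articles"][entry["link"]] = entry
    store["last_update"] = time.time()  # time of this script run
    return store


def save_store(store, path):
    """Dump the whole store back into the binary file."""
    with open(path, "wb") as handle:
        pickle.dump(store, handle)


# Demo in a temporary directory: the duplicate link is stored only once.
demo_path = os.path.join(tempfile.mkdtemp(), "rss_entries.bin")
store = load_store(demo_path)
update_store(store, [
    {"link": "http://example.com/a.html"},
    {"link": "http://example.com/a.html"},
    {"link": "http://example.com/b.html"},
])
save_store(store, demo_path)
reloaded = load_store(demo_path)
```

Keying the dictionary by link is what makes the update idempotent: running the script twice on the same feed leaves the store unchanged apart from the timestamp.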

The last phase now is the actual download of the articles.
I have the links, so I just need to download and parse the pages. In order to do this I use a custom HTMLParser object.

What I do is simple: I search for the start of the article body, that is, the start of the div with the ‘artibody’ id; then I write down every paragraph, appending it to a str variable (the <p> tag check is needed because there are also non-text elements in the article body).
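A parser along these lines could look like the sketch below. The class name, the depth counter for nested divs and the sample HTML are assumptions; only the ‘artibody’ id and the <p> check come from the description above.

```python
from html.parser import HTMLParser


class ArticleBodyParser(HTMLParser):
    """Collect the text of <p> elements found inside <div id="artibody">."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.div_depth = 0         # > 0 while inside the article body div
        self.in_paragraph = False  # True while inside a <p> of the body
        self.text = ""

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            if self.div_depth > 0:
                self.div_depth += 1   # nested div inside the body
            elif dict(attrs).get("id") == "artibody":
                self.div_depth = 1    # the article body starts here
        elif tag == "p" and self.div_depth > 0:
            self.in_paragraph = True

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth > 0:
            self.div_depth -= 1
        elif tag == "p" and self.in_paragraph:
            self.in_paragraph = False
            self.text += "\n"         # one line per paragraph

    def handle_data(self, data):
        if self.in_paragraph:
            self.text += data


sample = (
    '<div id="header"><p>menu</p></div>'
    '<div id="artibody"><p>第一段。</p>'
    '<div class="ad"><img src="x.png"></div>'
    '<p>第二段。</p></div><p>footer</p>'
)
parser = ArticleBodyParser()
parser.feed(sample)
parser.close()
```

Counting div depth instead of using a simple flag keeps the parser inside the article body even when it contains nested divs, such as the non-text elements mentioned above.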
The actual download is performed by the following code:

I must unescape the response and replace some tags, since the parser has some problems dealing with the raw input (I never really understood the exact problem, though…).
So, in the end, the script gets the latest RSS entries, saves the links in a binary file and downloads the pages into local plain text files, using the original HTML page name as the filename. Notice that only pages not already downloaded are retrieved, and if you delete local pages, they will be downloaded again.
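The “only download what is missing” logic can be sketched roughly as follows. The function names, the `fetch` and `extract` parameters and the UTF-8 decoding are assumptions made for illustration; the demo uses a fake fetch function so that it runs without network access.

```python
import html
import os
import tempfile
import urllib.request
from urllib.parse import urlparse


def local_path(url, out_dir):
    """Map an article URL to a local file named after the HTML page."""
    page_name = os.path.basename(urlparse(url).path)  # e.g. "a.html"
    return os.path.join(out_dir, page_name)


def fetch_page(url):
    """Download a page and unescape HTML entities (UTF-8 is an assumption)."""
    with urllib.request.urlopen(url) as response:
        raw = response.read().decode("utf-8", errors="replace")
    return html.unescape(raw)


def download_missing(links, out_dir, fetch=fetch_page, extract=lambda page: page):
    """Fetch only the links with no local file yet; return the new paths."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for url in links:
        path = local_path(url, out_dir)
        if os.path.exists(path):
            continue  # already downloaded on a previous run
        text = extract(fetch(url))  # extract: page HTML -> plain text
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(text)
        written.append(path)
    return written


# Demo with a fake fetch function: the second run finds nothing to do.
demo_dir = tempfile.mkdtemp()
links = ["http://example.com/news/a.html", "http://example.com/news/b.html"]
fake_fetch = lambda url: "page for " + url
first_run = download_missing(links, demo_dir, fetch=fake_fetch)
second_run = download_missing(links, demo_dir, fetch=fake_fetch)
```

Because the check is simply whether the local file exists, deleting a file is enough to have that page fetched again on the next run.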

Now that I have the data, I must process it.
Assume that we are fine with just counting the Hanzi characters for now… the task is really trivial, as you’ll see.
I just load all the pages and count the characters that are valid Hanzi chars; a “valid” Hanzi char is a value that matches a precise regex pattern as in the code below:
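A minimal sketch of such a counter, assuming the pattern covers only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF). The exact range, the names and the sample strings are assumptions:

```python
import re
from collections import Counter

# Assumed definition of a "valid" Hanzi: one code point in U+4E00..U+9FFF.
HANZI_PATTERN = re.compile(r"[\u4e00-\u9fff]")


def count_hanzi(texts):
    """Count every Hanzi occurrence across an iterable of strings."""
    counts = Counter()
    for text in texts:
        counts.update(HANZI_PATTERN.findall(text))
    return counts


counts = count_hanzi(["我是学生。", "Hello, 我的书！"])
total = sum(counts.values())  # all Hanzi occurrences
distinct = len(counts)        # number of different Hanzi
```

Chinese punctuation such as 。 and ！ falls outside this range, so it is skipped together with Latin letters and digits.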

I have everything now… I just need to show it!
A horizontal bar chart is fine for this purpose, so let’s do it:

Just note that pyplot by itself was not able to render Chinese characters, so I had to specify a proper FontProperties object.

The final result
By far, the most common Hanzi is 的, and this is no surprise since this character has many different uses as a particle in Chinese.
Second place goes to 一, which means “one” but is also used to compose many other words (e.g. adverbs).
Third place goes to 人, “person”, and this character too can be used in many ways.
After the top three we find 在, which means “to stay somewhere” and is often used to specify the place where an action takes place; 了, a grammatical particle; 是 and 有, the verbs “to be” and “to have” respectively, which have many other uses too; and so on…
But now let’s also take a look at the least common Hanzi. We just need to change the order of the sort function, switching the reverse parameter from True to False:
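A minimal sketch of that switch, assuming the frequencies are kept in a `Counter`. The `ranked` helper and the toy data are hypothetical:

```python
from collections import Counter


def ranked(counts, n, reverse=True):
    """Return the first n (character, count) pairs, sorted by count."""
    pairs = sorted(counts.items(), key=lambda item: item[1], reverse=reverse)
    return pairs[:n]


counts = Counter("的的的一一人")
most_common = ranked(counts, 2, reverse=True)    # highest counts first
least_common = ranked(counts, 2, reverse=False)  # lowest counts first
```

The same list of pairs can feed the plot in both cases; only the sort direction changes.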

And this is the result:

The least common character is the first one in the plot, though the 50 least common characters all have a count of 2, so you just see a nice rectangular rainbow.
At the top of the graph we find 荻, an indigenous Asian plant, while the second Hanzi should refer to a Chinese type of serving spoon but is also related to ladybugs…
I leave the rest to you 🙂

Some recap data at this time: I processed 1689934 Han characters for a total of 4398 distinct Hanzi.
We must consider that the modern Chinese language counts about 3,500 commonly used Hanzi, as stated in 现代汉语常用字表 (Chart of Common Characters of Modern Chinese), but overall we could count up to double this value (7,000), as stated in 现代汉语通用字表 (Chart of Generally Utilized Characters of Modern Chinese). Source: Wikipedia.
One last clarification: these data were collected in a newspaper context, a “serious” news website, so the data are certainly biased by the topics covered. So, for example, I don’t expect to see much about food, clothing, holidays…

And now?

OK… what else could I do with all the data?
Because of space constraints I couldn’t show the full character list, so I’m going to provide a link to an HTML page with all the frequencies; I’d also like to add translations where possible using the open source CC-CEDICT dictionary data.
I’m also already working on merging characters to build statistics for entire words (verbs, adverbs, adjectives…) rather than single characters. I’m trying this approach with the PyPLNR Python package.
Another approach I could follow is to split the data by RSS source, since each RSS link is related to a specific topic.

I’ll soon post new works so stay tuned!
See you!

Update: continued in part 2.
