More on common Chinese words: adjectives, verbs, nouns

In the last (and first) two blog posts I showed how I was able to retrieve Hanzi frequencies from an online news website. I also showed you how to produce an appealing HTML output for the processed data with the Jinja2 Python package.
But now I want to dig deeper and extract more useful information. Up to now I have worked on single Hanzi, that is, characters that are not necessarily words in the Chinese language.

What I’m going to do is to apply the same frequency analysis to entire Chinese words: adjectives, verbs, nouns…

However, this is not that easy. In many Western languages like English, French, Italian… it is trivial to split a sentence into words (they are separated by spaces and punctuation) and not too difficult to categorize them (you just need a vocabulary and some minimal context analysis).
Unfortunately, dealing with the Chinese language is not so simple, since words are not split by lexical separators and word order can be quite hard for an algorithm to make sense of. We also know that the very same word can often be used as a verb, an adjective or even a noun.

So we have two problems: how do we split Hanzi characters into words? And how can we tell whether a word is a verb, a noun or an adjective?
Google provided me with an answer: the PyNLPIR Python package.

PyNLPIR is a wrapper around the NLPIR software and lets us segment a Chinese text into single words, also providing part-of-speech information. Even though the package author himself states that segmentation and classification errors can occur in the process, I simply accept these mistakes, since I'm working with thousands of words and the overall result is still valid.

The library interface is very simple: once you set the encoding, you can feed a function with the text and get back a list of two-element tuples, each containing a segmented word and its estimated part of speech.
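For example, the basic workflow looks roughly like this (a minimal sketch based on PyNLPIR's documented interface; the sample sentence and variable names are only illustrative):

    import pynlpir

    pynlpir.open()  # initialize the NLPIR parser (UTF-8 encoding by default)

    sample = '我爱北京天安门'  # any Chinese text
    # segment() returns a list of (word, part of speech) tuples
    for word, pos in pynlpir.segment(sample, pos_tagging=True):
        print(word, pos)

    pynlpir.close()  # release the NLPIR resources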


Implementation details

Not much to say here: I can start from what I did for the single Hanzi analysis and enrich it.
I'm just going to modify the part where I match the Hanzi with a regex and count them.

The new pieces (illustrated in the sketch after this list) are:

  • the pynlpir.open() call with the encoding specification. This initializes the parser and specifies how to handle encoding errors (in my case I ignore them).
  • the second regex object for matching words. Unmatched tokens are essentially punctuation, numbers and non-Chinese words.
  • the five new dictionaries where I count words (with no part-of-speech distinction), nouns, adjectives, verbs and classifiers.
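A minimal sketch of such a counting loop could look like the following. Note that the regex, the sample text and the dictionary names here are only illustrative, not the exact code from my script, and the part-of-speech labels rely on PyNLPIR's default English category names:

    import re
    from collections import Counter

    import pynlpir

    # Illustrative sketch: count word frequencies per part of speech.
    pynlpir.open(encoding_errors='ignore')  # ignore encoding errors while parsing

    hanzi_word_re = re.compile(r'[\u4e00-\u9fff]+')  # Chinese-only tokens

    # one counter (a dict subclass) per category
    words, nouns, adjectives, verbs, classifiers = (Counter() for _ in range(5))
    by_pos = {'noun': nouns, 'adjective': adjectives,
              'verb': verbs, 'classifier': classifiers}

    text = '我爱北京天安门，天安门上太阳升。'  # placeholder for the scraped news text

    for word, pos in pynlpir.segment(text, pos_tagging=True):
        if not hanzi_word_re.fullmatch(word):
            continue  # skip punctuation, numbers and non-Chinese tokens
        words[word] += 1
        if pos in by_pos:
            by_pos[pos][word] += 1

    pynlpir.close()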

The same output functions I used for single Hanzi work fine, since the dictionaries I am producing here are all compatible. I just specify, of course, different output files.
Really sweet, such easy work this time!


Results

As with single characters, I provide an HTML page for each word category. You can see the results by following the links:
General words frequency
Adjectives frequency
Nouns frequency
Verbs frequency
Classifier frequency

Enjoy!
