The most common Hanzi in Chinese language (part 2)

Hello again,

Last time we saw how I retrieve, store and process Hanzi data to build my (for now basic) statistics for the list of the most common Chinese characters.
Today I won’t go much further: I’ll just talk about how I managed to present the data I collected in a clearer and more detailed way.

I’ll show you some basic usage of Jinja2, a Python package for working with templates.

I’m quite new to this, since I don’t usually work on the web presentation layer and the only other template engine I have worked with is JSF for Java (and I didn’t like it much…).

I found Jinja2 really pleasant and really easy to use. Oh… and the funny thing is that “Jinja” (神社 in Japanese) means “Shinto shrine”, and Jinja2’s logo is precisely a stylized Shinto shrine. Not Chinese, but it’s still fun to notice the oriental “touch”.


The programming part

Thanks to the work we did last time, we now have a list of all the Hanzi we found in the articles, ordered by frequency. I need to take them and compose an HTML page. I’ll go with something easy: I’m not a web artist and I don’t care 🙂 A simple table will work just fine.

Before starting to work on the data I collected, I need one more thing. I want to offer something a bit more useful than a simple list of Hanzi: I’d also like to provide a brief translation for each of them.
Working with Hanzi and translations is not easy even with a dictionary, since the same character in Chinese can mean many things and can also have several different pronunciations.
Anyway, the only good source of free dictionary data in an easy-to-process format that I was able to find is CC-CEDICT. The database they provide for free is just a plain text file that follows simple format rules.

At this point we can simply read and load the dictionary; here’s a simple function I wrote:

The code is commented, so I don’t need to explain much more about it. However, there are a couple of things to consider here…
First, why am I working with both simplified and traditional Hanzi? Well, at first I tried with only the simplified ones, but I soon noticed that many Hanzi from the articles were actually written in traditional form, and I want to use them too. The problem here is whether to consider a character in the two standards as the same character or as two different ones. For the sake of simplicity I decided to treat them as different ones (even if I still keep a “reminder” from simplified to traditional and vice versa in my dictionary structure).
Second, why is my dictionary structure a dict of lists and not just a dict? In Python a dict is a structure that relates unique keys to values. Here the keys are our Hanzi and the values are pinyin and translations… but a single Hanzi can have more than one pinyin, so I need a list to manage them. When we parse the dictionary file, we can find many occurrences of the same character with different pinyin (the Hanzi-pinyin pair is the unique key of this dataset), and that’s why each Chinese character is associated with a list (of pinyin).
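
To make the dict-of-lists idea concrete, here is a minimal sketch of a loader for CC-CEDICT-style lines. It is not the function from this post: the name `load_cedict`, the field names (`pinyin`, `translations`, `kind`, `other`) and the abridged sample lines are all assumptions made for illustration. It only relies on the published entry layout, `TRADITIONAL SIMPLIFIED [pinyin] /gloss 1/gloss 2/`, with comment lines starting with `#`.

```python
import re
from collections import defaultdict

# CC-CEDICT entry layout: TRADITIONAL SIMPLIFIED [pin1 yin1] /gloss 1/gloss 2/
ENTRY_RE = re.compile(r"^(\S+)\s+(\S+)\s+\[([^\]]*)\]\s+/(.*)/\s*$")


def load_cedict(lines):
    """Map each single Hanzi to a list of readings (hypothetical helper)."""
    dictionary = defaultdict(list)
    for line in lines:
        if line.startswith("#"):
            continue  # comment or header line
        match = ENTRY_RE.match(line)
        if match is None:
            continue  # not a dictionary entry
        trad, simp, pinyin, glosses = match.groups()
        if len(simp) != 1:
            continue  # keep single characters only, skip multi-character words
        translations = glosses.split("/")
        if trad == simp:
            # Same form in both standards: store it once.
            dictionary[simp].append({"pinyin": pinyin, "translations": translations,
                                     "kind": "both", "other": simp})
        else:
            # Treat the two forms as different characters, each pointing at the other.
            dictionary[simp].append({"pinyin": pinyin, "translations": translations,
                                     "kind": "simplified", "other": trad})
            dictionary[trad].append({"pinyin": pinyin, "translations": translations,
                                     "kind": "traditional", "other": simp})
    return dictionary


# Abridged, illustrative lines in the CC-CEDICT layout (not copied from the real file).
sample = [
    "# sample header comment",
    "長 长 [chang2] /length/long/",
    "長 长 [zhang3] /chief/to grow/",
    "好 好 [hao3] /good/well/",
]
cedict = load_cedict(sample)
print(len(cedict["长"]))  # one list entry per pinyin reading
```

With a real file, something like `load_cedict(open(path, encoding="utf-8"))` should work the same way, since the function only needs an iterable of lines.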

The next step is to join the dictionary to our Hanzi count. Here I want to produce an easy-to-use structure for my template.

After loading the dictionary, I merge it with the Hanzi count to build a new structure (a list) with all the characters (in order) and their corresponding pinyin and translations.
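
As a rough sketch of that merge (again an assumption, not the original code), the hypothetical `build_rows` below flattens a frequency count and a dict-of-lists dictionary into one list of rows, one row per Hanzi-pinyin pair. The `multi` field follows the convention used by the template later in this post: the first row of a character carries the number of its readings, the following rows carry -1. The counts and dictionary entries are toy data.

```python
from collections import Counter


def build_rows(counts, dictionary):
    """Flatten counts + dictionary into template-ready rows (hypothetical helper)."""
    rows = []
    for rank, (hanzi, count) in enumerate(counts.most_common(), start=1):
        # Characters missing from the dictionary still get one (empty) row.
        readings = dictionary.get(hanzi) or [{"pinyin": "", "translations": []}]
        for index, reading in enumerate(readings):
            rows.append({
                "rank": rank,
                "count": count,
                "hanzi": hanzi,
                "pinyin": reading["pinyin"],
                "translations": reading["translations"],
                # N on the first row of a character, -1 on the others.
                "multi": len(readings) if index == 0 else -1,
            })
    return rows


# Toy data, made up for the example.
counts = Counter({"的": 120, "长": 45})
dictionary = {
    "的": [{"pinyin": "de5", "translations": ["of"]}],
    "长": [
        {"pinyin": "chang2", "translations": ["length", "long"]},
        {"pinyin": "zhang3", "translations": ["chief", "to grow"]},
    ],
}
rows = build_rows(counts, dictionary)
print([(row["hanzi"], row["multi"]) for row in rows])
```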

Jinja time!

Here I’m using Jinja for an “offline” task, since I simply write an HTML file.
First we need to create the Environment object that Jinja uses to do everything: this object tells Jinja where to find the templates (the ./templates directory), how to build the output and how to manage the global objects.
The shared object is built as a dictionary so that, in the template engine, its keys become variable identifiers bound to the associated values.
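
A minimal sketch of this offline workflow could look like the following. The template name `hanzi.html`, the context keys and the use of a temporary directory (so the example can run anywhere) are assumptions; `Environment`, `FileSystemLoader`, `get_template` and `render` are the standard Jinja2 calls.

```python
import tempfile
from pathlib import Path

from jinja2 import Environment, FileSystemLoader

# Work in a throwaway directory so the sketch does not touch real files.
workdir = Path(tempfile.mkdtemp())
template_dir = workdir / "templates"
template_dir.mkdir()
(template_dir / "hanzi.html").write_text(
    "<h1>{{ title }}</h1>\n<p>{{ rows|length }} rows</p>\n", encoding="utf-8"
)

# The Environment knows where the templates live and how to render them.
env = Environment(loader=FileSystemLoader(str(template_dir)), autoescape=True)
template = env.get_template("hanzi.html")

# The shared object: each key becomes a variable inside the template.
context = {"title": "Most common Hanzi", "rows": [{"hanzi": "的"}, {"hanzi": "长"}]}
html = template.render(context)

# "Offline" use: just write the rendered page to disk.
output_path = workdir / "hanzi.html"
output_path.write_text(html, encoding="utf-8")
```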

Let’s take a look at the template now… I only show the relevant parts, cutting out the purely aesthetic ones (mainly CSS styles).

A brief explanation of what is going on here…
Using a template is not much different from injecting PHP code into an HTML page. The difference is that the language is not PHP but something really close to Python (not exactly the same). Of course this is a very rough simplification, since there are many differences under the hood, but we don’t need to know them now.
In a template, the “base” is an HTML structure; we then put some “sugar” in it to make it a bit less “bitter”.
As in PHP, we have escape sequences to give commands to the template engine. There are mainly 3 of them:
>     {# … #}    for comments
>     {{ … }}     for simple output
>     {% … %} for directives (control structures, variable assignments…)
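
A tiny example with all three delimiters, rendered from an inline string (the variable name `characters` is made up for the example):

```python
from jinja2 import Template

source = (
    "{# this comment never reaches the output #}"
    "{% for hanzi in characters %}"   # directive: a control structure
    "<li>{{ hanzi }}</li>"            # simple output of a variable
    "{% endfor %}"
)
html = Template(source).render(characters=["的", "一", "是"])
print(html)  # <li>的</li><li>一</li><li>是</li>
```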

The output of my template is mainly an HTML table, so we start with a div and then a table tag followed by the header specification.
Here I do something that is not essential but makes the page easier to read: I want the rows of adjacent characters to have different background colours, so I set a boolean variable and negate it whenever a new character starts. The boolean tells the template which of two tr classes to use (<tr> is the HTML tag for table rows; the style classes are not shown here). Notice how I use the multi value, which tells how many different pinyin configurations we have for this Hanzi: a positive value means that I’m starting a new character, so I switch the style class, and it won’t change again until a new Hanzi is encountered. If multi = N > 0, the style stays the same for N rows, because the entries for the second pinyin configuration onwards have multi set to -1.
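
Here is a small sketch of that toggle. One caveat: in current Jinja2 releases, a plain `{% set %}` inside a `for` loop does not carry over to the next iteration, so this version keeps the flag in a `namespace` object (available since Jinja2 2.10). That may differ from the original template, and the class names `odd` and `even` are assumptions.

```python
from jinja2 import Template

source = (
    "{% set ns = namespace(odd=false) %}"
    "{% for row in rows %}"
    # Flip the flag only on the first row of a new Hanzi (multi > 0).
    "{% if row.multi > 0 %}{% set ns.odd = not ns.odd %}{% endif %}"
    '<tr class="{{ "odd" if ns.odd else "even" }}"><td>{{ row.pinyin }}</td></tr>\n'
    "{% endfor %}"
)

# Toy rows: one character with a single reading, one with two readings.
rows = [
    {"multi": 1, "pinyin": "de5"},
    {"multi": 2, "pinyin": "chang2"},
    {"multi": -1, "pinyin": "zhang3"},
]
html = Template(source).render(rows=rows)
print(html)
```

Both rows of the second character end up with the same class, which is the effect described above.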

Let’s write the actual data.
We check the multi variable again: if it is positive, we are writing a new Hanzi. If that Hanzi has N > 1 pinyin, I don’t want to repeat the same character N times, so we add a rowspan that tells the browser to merge N rows for the first 4 columns (rank position, count, Hanzi and traditional/simplified flag).
Pinyin and translation do not need rowspans, though for the sake of clarity the English translations are split onto separate lines (using the <p> tag for paragraphs).
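
The rowspan logic can be sketched like this. It is a simplified illustration, not the original template: only two of the four merged columns are shown, and the row field names are assumptions.

```python
from jinja2 import Template

source = (
    "{% for row in rows %}"
    "<tr>"
    # Merged cells are written only on the first row of each Hanzi.
    "{% if row.multi > 0 %}"
    '<td rowspan="{{ row.multi }}">{{ row.rank }}</td>'
    '<td rowspan="{{ row.multi }}">{{ row.hanzi }}</td>'
    "{% endif %}"
    "<td>{{ row.pinyin }}</td>"
    # One paragraph per translation.
    "<td>{% for t in row.translations %}<p>{{ t }}</p>{% endfor %}</td>"
    "</tr>\n"
    "{% endfor %}"
)

# Toy rows for a single character with two readings.
rows = [
    {"rank": 2, "hanzi": "长", "multi": 2, "pinyin": "chang2",
     "translations": ["length", "long"]},
    {"rank": 2, "hanzi": "长", "multi": -1, "pinyin": "zhang3",
     "translations": ["chief", "to grow"]},
]
html = Template(source).render(rows=rows)
print(html)
```

The second row contains only the pinyin and translation cells, because the browser fills the first columns from the merged cells above it.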

Good, we did it! We are now ready to take a look at the results, and I admit that I’m really satisfied. Let’s not waste any more time: here’s the link, I hope you like it too 🙂

See you again!

 
