Chinese Text Analyzer review:
I had the pleasure of reviewing Imron's new Chinese Text Analyzer program upon receiving a free license courtesy of Chinese Forums. You can download the program here.
I'm an upper-intermediate self-study student. I'm a heritage learner, so my spoken Chinese is much better than my reading. I thought this tool would be a great asset in helping me acquire better reading skills. I planned on using it with Pleco, so this review will discuss the integration of both applications.
I have a desktop running Windows and a MacBook Pro running Mac OS X, and I installed Chinese Text Analyzer on both. I used Wine to install Chinese Text Analyzer on the Mac, and there were no problems with installation. The Windows installation was much easier, as the program runs natively on Windows. It was fast, and I had no issues. It takes literally two button clicks to install on Windows. Most of this review will be based on my experience with the program on Windows.
The first step the program recommends is to import a list of your known words so that it can identify known and unknown words appropriately.
I use Pleco for all of my study and flashcards, so it has a complex list of all the words that I've been tested on and know. In Pleco you can define your known words however you like. I define my "known" words as words that have a score of greater than 1000 points.
The first thing I did was figure out how to export my list of known words from Pleco into the Text Analyzer. In Pleco, I did this by going to Organize Cards and making a New Category called "Known". I then used the search function in Organize Cards to search for score >= 1000, then I batch added all the cards that came up to the Known category.
I then used the Import/Export selection, changed Export cards to "cards in categories" instead of "all cards", and selected my Known category. I exported as a text file in UTF-8, and exported words only (no definitions), as I think Text Analyzer only requires a list of words.
I think the only downside to doing it this way is that, to keep your known words updated, you have to run the search manually to add new cards to your Known category and then manually export the known card list into Text Analyzer each time. I don't know if there is an easier way of exporting your known words out of Pleco, but this way worked for me.
Anyway, I then used the File Manager in Pleco to upload the file via Wi-Fi to my computer. (I love this feature of Pleco.) The exported file is just a txt file that you can then import into the Chinese Text Analyzer.
When you first install the Chinese Text Analyzer, it has a popup that says "Welcome to Chinese Text Analyzer! Before you begin you should import lists of words that you already know. Chinese Text Analyzer can read files exported by popular flashcard programs such as Pleco and Anki, or you can import words from pre-made lists of HSK vocabulary. Later on you can manually add words while you are reading Chinese content."
In this window, you can either click "Import..." or you can import words using File --> Import...
I imported my list of known words from Pleco, and it imported, but I would have liked to see a success message or something to let me know it worked ok. Rather, it just took me back to a blank screen, and I wasn't sure if anything had happened. I was able to get confirmation by going to Word Lists --> View Known and seeing a list of words there.
I then tried opening some reading practice files. By going to File --> Open, I was able to find my txt files, and they opened very quickly. I was even able to shift & select an entire folder's worth of files and open them at once. Even opening 10+ files, the program was very snappy. If you open multiple files at once, they open in individual tabs in the program, which is very nice. I did try to overload the program with a bunch of the longest texts I have, and it was still amazingly fast to analyze. However, I did notice that if you have more than about 7 or so tabs open, you will be unable to get to the tabs on the right, since there is no way to access tabs that don't fit on the screen. I don't know how important this is, as people theoretically won't be reading 10 books at once, but I thought I would note this finding.
I am very impressed by how fast the program opens and analyzes the documents. Here are some well-known novels that I've tested with their processing time in seconds (I opened all 4 at the same time using shift & select in the File --> Open box):
Journey to the West
The Three Kingdoms
The processing time is taken from the upper right statistics window, which I will describe in more depth later. It probably does vary based on your computer specs, and I have to admit my computer is pretty decked out for photo and video processing. But I imagine the program will run pretty fast on all computers, and I think the claim of segmenting a novel in under 1 second is definitely true.
The default font that the program uses is ok, but not my favorite. You can go to Format --> Font, and there are a few other font options that you may like more. I'm not sure where the program gets its fonts from - if it is using pre-installed fonts on your computer, or if the program has a set of fonts that comes with it - but I went through the font options that I had, and there are quite a lot of font options that do not display Chinese characters correctly or at all (white boxes). Given that the sole purpose of this program is to display Chinese text, I think it would be really helpful if you curated the available fonts to only those that display Chinese text. I didn't go through all of the options, but in my cursory look I would say that 80-90% of the font options are not suitable for Chinese characters. Again, I'm not sure if this varies based on what fonts you have pre-installed on your computer or not.
I don't know if this is an option, but it might be nice if you could include a more brush script-y type font. I like the FZKaiTi font available as an add on in Pleco.
Now on to the statistics windows on the right side. The top window appears to have statistics for the entire document, including total number of words, total known words, percent known words, number of unique words. I noticed that the headings "Known" and "Percent Known" are used under the "Total" and "Unique" categories, and I recommend you make a clearer division between the "Total" section and the "Unique" section. Otherwise, it might look like the "Known" and "Percent Known" are duplicates, but they have different numbers.
The program also lists some character statistics and File statistics. I'm not really sure how important the File Statistics are, but I guess it doesn't hurt to have them there. I probably would never really look at it though in real usage.
One additional statistic that I think would be good to have is Number of Unknown words in a document. This way you could get an idea of how many words are left to learn for any particular text. I guess you could always calculate this yourself with number Unique minus number Known unique, but it shouldn't be hard to implement the Number of Unknown as well, which may be more helpful than the number of known words.
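For what it's worth, the calculation I mean is simple enough to sketch. This is just an illustration in Python with a made-up word list and a hypothetical function name, not how Chinese Text Analyzer computes its statistics internally:

```python
def unique_word_stats(document_words, known_words):
    """Return (unique, known unique, unknown unique) counts for a segmented text."""
    unique = set(document_words)
    known_unique = unique & set(known_words)
    unknown_unique = unique - set(known_words)
    return len(unique), len(known_unique), len(unknown_unique)


# Hypothetical sample: words from a segmented text and a known-word list.
document_words = ["我", "喜欢", "看", "书", "我", "喜欢", "瘟疫", "妖魔"]
known_words = ["我", "喜欢", "看", "书"]

unique, known, unknown = unique_word_stats(document_words, known_words)
print(unique, known, unknown)  # 6 4 2
```

The unknown count is always Unique minus Known unique, so showing it would just save the reader the subtraction.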
The bottom right window has statistics broken down for each word. For each word, it lists Frequency, % Frequency, Cumulative % Frequency, and First Occurrence. I think the Frequency and % Frequency columns are the most important, especially if you want to prioritize vocabulary studying. You can very easily sort words by frequency.
I'm honestly not sure what "Cumulative % Frequency" means, and I was not able to figure it out.
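My best guess is that it is a running total of the % Frequency column with the words sorted from most to least frequent, so it would tell you what share of the text you cover by knowing every word down to that row. Here is a minimal sketch in Python under that assumption. The function name and sample words are made up, and I have not confirmed this is how the program calculates the column:

```python
from collections import Counter


def frequency_table(words):
    """Return rows of (word, count, percent, cumulative percent), most frequent first."""
    counts = Counter(words)
    total = sum(counts.values())
    rows = []
    running = 0
    for word, count in counts.most_common():
        running += count  # running total of occurrences seen so far
        rows.append((word, count, 100 * count / total, 100 * running / total))
    return rows


# Hypothetical text of 10 words: 的 x5, 我 x3, 走 x2.
sample = ["的"] * 5 + ["我"] * 3 + ["走"] * 2
for row in frequency_table(sample):
    print(row)
```

If that reading is right, the last row always ends at 100%, and the row where the column crosses, say, 80% shows how many of the top words you would need to know to recognize 80% of the words in the text.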
I'm not sure how helpful the "First Occurrence" column is either. I haven't determined a use for it.
I did notice that if you double click anywhere in the row for a word in the bottom right window, it will automatically take you to the first occurrence of that word and will highlight all other occurrences in pink. One additional feature I would like to see is a set of left/right arrows so you can easily go to the next occurrence of the word in a long document and see each place the word is used in context. I think you can use the Edit --> Find feature for this as well, but it would be nicely streamlined if a set of left/right arrows popped up when you double clicked a row in the word statistics window, without having a window blocking your text. Or even better, have the left and right keyboard keys move between each instance of the word.
There are three tabs on the bottom of the window to look at All words, Known words only, or Unknown words only. I have no issues with that layout. There is also a search field, which I have not used extensively. I think it only works if you type the characters. Maybe one future feature could be allowing pinyin search as well.
Now my review of the reading experience. I imported a few documents, and there were quite a few words marked in red as unknown that I already knew; perhaps I just never made Pleco cards for them. I found it very annoying to have to right click a known word and mark it as known. I think it would be nice if there was a keyboard shortcut for marking words as known, maybe hitting the spacebar or enter key or something, to make this process easier and less intrusive on the reading experience. I just don't like having to right click and select from a text list to mark words as known. It really does take some of the flow away from reading, and I think hitting a keyboard key would improve that.
I'm not sure if Imron had in mind designing the Chinese Text Analyzer as just a tool to aid in picking which books/texts to read, or as a standalone reader, or both. But I've been spending some time with it, and I find that it is very helpful in determining how appropriate a text is to your vocabulary level. This is of course assuming you update your known words list periodically, which may be kind of a hassle.
However, I'm not sure if I will spend most of my time doing dedicated reading on it. I have to admit that I miss having a pop up dictionary feature. I understand that Imron left this out purposefully to discourage bad habits. I'm sure over time I can get used to not having a pop up dictionary and studying the unknown words independently, but as of now I'm finding it hard to give up the crutch. This is especially true in cases where not knowing a few key words in a sentence completely prevents you from understanding the meaning of the sentence. I do find it more challenging to read without a pop up dictionary, and there is somewhat of a mental block knowing that you don't have something convenient to fall back on.
One thought that I had for people who may choose to use the Chinese Text Analyzer as a dedicated reader is the fact that there is no Bookmark feature in the program. Especially for longer novel length books, it would be immensely helpful to have a bookmark feature so you don't have to find your place again if you stop reading and close the program. It may also be helpful to have an option to record notes in certain sections of the books or mark up places that you had difficulty reading and may want to go back and re-read after studying the vocabulary in that section.
Now I'll review the Export settings. I have tried the File --> Export --> To File settings. I have never tested the To Email, because I think it works through Microsoft Outlook, and I do not use Outlook.
When you go to File --> Export --> To File, a dialog box opens up with two tabs. The Document tab is first, and is sectioned into Document, Paragraph, and Word sections with "Pre" and "Post" under each section with a text box field. I actually do not know what these options do, as it was entirely unclear in the program. I think there should be a sentence or two of explanation here. I left all the fields blank, and it exported the entire document I had open with no changes. I do not know what the Pre and Post mean and what that tab is meant to do.
The second tab under Export is labeled Word List. This tab seemed much more intuitive. You can export All words, Known words, or Unknown words. I personally think that the default should be set to Unknown instead of All, as I think that is how most people will be using the program. I for one intend to use it to identify unknown words that need further study in Pleco, and I found that I very easily accidentally exported "All" words instead of "Unknown" words since All is currently the default. You can sort by Frequency, First Occurrence, or Word in ascending or descending order. I think the Frequency (Descending) as the default is appropriate for this one. You can select to export All rows, or the Top X number of rows (in case you want to just study the most frequently used 100 words in a novel for example). I think this is a very useful feature.
There are lots of fields available for export: Word, Simplified, Traditional, Simplified[Traditional], Pinyin (Tones), Pinyin (Numbers), English Definition, Sentence, Cloze Sentence, Frequency, % Frequency, Cumulative Frequency, First Occurrence. And you have the option of selecting as many fields as you want to export, so there is a lot of flexibility.
I'm not really sure what the difference between Word and Simplified is, since I exported both fields and they are the same in my test set. Perhaps it depends on what format the original document that the word came from uses. All of my texts were imported in Simplified.
Most of the other fields seem self-explanatory. I'm not sure what dictionary is used, but each word has several of the most common definitions separated by "/". My test set seemed to import fine into Microsoft Excel as a tab-delimited file.
One very interesting part of the Chinese Text Analyzer is its ability to export Sentences where your word is found. It seems to export the sentence that contains the first occurrence of the word. I did notice that when I exported the "Sentences" and "Cloze Sentences" fields, some of the fields exported with the previous sentence's period preceding them. An example:
Other than the leading period, it seems to parse the sentences well. Not all of the 100 words I exported in my test set had a leading period, but the majority of them did. It may have something to do with the source document I used, so this may vary for other people, I don't know.
I also tested the export function with both the Sentences field and Cloze Sentences field exported. I am not sure why, but some of the rows imported oddly into Excel: the Cloze sentence was cut off and put into a second row. I don't know whether this is caused by tabs in the actual text or by something else.
This example was exported with the fields: Word, Simplified, Traditional, English Definition, Sentence, Cloze Sentence. You can see that the Cloze sentence got put on the second row.
/to walk/to go/to run/to move (of vehicle)/to visit/to leave/to go away/to die (euph.)/from/through/away (in compound verbs, such as 撤走)/to change (shape, form, meaning)/
楔子 张 天 师 祈 禳 瘟疫 洪 太 尉 误 走 妖魔
楔子 张 天 师 祈 禳 瘟疫 洪 太 尉 误 [...] 妖魔
This is the sentence in the context of the actual text. Note that it is not an actual sentence: there are no periods (and no leading period), but it is followed by a return.
楔子 张天师祈禳瘟疫 洪太尉误走妖魔
I think the problem occurs when there is an Enter/return at the end of the sentence that the program picks. I haven't investigated this extensively, but I thought I would let you know there may be a slight bug with exporting of sentence fields. This of course is probably dependent on the quality of the text you are deriving the sentences from, I understand. But I think it might be worth investigating and seeing if these small issues are repeatable and can be fixed before the big release.
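To illustrate what I mean, here is a rough Python sketch of the kind of cleanup that would keep each word on one row: take the tabs and line breaks out of every field before joining the fields with tabs. This is only my guess at the cause, not how Chinese Text Analyzer works internally, and the function names are hypothetical:

```python
def clean_field(text):
    """Collapse any run of whitespace (tabs, returns, spaces) into one space."""
    return " ".join(text.split())


def to_tsv_row(fields):
    """Join cleaned fields with tabs so the row always stays on a single line."""
    return "\t".join(clean_field(field) for field in fields)


# The sentence from my example, with the trailing return that I suspect
# pushed the Cloze sentence onto a second row in Excel.
fields = [
    "走",
    "楔子 张天师祈禳瘟疫 洪太尉误走妖魔\n",
    "楔子 张天师祈禳瘟疫 洪太尉误[...]妖魔\n",
]
print(to_tsv_row(fields))
```

With the return removed, the row has exactly two tabs and no line break, so Excel should read it as three columns on one row.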
Overall I think the program is pretty useful and seems good. These were just some comments that I had while extensively exploring the program for a full day or so. I will try some more extensive reading using the program in the next few weeks and I'll give an update if necessary.