The Chinese Toolbox Document Analysis feature compares the current document (or all loaded documents) against the currently selected character frequency list. Settings in the “Document Analysis Settings” window allow some customization of the analysis.
In the screenshot on the left, the text of the first loaded document is shown. In the middle screenshot, the “Analyze Document” button/tab has already been clicked. Both the menu and the text of the analysis show which frequency list was used. You can also see the “Document Analysis Settings” window, where the number of characters to use for the analysis was specified. In this analysis all documents were merged and analyzed as a single document, as determined by the first checkbox in the “Document Analysis Settings” window. In the screenshot on the right, the Document menu lists the names of all the analyzed documents. These documents were imported into Chinese Toolbox 2012 from http://www.chineselearner.com/reading/chinese-translation/sherlock01-1.html.
Some interesting data is produced in the analysis, and you can use it however you like. The original intention was to determine how a set of Chinese documents could contribute to Chinese literacy. You can use this analysis in a number of ways:
This analysis feature provides three degrees of granularity:
The following is the analysis summary that appears in the screenshots above.
Analysis Summary. The detailed document analysis exists in DocumentAnalysis.u8 as a UTF-8 text file in your Chinese Toolbox document folder.
* Document: All loaded documents
* Frequency list: Modern Chinese Character Frequency List
* Total characters: 12954
* Total analyzed Chinese characters in document: 11181
* Total unique Chinese characters in document: 1263
* Characters in document exist within top 7928 of the frequency list.
* 9893 of 11181 (88.48%) of the characters in the document exist in the top 1000 of the current frequency list.
* 765 of 1263 (60.57%) of the unique characters in the document exist in the top 1000 of the current frequency list.
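The figures in this summary can be approximated with a short script. The following Python sketch is not the Chinese Toolbox implementation; it only illustrates one way such numbers could be computed. The names `is_chinese`, `analyze`, and `percent` are hypothetical. It also assumes that the frequency list is a string of characters ordered from most to least frequent, that "Total characters" means every character in the text, and that a "Chinese character" is any code point in the basic CJK Unified Ideographs block (U+4E00 to U+9FFF).

```python
from collections import Counter


def is_chinese(ch: str) -> bool:
    # Assumption: only the basic CJK Unified Ideographs block counts.
    # The rule Chinese Toolbox actually uses is not documented here.
    return "\u4e00" <= ch <= "\u9fff"


def percent(part: int, whole: int) -> float:
    # Percentage rounded to two decimals, as in the summary lines.
    return round(100 * part / whole, 2) if whole else 0.0


def analyze(text: str, frequency_list: str, top_n: int) -> dict:
    # Keep only the Chinese characters, in document order.
    chars = [ch for ch in text if is_chinese(ch)]
    counts = Counter(chars)
    # The "analyzed portion" of the frequency list: its first top_n entries.
    top = set(frequency_list[:top_n])
    covered = sum(n for ch, n in counts.items() if ch in top)
    unique_covered = sum(1 for ch in counts if ch in top)
    return {
        "total_characters": len(text),
        "analyzed": len(chars),
        "unique": len(counts),
        "covered": covered,
        "covered_pct": percent(covered, len(chars)),
        "unique_covered": unique_covered,
        "unique_covered_pct": percent(unique_covered, len(counts)),
    }
```

As a quick check against the summary above, `percent(9893, 11181)` gives 88.48 and `percent(765, 1263)` gives 60.57, matching the two coverage lines.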
The following is the contents of DocumentAnalysis.u8 with some of the information excluded. For display on this page, most of the characters in the lists have been removed. Only the first few characters are shown so you can see the data pattern.
* Document: All loaded documents
* Frequency list: Modern Chinese Character Frequency List
* Total characters: 12954
* Total analyzed Chinese characters in document: 11181
* Analyzed Chinese characters in document: 银色马一天早晨我 (MANY REMOVED)
* Total unique Chinese characters in document: 1263
* Unique Chinese characters in document: 银色马一天早晨我们起 (MANY REMOVED)
* Characters in document exist within top 7928 of the frequency list.
* Unique character counts: 1:银:11; 2:色:21; 3:马:138; 4:一:239; 5:天:32; 6:早:12; 7:晨:6; 8:我:252; 9:们:105; 10:起:22; (MANY REMOVED)
* Counts of the number of times frequency list characters occur in the analyzed document: 1:的:420; 2:一:239; 3:是:199; 4:不:133; 5:了:142; 6:在:166; 7:人:90; 8:有:114; 9:我:252; 10:他:172; (MANY REMOVED)
* 9893 of 11181 (88.48%) of the characters in the document exist in the top 1000 of the current frequency list.
* 765 of 1263 (60.57%) of the unique characters in the document exist in the top 1000 of the current frequency list.
* Document characters that exist in the analyzed portion of the frequency list: 银色马一天早我们一起 (MANY REMOVED)
* Document characters that do NOT exist in the analyzed portion of the frequency list: 晨餐摩穆摩皱眉刊 (MANY REMOVED)
* Unique document characters that exist in the analyzed portion of the frequency list: 银色马一天早我们 (MANY REMOVED)
* Unique document characters that do NOT exist in the analyzed portion of the frequency list: 晨餐摩穆皱眉刊售 (MANY REMOVED)
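The two count fields in the listing above use a repeating `index:character:count;` pattern, which is easy to read back into a program. The following Python sketch shows one possible way to parse such a field. The name `parse_counts` is hypothetical, and the pattern is inferred only from the sample entries shown on this page, so the real file may contain variations this does not handle.

```python
import re

# One entry looks like "3:马:138;": rank or position, character, count.
ENTRY = re.compile(r"(\d+):(.):(\d+);")


def parse_counts(field: str) -> list[tuple[int, str, int]]:
    # Return (index, character, count) tuples in the order they appear.
    # Text that does not match the pattern, such as a field label or
    # "(MANY REMOVED)", is skipped.
    return [(int(i), ch, int(n)) for i, ch, n in ENTRY.findall(field)]
```

For example, `parse_counts("1:银:11; 2:色:21; 3:马:138;")` yields three tuples, the last of which is `(3, "马", 138)`. When reading DocumentAnalysis.u8 itself, open it with `encoding="utf-8"`, since the file is described as a UTF-8 text file.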
See the Chinese Toolbox updates, especially the new Chinese Toolbox 13.1.0.5.