Wednesday, November 5, 2014

Chinese word segmentation

In course assignment 2, we practiced text classification, which was introduced in lecture two. The assignment helped me review and better understand the content of that lecture. Before we can classify text, we first have to split sentences into words, so word segmentation is a very important step in natural language processing (NLP). I think it is the basis of NLP tasks in social media analysis. In the assignment, we only dealt with English text. In English sentences, words are separated by spaces, which can be regarded as a natural delimiter.
Figure 1 English word segmentation 
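To make the English case concrete, here is a minimal Python sketch that splits a sentence on whitespace. The function name and the sample sentence are my own illustration, not part of the assignment, and real English tokenizers also have to handle punctuation, which this sketch ignores.

```python
def split_english(sentence):
    """Split an English sentence into words, using whitespace as the delimiter."""
    # str.split() with no arguments splits on any run of whitespace.
    return sentence.split()


words = split_english("I like social media analysis")
print(words)
```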
However, there is no apparent delimiter between words in Chinese sentences, so splitting Chinese sentences into words is more difficult than splitting English ones. I looked into this and found three kinds of segmentation algorithms that can be used to split Chinese sentences.


Figure 2 Chinese word segmentation
        

1.       Character matching method
This algorithm matches the Chinese string against a ‘sufficiently large’ machine dictionary according to a certain strategy. If a string is found in the dictionary, it is identified as a word. By scan direction, matching can be divided into forward matching and reverse matching. By which word length is given priority, it can be divided into longest matching and shortest matching.
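The forward, longest-match variant can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the toy dictionary, the `max_len` limit, and the rule that an unknown single character becomes its own word are all hypothetical choices, and a real system would use a much larger dictionary.

```python
def forward_longest_match(text, dictionary, max_len=4):
    """Scan left to right, always taking the longest dictionary word available."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink it one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Assumption: a single character not in the dictionary is kept as a word.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy dictionary, far smaller than a real one.
toy_dictionary = {"我们", "喜欢", "自然", "语言", "自然语言", "处理"}
result = forward_longest_match("我们喜欢自然语言处理", toy_dictionary)
print(result)
```

Because the longest match is preferred, "自然语言" is taken as one word here even though "自然" and "语言" are also in the dictionary. A reverse match would scan from the end of the string instead.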

2.       Understanding method
This method lets the computer simulate the way human beings understand a sentence in order to identify words. However, because of the generality and complexity of the Chinese language, it is difficult to organize the various kinds of linguistic information into a form that a machine can read directly. As a result, understanding-based word segmentation systems are still in the experimental stage.

3.       Statistics method
A Chinese word is a combination of Chinese characters. Therefore, the more often consecutive characters appear together in a text, the more likely they are to constitute a word. Counting how frequently consecutive characters co-occur is thus a way to split sentences into words. However, this method also has limitations: some groups of characters co-occur very frequently and are always extracted, even though they do not form a word.
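The counting idea can be sketched as follows. This is a simplified illustration, assuming we only look at pairs of adjacent characters and use a fixed frequency threshold; the toy corpus, the threshold of 2, and the function name are hypothetical.

```python
from collections import Counter


def count_adjacent_pairs(sentences):
    """Count how often each pair of adjacent characters occurs in the corpus."""
    counts = Counter()
    for sentence in sentences:
        # zip pairs each character with the one that follows it.
        for first, second in zip(sentence, sentence[1:]):
            counts[first + second] += 1
    return counts


# Hypothetical toy corpus; a real one would contain far more text.
toy_corpus = ["我喜欢语言", "语言很难", "我的语言"]
pair_counts = count_adjacent_pairs(toy_corpus)

# Assumption: a pair seen at least twice is treated as a candidate word.
candidates = [pair for pair, n in pair_counts.items() if n >= 2]
print(candidates)
```

In this tiny corpus only "语言" passes the threshold. On a larger corpus, pairs that cross a word boundary (such as "的语" from "我的语言") could also become frequent, which is the limitation described above.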

In summary, Chinese word segmentation is still very difficult. When we actually split Chinese sentences, we may need to combine the three methods above.