In course assignment 2, we practiced text classification, which was introduced
in lecture two. The assignment helped me review and better understand the
content of that lecture. Before doing text classification, we first have to
split sentences into words. Word segmentation is a fundamental step in natural
language processing (NLP), and I think it is the basis of NLP tasks in social
media analysis. In the assignment, we only dealt with English text. In English
sentences, words are separated by spaces, which can be regarded as a natural
delimiter.
Figure 1 English word segmentation
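As a minimal sketch of the idea, English text can be split on whitespace with Python's built-in `str.split`. The function name `split_english` is only an illustrative helper, not something from the assignment, and real tokenizers also have to handle punctuation, which this sketch ignores.

```python
def split_english(sentence):
    # str.split() with no arguments splits on any run of whitespace,
    # so the spaces between words act as the delimiter.
    return sentence.split()


words = split_english("we practice text classification")
print(words)
```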
However, there is no apparent delimiter between words in Chinese sentences, so
splitting Chinese sentences into words is more difficult than splitting English
ones. I found that there are three kinds of segmentation algorithms that can be
used to split Chinese sentences.
Figure 2 Chinese word segmentation
1. Character matching method
This algorithm matches the Chinese string against a ‘sufficiently large’ machine
dictionary according to a certain strategy. If a string is found in the
dictionary, a word is identified. According to the scan direction, matching can
be divided into forward matching and reverse matching. According to which word
lengths are given priority, it can be divided into longest matching and shortest
matching.
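The forward and reverse longest-match strategies described above might be sketched as follows. This is only an illustration under stated assumptions: the dictionary is a tiny hand-made set, `max_len` is an assumed upper bound on word length, and a single character is accepted as a fallback when nothing in the dictionary matches. The function names are hypothetical.

```python
def forward_longest_match(text, dictionary, max_len=4):
    """Scan left to right, always taking the longest dictionary word."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink it one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fallback (assumption): a single character is always accepted.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


def reverse_longest_match(text, dictionary, max_len=4):
    """Scan right to left, always taking the longest dictionary word."""
    words = []
    j = len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            candidate = text[j - size:j]
            if size == 1 or candidate in dictionary:
                # Insert at the front so the result stays in reading order.
                words.insert(0, candidate)
                j -= size
                break
    return words


# Toy dictionary, invented for this example only.
toy_dictionary = {"研究", "研究生", "生命", "起源"}
sentence = "研究生命起源"
print(forward_longest_match(sentence, toy_dictionary))
print(reverse_longest_match(sentence, toy_dictionary))
```

With this toy dictionary the forward scan gives 研究生 / 命 / 起源, while the reverse scan gives 研究 / 生命 / 起源, which shows that the scan direction can change the result.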
2. Understanding method
This segmentation method lets the computer simulate the human process of
understanding a sentence in order to identify words. However, because of the
generality and complexity of the Chinese language, it is difficult to organize
the various kinds of linguistic information into a form that a machine can read
directly. As a result, understanding-based word segmentation systems are still
at the experimental stage.
3. Statistics method
A Chinese word is a combination of Chinese characters. Therefore, the more often
consecutive characters occur together in a text, the more likely they are to
constitute a word. Consequently, counting how frequently adjacent characters
co-occur is one way to split sentences into words. However, this method also has
limitations: some character sequences with a high co-occurrence frequency are
extracted even though they do not form a word.
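A rough sketch of the counting step, assuming we only look at pairs of adjacent characters and use a tiny invented corpus. The helper name `adjacent_pair_counts` is hypothetical, and a real system would use a much larger corpus and a statistical score rather than raw counts.

```python
from collections import Counter


def adjacent_pair_counts(sentences):
    """Count how often each pair of adjacent characters occurs."""
    counts = Counter()
    for sentence in sentences:
        # zip pairs each character with the one immediately after it.
        for left, right in zip(sentence, sentence[1:]):
            counts[left + right] += 1
    return counts


# Toy corpus, invented for this example only.
toy_corpus = ["我的书", "我的笔", "你的书"]
counts = adjacent_pair_counts(toy_corpus)
print(counts.most_common(3))
```

In this toy corpus the pair 的书 occurs twice, as often as any other pair, yet it is not a word. This is the kind of limitation mentioned above.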
In summary, Chinese word segmentation is still very difficult. When we actually
split Chinese sentences, we may need to combine the three methods above.