In course assignment 2, we practiced text classification, which was introduced in lecture two. This assignment helped me review and better understand the content of that lecture. Before we can do text classification, we first have to split sentences into words. Word segmentation is very important for natural language processing (NLP), and I think it is the basis of NLP tasks in social media analysis. In the assignment we only deal with English text, and as we know, in English sentences there is a space between two words, which can be regarded as a natural delimiter.
Figure 1 English word segmentation
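For example, a minimal Python sketch (the sentence below is just a made-up example, and punctuation is ignored) shows that splitting on whitespace already gives a rough English tokenization:

```python
# A rough baseline for English word segmentation: whitespace is a natural
# delimiter, so str.split() is often enough (punctuation handling is skipped).
sentence = "We practice text classification in assignment two"
words = sentence.split()
print(words)
# ['We', 'practice', 'text', 'classification', 'in', 'assignment', 'two']
```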
However, there is no apparent delimiter between words in Chinese sentences, so it is more difficult to split Chinese sentences into words than English ones. I found that there are three kinds of segmentation algorithms that can be used to split Chinese sentences.
Figure 2 Chinese word segmentation
1. Character matching method
This algorithm matches the Chinese string against a 'sufficiently large' machine dictionary according to a certain strategy. If a string is found in the dictionary, a word is identified. According to the scan direction, the match can be divided into forward matching and reverse matching. According to the priority given to different word lengths, it can be divided into longest (maximum) matching and shortest (minimum) matching.
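Below is a minimal sketch of the forward longest-match idea; the toy dictionary, the example sentence and the maximum word length are all assumptions made up for illustration, not part of the assignment:

```python
# Forward maximum (longest) matching against a toy dictionary.
# Both toy_dict and the example sentence are illustrative assumptions.
def forward_max_match(text, dictionary, max_word_len=4):
    words = []
    i = 0
    while i < len(text):
        # try the longest candidate first, then shrink the window
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words

toy_dict = {"研究", "研究生", "生命", "起源"}
print(forward_max_match("研究生命起源", toy_dict))
# ['研究生', '命', '起源'] -- the greedy forward scan picks the wrong words here;
# a reverse (right-to-left) longest match would give ['研究', '生命', '起源'],
# which shows why the scan direction matters.
```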
2. Understanding method
This segmentation method lets the computer simulate the way human beings understand a sentence in order to identify words. However, because of the generality and complexity of the Chinese language, it is difficult to organize the various kinds of linguistic information into a form that a machine can read directly. As a result, understanding-based word segmentation systems are still at the experimental stage.
3. Statistics method
As we know, a Chinese word is a combination of Chinese characters. Therefore, in a text, the more often consecutive characters occur together, the more likely they are to constitute a word. Consequently, counting the frequency of consecutive characters that occur together is one way to split sentences into words. However, this method also has limitations: some character pairs with a high co-occurrence frequency are extracted even though they do not form a word.
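The sketch below illustrates the counting idea on a tiny made-up corpus; in practice a much larger corpus and a proper association measure (such as mutual information) would be used instead of raw counts:

```python
# Count how often adjacent character pairs (bigrams) co-occur; frequent
# pairs are candidate words. The three-sentence corpus is made up.
from collections import Counter

corpus = ["我们学习自然语言处理", "自然语言处理很有趣", "我们喜欢处理文本"]

bigram_counts = Counter()
for sentence in corpus:
    for i in range(len(sentence) - 1):
        bigram_counts[sentence[i:i + 2]] += 1

print(bigram_counts.most_common(6))
# Pairs like '处理', '我们' and '语言' rank highly and are real words, but
# '言处' is just as frequent without being a word -- the limitation noted above.
```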
In summary, Chinese word segmentation is still very difficult. When we actually split Chinese sentences, we may combine the above three segmentation methods.
Word segmentation is a key technique of NLP. Essentially, it is a statistical method. I have learned a lot from Cen Ge's article. Hope we can discuss it some time and thereby learn from you.
Thank you for your comment. I know you are familiar with NLP and know some algorithms as well. I also hope that I can learn NLP from you when you have time.
Chinese word segmentation is more complex than English. You show three methods to deal with Chinese word segmentation in this article, which is useful. Hope we can have a discussion to learn more.
Hi Zhou
It is very inspiring to read your post, especially when I am struggling with the Python assignment II. What impresses me most is that you not only mentioned the possible methods but also compared their advantages and disadvantages.
For NLP, it is true that Chinese word segmentation is hard to achieve.
Even though the segmentation problem is solved, we still need to wait a long time for a mature Chinese NLP system, since there is a lack of resources for doing NLP on Chinese words, for example a Chinese dictionary for NLP and a Chinese part-of-speech library.
Thanks for sharing. Chinese word segmentation is really a big problem. Even combining the three methods you present may not give a satisfying result. Sometimes we need to check the words manually. This is what we did before.
I think the statistics method requires much more pre-processing work on the data, like deleting stop words and garbage characters. It also needs the computer to learn from a large amount of data. It is important to set proper rules, and then we will get more accurate results.
Word segmentation is always hard to achieve. Considering sentence structure, meaning, or special expressions, it may be impossible to separate words from a sentence with 100% accuracy, whether in English or Chinese. It is interesting to learn something new about word segmentation from you! Thank you!!
Dear Cen, word segmentation is very hard to implement, especially in Chinese. There are a lot of models for handling Chinese word segmentation, such as HMM... I'm trying to learn and understand some of the basic ideas. Thank you.
Thank you for your sharing. By reading through the blog, we have learned how to build a single system with a single internal representation according to whatever user-defined standards. The solution you give us is fully implemented, and the module will be available in our final report.
Your idea is about sentence segmentation. I think this is related to grammar, but unfortunately it is difficult to analyze Chinese segmentation. Have you heard that the order of sentences in Chinese does not affect the understanding?
I think it is difficult to implement sentence segmentation for Chinese, as its structure is different from English. It would be challenging work. Hope you succeed with it.
It is very inspiring to read your post. Thank you for sharing. Chinese word segmentation has been a very important research topic, not only because it is usually the very first step in Chinese text processing, but also because high accuracy is a prerequisite for high-performance Chinese text processing such as Chinese input, speech recognition, machine translation, and language understanding.
Do you think it's difficult or easy to analyze Chinese?