Wednesday, November 5, 2014

Chinese word segmentation

In course assignment 2, we practiced text classification, which was introduced in lecture two. The assignment helped me review and better understand the content of that lecture. Before we classify text, we first have to split sentences into words. Word segmentation is therefore a very important step in natural language processing (NLP), and I think it is the basis of NLP tasks in social media analysis. In the assignment, we only dealt with English text. In English sentences, words are separated by spaces, which can be regarded as a natural delimiter.
Figure 1 English word segmentation 
However, there is no explicit delimiter between words in Chinese sentences, so splitting Chinese sentences into words is more difficult than splitting English ones. I found three kinds of segmentation algorithms that can be used to split Chinese sentences.


Figure 2 Chinese word segmentation
        

1.       Character matching method
This algorithm matches the Chinese string against a ‘sufficiently large’ machine dictionary according to a certain strategy. If a substring is found in the dictionary, a word is identified. By scan direction, matching can be divided into forward match and reverse match. By the priority given to different lengths, it can be divided into longest match and shortest match.
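The dictionary matching idea can be illustrated with a minimal sketch of forward and reverse longest match. The function names, the `max_len` parameter, and the toy dictionary below are hypothetical choices for illustration only; a real system would use a much larger dictionary. Unmatched single characters are emitted as one-character words, which is an assumption of this sketch.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy longest match, scanning left to right."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the dictionary hits.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character when nothing matches.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


def backward_max_match(text, dictionary, max_len=4):
    """Greedy longest match, scanning right to left."""
    words = []
    j = len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            candidate = text[j - size:j]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                j -= size
                break
    words.reverse()  # words were collected from the end of the sentence
    return words


# Toy dictionary, made up for this example.
toy_dict = {"研究", "研究生", "生命", "起源"}
print(forward_max_match("研究生命起源", toy_dict))   # ['研究生', '命', '起源']
print(backward_max_match("研究生命起源", toy_dict))  # ['研究', '生命', '起源']
```

The two scan directions disagree on this sentence, which shows why the matching strategy matters: the forward scan greedily takes 研究生 and leaves 命 stranded, while the reverse scan recovers 研究 / 生命 / 起源.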

2.       Understanding method
This segmentation method lets the computer simulate the way human beings understand a sentence in order to identify words. However, because the Chinese language is so general and complex, it is difficult to organize the various kinds of linguistic information into a form that a machine can read directly. As a result, understanding-based word segmentation systems are still at the experimental stage.

3.       Statistics method
As we know, a Chinese word is a combination of Chinese characters. Therefore, the more often adjacent characters occur together in a text, the more likely they are to form a word. Counting how often adjacent characters co-occur is thus a way to split sentences into words. However, this method also has limitations: some character pairs co-occur very frequently and are always extracted, even though they do not form a word.
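The counting idea can be sketched as follows: count adjacent character pairs in a corpus, then join neighbouring characters whose pair count reaches a threshold. This is only a rough illustration under stated assumptions; the function names, the `threshold` value, and the three-sentence toy corpus are all hypothetical, and practical statistical segmenters use far larger corpora and more refined scores than raw counts.

```python
from collections import Counter


def bigram_counts(corpus):
    """Count how often each pair of adjacent characters occurs."""
    counts = Counter()
    for sentence in corpus:
        for a, b in zip(sentence, sentence[1:]):
            counts[a + b] += 1
    return counts


def segment_by_counts(sentence, counts, threshold=2):
    """Join adjacent characters whose pair count reaches the threshold."""
    if not sentence:
        return []
    words = [sentence[0]]
    for prev, cur in zip(sentence, sentence[1:]):
        if counts[prev + cur] >= threshold:
            words[-1] += cur   # frequent pair: extend the current word
        else:
            words.append(cur)  # rare pair: start a new word
    return words


# Toy corpus, made up for this example.
toy_corpus = ["我喜欢语言", "语言很有趣", "他喜欢音乐"]
counts = bigram_counts(toy_corpus)
print(segment_by_counts("我喜欢语言", counts))  # ['我', '喜欢', '语言']
```

With raw counts, any pair that is merely frequent gets joined, whether or not it is a real word, which is exactly the limitation described above. Chains of frequent pairs would also be merged into one overly long word in this sketch.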

In summary, Chinese word segmentation is still very difficult. In practice, when we split Chinese sentences, we may need to combine the three segmentation methods above.

15 comments:

  1. Word segmentation is a key technique in NLP. Essentially, it is a statistical method. I have learned a lot from Cen Ge's article. I hope we can discuss it some time so that I can learn from you.

    1. Thank you for your comment. I know you are familiar with NLP and know some algorithms as well. I also hope that I can learn NLP from you when you have time.

  2. Chinese word segmentation is more complex than English. The three methods for Chinese word segmentation that you show in this article are useful. I hope we can have a discussion to learn more.

  3. Hi Zhou
    It is very inspiring to read your post, especially while I am struggling with Python assignment II. What impresses me most is that you not only mention the possible methods but also compare their advantages and disadvantages.

  4. For NLP, it is true that Chinese word segmentation is hard to achieve.
    Even if the segmentation problem is solved, we will still need to wait a long time for a mature Chinese NLP system, since resources for Chinese NLP are lacking, for example dictionaries of Chinese words and part-of-speech libraries for Chinese words.

  5. Thanks for sharing. Chinese word segmentation really is a big problem. Even combining the three methods you present may not give a satisfactory result. Sometimes we need to check the words manually. This is what we did before.

  6. This comment has been removed by the author.

  7. I think the statistics method requires much more pre-processing of the data, such as deleting stop words and garbled characters. It also needs the computer to learn from a large amount of data. It is important to set proper rules so that we get more accurate results.

  8. Word segmentation is always hard to achieve. Because of structure, meaning, or special expressions, it may be impossible to get 100% accuracy in separating words from a sentence, whether in English or Chinese. It is interesting to learn something new about word segmentation from you! Thank you!!

  9. Dear Cen, word segmentation is very hard to implement, especially in Chinese. There are a lot of models for handling Chinese word segmentation, such as HMM... I'm trying to learn and understand some of the basic ideas. Thank you.

  10. Thank you for sharing. By reading through the blog, we have learned how to build a single system with a single internal representation according to whatever user-defined standards. The solution you give us is fully implemented, and the module will be available in our final report.

  11. Your idea is about sentence segmentation. I think this is related to grammar, but unfortunately it is difficult to analyze Chinese segmentation. Have you heard that the order of sentences in Chinese does not affect understanding?

  12. I think it is difficult to implement sentence segmentation for Chinese, as its structure is different from English. It would be challenging work. I hope you succeed with it.

  13. It is very inspiring to read your post. Thank you for sharing. Chinese word segmentation has been a very important research topic, not only because it is usually the very first step in Chinese text processing, but also because high segmentation accuracy is a prerequisite for high-performance Chinese text processing such as Chinese input, speech recognition, machine translation, and language understanding.

  14. Do you think it's difficult or easy to analyze Chinese?
