Question

The input file is encoded as "UTF-8 without BOM", and each line looks like this:

( IP ( NP ( NP ( NR 上海 ) ( NR 浦东 ) ) ( NP ( NN 开发 ) ( NP ( CC 与 ) ( NP ( NN 法制 ) ( NN 建设 ) ) ) ) ) ( VP ( VV 同步 ) ) )

I want to use NLTK to build a tree from this string with

nltk.tree.Tree.fromstring

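However, my output comes out in the form "\u4e0a\u6d77".

How can I convert the output to UTF-8?

I also do not understand why the output for a (in the code below) is displayed as readable UTF-8 text.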

# -*- coding: utf-8 -*-
import nltk
tparse = nltk.tree.Tree.fromstring
import sys
reload(sys)
sys.setdefaultencoding('utf8')
class cal_prob:
    def __init__(self):
        pass
    def input_dataset(self, path="CTB-auto-pos/"):
        trainfile = open(path+"train.txt", "r+")
        datas = trainfile.read().split("\n")
        for data in datas:
            data = unicode(data) # change them to unicode
            print data
            tree = tparse(data)
            print tree
            print unicode(str(tree)).decode("utf8")
            print unicode(str(tree)).encode("utf8")
            break
        #
        a = u"(IP \n (NP (NP (NR \u4e0a\u6d77) (NR \u6d66\u4e1c)) (NP (NN \u5f00\u53d1) (NP (CC \u4e0e) (NP (NN \u6cd5\u5236) (NN \u5efa\u8bbe))))) (VP (VV \u540c\u6b65)))"
        print a
        print a.decode("utf8")
        trainfile.close()
a = cal_prob()
a.input_dataset()

Answer 1

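Here is an example that opens the encoded file correctly. There is no need for the reload(sys) trick (see https://anonbadger.wordpress.com/2015/06/16/why-sys-setdefaultencoding-will-break-code/) or for any other encoding/decoding.

tree.pformat() displays the tree the way you want: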

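import nltk
import io

# io.open decodes the file as UTF-8, so each line is already a unicode string.
with io.open('train.txt', encoding='utf8') as trainfile:
    for line in trainfile:
        tree = nltk.tree.Tree.fromstring(line)  # parse the bracketed string
        print tree
        print
        print tree.pformat()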

Output:

(IP
  (NP
    (NP (NR \u4e0a\u6d77) (NR \u6d66\u4e1c))
    (NP (NN \u5f00\u53d1) (NP (CC \u4e0e) (NP (NN \u6cd5\u5236) (NN \u5efa\u8bbe)))))
  (VP (VV \u540c\u6b65)))

(IP
  (NP
    (NP (NR 上海) (NR 浦东))
    (NP (NN 开发) (NP (CC 与) (NP (NN 法制) (NN 建设)))))
  (VP (VV 同步)))
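
To see how the two forms in the output relate, here is a minimal sketch that uses only the standard library. It assumes Python 3 syntax (the code above is Python 2), a temporary file standing in for train.txt, and helper names (read_lines, to_ascii_escapes, from_ascii_escapes) that are made up for illustration. The escaped form resembles what Python's "backslashreplace" error handler produces when text is forced into ASCII; whether NLTK takes exactly that path internally is an assumption here, not something the answer states.

```python
import io
import os
import tempfile

# A shortened version of the sample line from the question.
SAMPLE = u"( IP ( NP ( NR \u4e0a\u6d77 ) ( NR \u6d66\u4e1c ) ) ( VP ( VV \u540c\u6b65 ) ) )"


def read_lines(path):
    # io.open decodes the bytes as UTF-8, so every line is already text.
    with io.open(path, encoding="utf8") as handle:
        return [line.rstrip(u"\n") for line in handle]


def to_ascii_escapes(text):
    # Replace every non-ASCII character with a backslash-u escape.
    return text.encode("ascii", "backslashreplace").decode("ascii")


def from_ascii_escapes(escaped):
    # Turn backslash-u escapes back into the characters they stand for.
    return escaped.encode("ascii").decode("unicode_escape")


# Hypothetical temporary file standing in for train.txt.
fd, demo_path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with io.open(demo_path, "w", encoding="utf8") as handle:
    handle.write(SAMPLE + u"\n")

lines = read_lines(demo_path)
os.remove(demo_path)

escaped = to_ascii_escapes(lines[0])
restored = from_ascii_escapes(escaped)

print(escaped)               # ASCII-only, with backslash-u escapes
print(restored == lines[0])  # the round trip recovers the original text
```

The point of the sketch is that both forms carry the same characters: the escapes are only an ASCII-safe rendering of the text. That is why no extra encode/decode calls are needed once the file has been opened with the right encoding.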
