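I have this Python script in which I use the nltk library to parse, tokenize, tag and chunk some text, let's say random text from the web.
I need to format the output of chunked1, chunked2 and chunked3 and write it to a file. These are of type class 'nltk.tree.Tree'.
More specifically, I only need to write the lines that match the regular expressions chunkGram1, chunkGram2 and chunkGram3.
How can I do that?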
#! /usr/bin/python2.7
import nltk
import re
import codecs
xstring = ["An electronic library (also referred to as digital library or digital repository) is a focused collection of digital objects that can include text, visual material, audio material, video material, stored as electronic media formats (as opposed to print, micro form, or other media), along with means for organizing, storing, and retrieving the files and media contained in the library collection. Digital libraries can vary immensely in size and scope, and can be maintained by individuals, organizations, or affiliated with established physical library buildings or institutions, or with academic institutions.[1] The electronic content may be stored locally, or accessed remotely via computer networks. An electronic library is a type of information retrieval system."]
def processLanguage():
    for item in xstring:
        tokenized = nltk.word_tokenize(item)
        tagged = nltk.pos_tag(tokenized)
        #print tokenized
        #print tagged
        chunkGram1 = r"""Chunk: {<JJ\w?>*<NN>}"""
        chunkGram2 = r"""Chunk: {<JJ\w?>*<NNS>}"""
        chunkGram3 = r"""Chunk: {<NNP\w?>*<NNS>}"""
        chunkParser1 = nltk.RegexpParser(chunkGram1)
        chunked1 = chunkParser1.parse(tagged)
        chunkParser2 = nltk.RegexpParser(chunkGram2)
        chunked2 = chunkParser2.parse(tagged)
        chunkParser3 = nltk.RegexpParser(chunkGram3)
        chunked3 = chunkParser3.parse(tagged)
        #print chunked1
        #print chunked2
        #print chunked3
        # with codecs.open('path\to\file\output.txt', 'w', encoding='utf8') as outfile:
        #     for i,line in enumerate(chunked1):
        #         if "JJ" in line:
        #             outfile.write(line)
        #         elif "NNP" in line:
        #             outfile.write(line)

processLanguage()
For now, when I try to run it I get this error:
`Traceback (most recent call last):
File "sentdex.py", line 47, in <module>
processLanguage()
File "sentdex.py", line 40, in processLanguage
outfile.write(line)
File "C:\Python27\lib\codecs.py", line 688, in write
return self.writer.write(data)
File "C:\Python27\lib\codecs.py", line 351, in write
data, consumed = self.encode(object, self.errors)
TypeError: coercing to Unicode: need string or buffer, tuple found`
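Edit: After @Alvas's answer, I managed to do what I wanted. But now I would like to know how to remove all non-ASCII characters from a text corpus. For example: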
#store cleaned file into variable
with open('path\to\file.txt', 'r') as infile:
    xstring = infile.readlines()
infile.close

def remove_non_ascii(line):
    return ''.join([i if ord(i) < 128 else ' ' for i in line])

for i, line in enumerate(xstring):
    line = remove_non_ascii(line)

#tokenize and tag text
def processLanguage():
    for item in xstring:
        tokenized = nltk.word_tokenize(item)
        tagged = nltk.pos_tag(tokenized)
        print tokenized
        print tagged

processLanguage()
The snippet above is taken from another answer on S/O, but it doesn't seem to work. What could be wrong? The error I get is:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position
not in range(128)
Answer 0 (score: 7)
First, please watch this video: https://www.youtube.com/watch?v=0Ef9GudbxXY
Now for the actual answer:
import re
import io
from nltk import pos_tag, word_tokenize, sent_tokenize, RegexpParser
xstring = u"An electronic library (also referred to as digital library or digital repository) is a focused collection of digital objects that can include text, visual material, audio material, video material, stored as electronic media formats (as opposed to print, micro form, or other media), along with means for organizing, storing, and retrieving the files and media contained in the library collection. Digital libraries can vary immensely in size and scope, and can be maintained by individuals, organizations, or affiliated with established physical library buildings or institutions, or with academic institutions.[1] The electronic content may be stored locally, or accessed remotely via computer networks. An electronic library is a type of information retrieval system."
chunkGram1 = r"""Chunk: {<JJ\w?>*<NN>}"""
chunkParser1 = RegexpParser(chunkGram1)
chunked = [chunkParser1.parse(pos_tag(word_tokenize(sent)))
           for sent in sent_tokenize(xstring)]
with io.open('outfile', 'w', encoding='utf8') as fout:
    for chunk in chunked:
        fout.write(str(chunk)+'\n\n')
[OUT]:
alvas@ubi:~$ python test2.py
Traceback (most recent call last):
File "test2.py", line 18, in <module>
fout.write(str(chunk)+'\n\n')
TypeError: must be unicode, not str
alvas@ubi:~$ python3 test2.py
alvas@ubi:~$ head outfile
(S
An/DT
(Chunk electronic/JJ library/NN)
(/:
also/RB
referred/VBD
to/TO
as/IN
(Chunk digital/JJ library/NN)
or/CC
If you must stick with Python 2.7:
with io.open('outfile', 'w', encoding='utf8') as fout:
    for chunk in chunked:
        fout.write(unicode(chunk)+'\n\n')
[OUT]:
alvas@ubi:~$ python test2.py
alvas@ubi:~$ head outfile
(S
An/DT
(Chunk electronic/JJ library/NN)
(/:
also/RB
referred/VBD
to/TO
as/IN
(Chunk digital/JJ library/NN)
or/CC
alvas@ubi:~$ python3 test2.py
Traceback (most recent call last):
File "test2.py", line 18, in <module>
fout.write(unicode(chunk)+'\n\n')
NameError: name 'unicode' is not defined
Strongly recommended if you must stick with py2.7, since it runs on both Python 2 and Python 3:
from six import text_type
with io.open('outfile', 'w', encoding='utf8') as fout:
    for chunk in chunked:
        fout.write(text_type(chunk)+'\n\n')
[OUT]:
alvas@ubi:~$ python test2.py
alvas@ubi:~$ head outfile
(S
An/DT
(Chunk electronic/JJ library/NN)
(/:
also/RB
referred/VBD
to/TO
as/IN
(Chunk digital/JJ library/NN)
or/CC
alvas@ubi:~$ python3 test2.py
alvas@ubi:~$ head outfile
(S
An/DT
(Chunk electronic/JJ library/NN)
(/:
also/RB
referred/VBD
to/TO
as/IN
(Chunk digital/JJ library/NN)
or/CC
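The question also asks to write only the parts that match the chunk grammar rather than the whole tree. One way to approach that is to walk each parse tree and keep just the subtrees labelled Chunk. The following is a minimal sketch, assuming Python 3 and NLTK 3; the helper name chunk_lines, the hand-tagged tokens and the chunks.txt filename are made up for illustration (the hand-tagged list stands in for pos_tag(word_tokenize(sent)) so the snippet runs without downloading tagger models):

```python
import io
import os
import tempfile

from nltk import RegexpParser


def chunk_lines(tree, label="Chunk"):
    """Return one 'word/TAG word/TAG' string per subtree carrying `label`."""
    return [
        " ".join(word + "/" + tag for word, tag in subtree.leaves())
        for subtree in tree.subtrees(filter=lambda t: t.label() == label)
    ]


# Hand-tagged tokens (an assumption for this demo only).
tagged = [
    ("An", "DT"), ("electronic", "JJ"), ("library", "NN"), ("is", "VBZ"),
    ("a", "DT"), ("focused", "JJ"), ("collection", "NN"), ("of", "IN"),
    ("digital", "JJ"), ("objects", "NNS"),
]

parser = RegexpParser(r"""Chunk: {<JJ\w?>*<NN>}""")
lines = chunk_lines(parser.parse(tagged))

# Write only the matched chunks, one per line (hypothetical output path).
out_path = os.path.join(tempfile.gettempdir(), "chunks.txt")
with io.open(out_path, "w", encoding="utf8") as fout:
    for line in lines:
        fout.write(line + "\n")
```

Iterating directly over a Tree yields a mix of (word, tag) tuples and nested subtrees, which is why the original attempt hit "need string or buffer, tuple found"; filtering with subtrees() sidesteps that. Note that "digital objects" is not written here, because NNS does not match <NN> in chunkGram1.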
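Answer 1 (score: 5)
Your code has several problems, but the main culprit is that your for loop does not modify the contents of xstring.
I will address all the issues in your code here: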
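You cannot write a path like this with single \ characters, because \t will be interpreted as a tab and \f as a form feed. You must double them. I know this is just an example, but this kind of confusion comes up often: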
with open('path\\to\\file.txt', 'r') as infile:
    xstring = infile.readlines()
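The effect of the escapes can be checked directly. A small sketch comparing the three spellings of the same example path; the raw-string form is an alternative not used in the original code:

```python
# Single backslashes: \t becomes a TAB and \f a FORM FEED character.
plain = 'path\to\file.txt'

# Doubled backslashes: each \\ is one literal backslash.
doubled = 'path\\to\\file.txt'

# Raw string: backslashes are kept as written.
raw = r'path\to\file.txt'

print(repr(plain))    # shows the embedded control characters
print(doubled == raw)
```

On Windows, forward slashes in the path are generally accepted by open() as well, which avoids the escaping question altogether.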
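The infile.close line in your code is wrong. It does not call the close method; it doesn't actually do anything. Besides, your file has already been closed by the with statement. If you see this line in any answer anywhere, please downvote that answer outright, with a comment saying that file.close is wrong and should be file.close().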
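Both points are easy to verify with the file object's closed attribute. A short sketch, assuming Python 3 and using a throwaway temporary file in place of the real input:

```python
import os
import tempfile

# Create an empty temporary file to stand in for the real input file.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)

infile = open(path)
infile.close                            # no parentheses: only looks up the method
closed_after_attribute = infile.closed  # still open at this point
infile.close()                          # the actual call
closed_after_call = infile.closed

with open(path) as infile2:
    lines = infile2.readlines()
closed_after_with = infile2.closed      # the with block already closed it

os.remove(path)
print(closed_after_attribute, closed_after_call, closed_after_with)
```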
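The following should work, but be aware that it replaces every non-ASCII character with ' ', which breaks words such as naïve and café:
def remove_non_ascii(line):
    return ''.join([i if ord(i) < 128 else ' ' for i in line])
But this is why your code fails with a Unicode exception: you are not modifying the elements of xstring at all. You do compute each line with the non-ASCII characters removed, but that is a new value that is never stored back into the list:
for i, line in enumerate(xstring):
    line = remove_non_ascii(line)
Instead it should be:
for i, line in enumerate(xstring):
    xstring[i] = remove_non_ascii(line)
or, my preferred and very Pythonic version:
xstring = [remove_non_ascii(line) for line in xstring]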
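These Unicode errors occur mainly because you are using Python 2.7 to handle Unicode text, an area where recent Python 3 versions are far ahead. If you are just starting out on this task, I recommend upgrading to Python 3.4+ soon.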
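If replacing accented letters with spaces is too lossy, one alternative (not part of the original code) is to decompose characters with the standard library's unicodedata module before discarding whatever is still non-ASCII. A sketch assuming Python 3; to_ascii is a made-up helper name and the sample lines are invented:

```python
import unicodedata


def to_ascii(text):
    """Strip accents where possible, then drop any remaining non-ASCII characters."""
    # NFKD splits e.g. 'é' into 'e' plus a combining accent mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding with errors="ignore" discards the combining marks and any
    # character that has no ASCII equivalent (such as curly quotes).
    return decomposed.encode("ascii", "ignore").decode("ascii")


lines = [u"na\u00efve caf\u00e9\n", u"it\u2019s plain text\n"]
cleaned = [to_ascii(line) for line in lines]
print(cleaned)
```

In Python 3 the input file would be opened with an explicit encoding, for example io.open(path, encoding='utf8'), so that each line is already decoded text before it reaches to_ascii.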