I'm new to Python, and I'm trying to use the following code to print every token that is tagged as a location in an .xml file to a .txt file:
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('exercise-ner.xml', 'r'))
tokenlist = soup.find_all('token')
output = ''
for x in tokenlist:
    readeachtoken = x.ner.encode_contents()
    checktoseeifthetokenisalocation = x.ner.encode_contents().find("LOCATION")
    if checktoseeifthetokenisalocation != -1:
        output += "\n%s" % x.word.encode_contents()
z = open('exercise-places.txt','w')
z.write(output)
z.close()
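The program works and produces a list of all the tokens tagged as locations, each printed on its own line in the output file. What I would like to do, however, is modify the program so that whenever BeautifulSoup finds two or more adjacent tokens tagged as locations, it prints those tokens on the same line of the output file. Does anyone know how I could modify my code to accomplish this? I'd be grateful for any suggestions.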
Answer 0 (score: 0)
This question is quite old, but I only just got your note, @Amanda, so I thought I'd post my approach in case it helps anyone else:
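import glob, codecs
from bs4 import BeautifulSoup

with codecs.open("washington_locations.txt", "w", "utf-8") as out:
    for i in glob.glob("/afs/crc.nd.edu/user/d/dduhaime/java/stanford-corenlp-full-2015-01-29/processed_washington_correspondence/*.xml"):
        locations = []
        location_string = u''  # words of the LOCATION run currently being read
        with codecs.open(i, 'r', 'utf-8') as f:
            # html.parser lowercases tag names, so token.ner and token.word resolve
            soup = BeautifulSoup(f.read(), "html.parser")
        tokens = soup.find_all('token')
        for token in tokens:
            if token.ner.string == "LOCATION":
                # still inside a run of adjacent locations: keep collecting words
                location_string += token.word.string + u" "
            elif location_string:
                # the run just ended: store it as one multi-word location
                locations.append(location_string.strip())
                location_string = u''
        # flush a run that reaches the end of the file
        if location_string:
            locations.append(location_string.strip())
        # one line per file: the file path, then its locations, tab-separated
        out.write(i + "\t" + "\t".join(locations) + "\n")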
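
In case it helps to see the grouping step on its own, here is a minimal sketch of the same idea using itertools.groupby instead of a running string. It is not the code I actually ran: group_locations and the sample list are made up for illustration, and the (word, tag) pairs stand in for the token.word.string and token.ner.string values from the loop above.

```python
from itertools import groupby


def group_locations(tagged_tokens):
    """Join each run of adjacent LOCATION tokens into a single string.

    tagged_tokens is an iterable of (word, ner_tag) pairs (an assumed
    stand-in for the values read from each <token> element).
    """
    runs = []
    # groupby yields one group per run of consecutive pairs with the same key,
    # so adjacent LOCATION tokens end up together in one group.
    for is_location, run in groupby(tagged_tokens, key=lambda pair: pair[1] == "LOCATION"):
        if is_location:
            runs.append(" ".join(word for word, _ in run))
    return runs


# Hypothetical sample data, not taken from any real file.
sample = [
    ("She", "O"), ("left", "O"),
    ("New", "LOCATION"), ("York", "LOCATION"),
    ("for", "O"), ("Boston", "LOCATION"),
]

# Print one location (possibly multi-word) per line.
print("\n".join(group_locations(sample)))
```

Because groupby closes a group whenever the key changes or the input ends, a location at the very end of a file needs no special handling here.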