How to modify Python code to print multiple adjacent "location" tokens on a single line of output

Time: 2012-10-07 23:24:41

Tags: python beautifulsoup

I am new to Python, and I am trying to use the following code to print every token tagged as a location in an .xml file to a .txt file:

from bs4 import BeautifulSoup

# Name the parser explicitly; html.parser ships with Python.
with open('exercise-ner.xml', 'r') as f:
    soup = BeautifulSoup(f, 'html.parser')

tokenlist = soup.find_all('token')

output = ''

for x in tokenlist:
    # Append the word on its own line when the token's NER tag is LOCATION.
    if "LOCATION" in x.ner.get_text():
        output += "\n%s" % x.word.get_text()

with open('exercise-places.txt', 'w') as z:
    z.write(output)

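The program works and spits out a list of all tokens tagged as locations, each printed on its own line in the output file. What I would like to do, however, is modify my program so that whenever Beautiful Soup finds two or more adjacent tokens tagged as locations, it prints them on the same line of the output file. Does anyone know how I could modify my code to accomplish this? I would be grateful for any suggestions.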

1 Answer:

Answer 0: (score: 0)

This question is quite old, but I just got your note @Amanda, so I thought I would post my approach in case it helps anyone else:

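import codecs
import glob

from bs4 import BeautifulSoup

with codecs.open("washington_locations.txt", "w", "utf-8") as out:
    for path in glob.glob("/afs/crc.nd.edu/user/d/dduhaime/java/stanford-corenlp-full-2015-01-29/processed_washington_correspondence/*.xml"):
        locations = []  # completed location phrases found in this file
        current = []    # words in the run of LOCATION tokens being built

        with codecs.open(path, "r", "utf-8") as f:
            # Name the parser explicitly; html.parser ships with Python.
            soup = BeautifulSoup(f.read(), "html.parser")

        for token in soup.find_all("token"):
            if token.ner.string == "LOCATION":
                current.append(token.word.string)
            elif current:
                # A non-location token ends the current run.
                locations.append(" ".join(current))
                current = []

        # Flush a run that reaches the end of the file, so it is not lost
        # or carried over into the next file.
        if current:
            locations.append(" ".join(current))

        # One line per file: the path, then each location phrase, tab-separated.
        out.write(path + "\t" + "\t".join(locations) + "\n")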
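
The grouping step can also be looked at on its own, separately from the XML parsing. Here is a minimal sketch using itertools.groupby from the standard library, assuming the tokens have already been pulled out as (word, NER tag) pairs. The function name group_locations and the sample data are hypothetical, made up for illustration, and are not from the original files.

```python
from itertools import groupby


def group_locations(tokens, label="LOCATION"):
    """Join each run of adjacent tokens tagged `label` into one string.

    `tokens` is an iterable of (word, ner_tag) pairs.
    """
    phrases = []
    # groupby splits the sequence into runs of consecutive pairs that share
    # the same key; here the key is "is this token tagged as a location?".
    for is_location, run in groupby(tokens, key=lambda pair: pair[1] == label):
        if is_location:
            phrases.append(" ".join(word for word, _ in run))
    return phrases


# Hypothetical sample data, not taken from the original XML files.
sample = [
    ("He", "O"), ("left", "O"),
    ("New", "LOCATION"), ("York", "LOCATION"),
    ("for", "O"),
    ("Boston", "LOCATION"),
    ("yesterday", "O"),
    ("Mount", "LOCATION"), ("Vernon", "LOCATION"),
]

# One location phrase per line, adjacent location tokens joined by a space.
print("\n".join(group_locations(sample)))
```

With Beautiful Soup, the pairs could probably be built with something like [(t.word.string, t.ner.string) for t in soup.find_all("token")]. Because groupby closes the last run by itself, a location at the very end of the input needs no separate flush.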