I am a Python beginner. I have a text file like the one below, which contains thousands of documents (from id = 1 to id = 10000):
<doc id=1>
<label>1</label>
<summary>
I think you are right
</summary>
<short_text>
I think you are right. Because I have once read the book in the same topic.
</short_text>
</doc>
Is there a convenient way to read the text file and store its contents in instances of the following class?
class ShortText:
    def __init__(self, my_id, human_label, summary, short_text):
        self.id = my_id
        self.human_label = human_label
        self.summary = summary
        self.short_text = short_text
    def __str__(self):
        '''
        For printing purposes.
        '''
        return '%d\t%s\t%s\t%s' % (self.id, self.human_label, self.summary, self.short_text)

def load_file(filename):
    # retrieve the original text
    with codecs.open(filename, encoding='utf-8') as f:
        data = f.read()
    # how to get values from tags and put it below?
    my_id =
    human_label =
    summary =
    short_text =
    instances[my_id] = ShortText(my_id, human_label, summary, short_text)
    return instances
Answer 0: (score: 1)
If you can treat your data as XML fragments, you could try the lxml library:
test.py:
from lxml import etree
a = etree.fromstring("<test>Hello</test>")
print(a.text)
Result:
$ python test.py
Hello
To read from a file:
>>> tree = etree.parse(some_file_or_file_like_object)
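One caveat with the XML route: the sample in the question is not strictly well-formed XML, because the attribute in <doc id=1> is unquoted and the file holds many top-level <doc> elements rather than a single root. The following is only a minimal sketch of one way around that. It uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra installs, and it assumes the text never contains a raw & or < character. The names SAMPLE and parse_docs are hypothetical, introduced purely for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample in the same shape as the file shown in the question.
SAMPLE = """<doc id=1>
<label>1</label>
<summary>
I think you are right
</summary>
<short_text>
I think you are right. Because I have once read the book in the same topic.
</short_text>
</doc>
<doc id=2>
<label>0</label>
<summary>
Not sure
</summary>
<short_text>
Not sure about this one.
</short_text>
</doc>"""


def parse_docs(raw):
    """Return a dict mapping each doc id to a dict of its fields."""
    # Quote the bare attribute value: <doc id=1> becomes <doc id="1">.
    quoted = re.sub(r'<doc id=(\d+)>', r'<doc id="\1">', raw)
    # XML allows exactly one root element, so wrap all the fragments in one.
    root = ET.fromstring('<root>' + quoted + '</root>')
    docs = {}
    for doc in root.iter('doc'):
        doc_id = int(doc.get('id'))
        docs[doc_id] = {
            'label': int(doc.findtext('label').strip()),
            # strip() drops the newlines that surround each field's text.
            'summary': doc.findtext('summary').strip(),
            'short_text': doc.findtext('short_text').strip(),
        }
    return docs


docs = parse_docs(SAMPLE)
print(docs[1]['summary'])
```

Each dictionary could then be passed on to the ShortText constructor from the question. If the real data does contain characters that are special in XML, a lenient HTML parser is likely the safer choice.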
Answer 1: (score: 1)
BeautifulSoup solved the problem.
import codecs
from bs4 import BeautifulSoup
class ShortText:
    def __init__(self, my_id, human_label, summary, short_text):
        self.id = my_id
        self.human_label = human_label
        self.summary = summary
        self.short_text = short_text
    def __str__(self):
        '''
        For printing purposes.
        '''
        return '%d\t%d\t%s\t%s' % (self.id, self.human_label, self.summary, self.short_text)

def load_file(filename):
    # retrieve the original text
    with codecs.open(filename, encoding='utf-8') as f:
        data = f.read()
    # use BeautifulSoup to get tag attributes and elements;
    # naming the parser explicitly avoids the "no parser specified" warning
    soup = BeautifulSoup(data, 'html.parser')
    tags = soup.find_all('doc')
    # store in a dictionary with ShortText instances as values, keyed by doc id
    instances = {}
    for t in tags:
        # take the id from the <doc id=...> attribute
        my_id = int(t['id'])
        # the sample file uses a <label> tag, so read t.label
        human_label = int(t.label.get_text())
        # strip() trims the surrounding newlines but keeps the spaces between words
        summary = t.summary.get_text().strip()
        short_text = t.short_text.get_text().strip()
        instances[my_id] = ShortText(my_id, human_label, summary, short_text)
    return instances
Thanks, everyone!
Answer 2: (score: -1)
Try this. You may see '\n' characters, which are just line breaks; the third line of code removes them if necessary:
from bs4 import BeautifulSoup
d = BeautifulSoup(data, 'html.parser')
d = d.text.replace('\n','')
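Note that deleting '\n' outright glues together words that were separated only by a line break. As a small sketch of an alternative, using plain string methods rather than BeautifulSoup, the hypothetical helper normalize_whitespace below collapses every run of whitespace into a single space; the raw string is made-up sample text in the shape of the question's data.

```python
def normalize_whitespace(text):
    """Collapse each run of whitespace (spaces, tabs, newlines) into one space."""
    # str.split() with no arguments splits on any whitespace run and
    # discards leading and trailing whitespace.
    return ' '.join(text.split())


# Made-up text with line breaks, similar to what .text returns for one field.
raw = "\nI think you are right.\nBecause I have once read the book in the same topic.\n"

glued = raw.replace('\n', '')
print(glued)                      # the two sentences run together
print(normalize_whitespace(raw))  # the sentences stay separated by a space
```

This approach still flattens the whole file into one string, so it fits best when only the plain text is needed and not the per-document fields.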