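How can I extract the information from the attached HTML and save the following to a text file: ParagraphID \t TokenID \t TokenCoordinates \t TokenContent
So, for example, the first lines should look like this: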
T102633 1 109,18,110,18 IV
T102634 1 527,29,139,16 Seit
...
I want to use Python. At the moment, I have the following:
    import lxml.etree
    import lxml.html

    root = lxml.html.parse('html-file').getroot()
    # Select only the outermost table.main elements (skip nested ones)
    tables = root.xpath('//table[@class="main" and not(ancestor::table[@class="main"])]')

    # Replace every finereader span with its own text
    for elem in root.xpath("//span[@class='finereader']"):
        text = (elem.text or "") + (elem.tail or "")
        previous = elem.getprevious()
        if previous is not None:  # If there's a previous node
            previous.tail = (previous.tail or "") + text  # append to its tail
        else:
            parent = elem.getparent()  # Otherwise use the parent
            parent.text = (parent.text or "") + text  # and append to its text
        elem.getparent().remove(elem)

    # tostring() returns bytes here, so join with a bytes separator
    txt = [lxml.etree.tostring(t, method="html", encoding="utf-8") for t in tables]
    text = b"\n".join(txt)
    # output is a text file object opened for writing elsewhere
    output.write(text.decode("utf-8"))
This gives me something like this:
[:T102633-1 coord="109,18,110,18":]IV[:/T102633-1:]
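Now, obviously I could use the string find method to extract the information I want. But isn't there a more elegant solution, using ".attrib" or something similar? Thanks for your help!
The HTML can be found here: http://tinyurl.com/qjvsp4n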
Answer 0 (score: 0)
This code, which uses BeautifulSoup, gives you all the spans you are interested in:
    from bs4 import BeautifulSoup

    with open('html_file') as html_file:
        soup = BeautifulSoup(html_file, 'html.parser')

    table = soup.find('table', attrs={'class': 'main'})
    # The first two tr's don't seem to contain the info you need,
    # so get rid of them
    rows = table.find_all('tr')[2:]
    for row in rows:
        # The tokens sit in the second cell of each row
        data = row.find_all('td')[1]
        span_element = data.find_all('span')
        for ele in span_element:
            print(ele.text)
Once you have the data in the format [:T102639-3 coord="186,15,224,18":]L.[:/T102639-3:], use Python's re module to pull out the content.
    import re

    pattern = re.compile(r'\[:(.*):\](.*)\[:/(.*):\]')
    # Single quotes outside, because the data itself contains double quotes
    data = '[:T102639-3 coord="186,15,224,18":]L.[:/T102639-3:]'
    res = pattern.search(data)
    # res.group(1).split()[0] then gives 'T102639-3'
    # res.group(1).split()[1] gives 'coord="186,15,224,18"'
    # res.group(2) gives 'L.'
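
To get from there to the tab-separated lines asked for in the question, something like the following might work. It is only a sketch: it assumes every token looks like the samples above (an ID of the form ParagraphID-TokenID, then a coord="..." part, then the content, then the closing marker), and the names TOKEN_RE and tokens_to_tsv are made up for this example. It uses finditer with non-greedy groups so that several tokens in one string are matched separately.

```python
import re

# Assumed token format, taken from the samples in this thread:
# [:<ParagraphID>-<TokenID> coord="<coordinates>":]<content>[:/<ParagraphID>-<TokenID>:]
TOKEN_RE = re.compile(
    r'\[:(?P<para>[^\s\-]+)-(?P<tok>\d+)\s+coord\s*=\s*"\s*(?P<coord>[^"]*?)\s*":\]'
    r'(?P<content>.*?)'
    r'\[:/\s*(?P=para)-(?P=tok):\]',
    re.DOTALL,
)


def tokens_to_tsv(text):
    """Return one tab-separated line per token found in text."""
    lines = []
    for m in TOKEN_RE.finditer(text):
        fields = [
            m.group("para"),             # paragraph ID, e.g. T102633
            m.group("tok"),              # token ID, e.g. 1
            m.group("coord"),            # coordinates, e.g. 109,18,110,18
            m.group("content").strip(),  # token content, e.g. IV
        ]
        lines.append("\t".join(fields))
    return lines


# Hypothetical input: two tokens in the format shown above
sample = ('[:T102633-1 coord="109,18,110,18":]IV[:/T102633-1:]'
          '[:T102634-1 coord="527,29,139,16":]Seit[:/T102634-1:]')
for line in tokens_to_tsv(sample):
    print(line)
```

The returned list can be joined with "\n" and written to a text file. If the real markup deviates from the samples (for example, tokens that contain the closing marker themselves), the pattern would need adjusting.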