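I am following this question and this other one to parse a table from Wikipedia.
Specifically, I want to get all the rows and, for each row, dump the contents of every column.
My code uses the xml library under Mac OS X, but all I get is a list of empty rows.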
import xml.etree.ElementTree

# Read the saved table markup from disk.
s = open("wikiactors20century.txt", "r").read()

# First attempt (commented out):
# tree = xml.etree.ElementTree.fromstring(s)
# rows = tree.findall()
# headrow = rows[0]
# datarows = rows[1:]
#
# for num, h in enumerate(headrow):
#     data = ", ".join([row[num].text for row in datarows])
#     print "{0:<16}: {1}".format(h.text, data)

# Second attempt: treat the first child as the header row, the rest as data rows.
table = xml.etree.ElementTree.XML(s)
rows = iter(table)
headers = [col.text for col in next(rows)]
for row in rows:
    values = [col.text for col in row]
    print(dict(zip(headers, values)))
The input file has been pasted here in PasteBin. Neither the xml.etree.ElementTree.fromstring version nor the xml.etree.ElementTree.XML version manages to retrieve the list of rows. However, if I make a dummy table
s = "<table> <tr><td>a</td><td>1</td></tr> <tr><td>b</td><td>2</td></tr> <tr><td>c</td><td>3</td></tr> </table>"
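then parsing works fine.
What am I doing wrong? Do I have to do some cleanup before parsing the file?
Answer 0 (score: 1)
Your code assumes a structure different from that of the Wikipedia example. The direct children of the table element are not rows: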
>>> list(table)
[<Element 'thead' at 0x7ff0fdb73f50>, <Element 'tbody' at 0x7ff0fdb78590>, <Element 'tfoot' at 0x7ff0fb995a90>]
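To illustrate why iterating over the table directly yields no usable rows, here is a minimal sketch. It uses a made-up miniature table (the `SAMPLE` string is an assumption, not the actual Wikipedia markup) with the same thead/tbody/tfoot layout, and compares `findall("tr")`, which only inspects direct children, with `iter("tr")`, which walks the whole subtree.

```python
import xml.etree.ElementTree as ET

# Made-up miniature table standing in for the Wikipedia markup (assumption).
SAMPLE = (
    "<table>"
    "<thead><tr><th>Actor</th><th>Born</th></tr></thead>"
    "<tbody>"
    "<tr><td>A</td><td>1990</td></tr>"
    "<tr><td>B</td><td>1977</td></tr>"
    "</tbody>"
    "<tfoot><tr><td>total</td><td>2</td></tr></tfoot>"
    "</table>"
)

table = ET.fromstring(SAMPLE)

# The direct children are the section wrappers, not the rows.
child_tags = [child.tag for child in table]

# findall("tr") only matches direct children, so it finds nothing here.
direct_rows = table.findall("tr")

# iter("tr") visits every descendant, so it finds rows at any depth.
all_rows = list(table.iter("tr"))

print(child_tags)                        # ['thead', 'tbody', 'tfoot']
print(len(direct_rows), len(all_rows))   # 0 4
```

In the flat dummy table from the question the rows are direct children of the table, which is likely why that case parses as expected.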
You can get the header names from the first row of thead with:
>>> columns = list(k.text for k in table[0][0])
Then build the data table, one dict per row of tbody:
>>> import json
>>> data_table = list(dict(zip(columns, list(v.text for v in row))) for row in table[1])
>>> print(json.dumps(data_table, indent=2))
[
  {
    "L,S": "L",
    "Cause of death": "~",
    "null": "F",
    "Noms": "1",
    "Wins": "0",
    "Age": "26",
    "Actor": null,
    "Born": "1990",
    "Film": null,
    "Last": "~",
    "WoF": "~",
    "Died": "~",
    "First": "2001"
  },
  {
    "L,S": "1L,1S",
    "Cause of death": "~",
    "null": "M",
    "Noms": "2",
    "Wins": "0",
    "Age": "39",
    "Actor": null,
    "Born": "1977",
    "Film": null,
    "Last": "~",
    "WoF": "~",
    "Died": "~",
    "First": "2001"
  },
  [...]
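Note: cells that contain links or other inner markup are not parsed correctly (they show up as null above, because .text only returns the text before the first child element). This can be solved with itertext or with deeper parsing.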
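As a rough sketch of the itertext approach mentioned in the note, the snippet below joins every text fragment inside a cell. The `cell_text` helper and the sample row are made up for illustration and are not taken from the actual input file.

```python
import xml.etree.ElementTree as ET


def cell_text(cell):
    """Join every text fragment inside a cell, including text in nested tags."""
    # cell_text is a hypothetical helper name, not part of the original answer.
    return "".join(cell.itertext()).strip()


# Made-up row (assumption): the first cell wraps its text in a link.
row = ET.fromstring(
    '<tr><td><a href="/wiki/Example">Some Actor</a></td><td>1990</td></tr>'
)

plain = [cell.text for cell in row]          # text before the first child only
full = [cell_text(cell) for cell in row]     # all nested text

print(plain)  # [None, '1990']
print(full)   # ['Some Actor', '1990']
```

Swapping `v.text` for a helper like this when building the row dicts should fill in columns such as "Actor" and "Film", assuming their cells contain only links and plain text.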