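Extracting columns from XML tables with Beautiful Soup

Time: 2017-06-15 06:24:19

Tags: python html xml beautifulsoup

I have multiple tables like the ones below, taken from a MySQL data dump, where each table represents one row in the database. I want to extract the following information so that I can migrate it to a different database.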

<table name="dashboard">
  <column name="id">1</column>
  <column name="timestamp">2009-10-09 15:10:30</column>
  <column name="config_offline">1</column>
  <column name="item1">0.00</column>
  <column name="item2">0.00</column>
</table>

<table name="orders">
  <column name="id">1</column>
  <column name="timestamp">2016-08-04 08:39:13</column>
  <column name="item">1</column>
  <column name="payment">Check</column>
  <column name="cost">175.00</column>
  <column name="paid">175.00</column>
  <column name="cancel">0</column>
  <column name="received">1</column>
</table>

Here is what I am currently trying:

from bs4 import BeautifulSoup

# The "xml" parser requires lxml to be installed.
with open("test.xml", "r") as markup:
    soup = BeautifulSoup(markup, "xml")

# First attempt: prints every column value, with no indication of its table.
for row in soup.find_all('column'):
    print(row.text)

# I also tried this, but it doesn't work either.
for row in soup.find_all('table'):
    for c in row.find_all('column'):
        print(c.text)

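The problem with this approach is that I can't tell which table each value belongs to. Is there a way to extract the information from the two tables separately?

2 answers:

Answer 0 (score: 1)

You can find a specific table by one of its attributes: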

import bs4
div_test="""  
<table name="dashboard">
  <column name="id">1</column>
  <column name="timestamp">2009-10-09 15:10:30</column>
  <column name="config_offline">1</column>
  <column name="item1">0.00</column>
  <column name="item2">0.00</column>
</table>
<table name="orders">
  <column name="id">1</column>
  <column name="timestamp">2016-08-04 08:39:13</column>
  <column name="item">1</column>
  <column name="payment">Check</column>
  <column name="cost">175.00</column>
  <column name="paid">175.00</column>
  <column name="cancel">0</column>
  <column name="received">1</column>
</table>
"""
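# Name the parser explicitly; "html.parser" ships with Python.
soup = bs4.BeautifulSoup(div_test, "html.parser")
# "name" is also find()'s first parameter, so pass the attribute filter as a dict.
table_dashboard = soup.find('table', {'name': "dashboard"})
table_orders = soup.find('table', {'name': "orders"})
print(table_dashboard)
print('\n')
print(table_orders)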

The output gives you table_dashboard and table_orders:

<table name="dashboard">
<column name="id">1</column>
<column name="timestamp">2009-10-09 15:10:30</column>
<column name="config_offline">1</column>
<column name="item1">0.00</column>
<column name="item2">0.00</column>
</table>


<table name="orders">
<column name="id">1</column>
<column name="timestamp">2016-08-04 08:39:13</column>
<column name="item">1</column>
<column name="payment">Check</column>
<column name="cost">175.00</column>
<column name="paid">175.00</column>
<column name="cancel">0</column>
<column name="received">1</column>
</table>
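
Once a table has been found this way, its columns can be turned into a name/value mapping that is easier to migrate. The following is only a minimal sketch: the helper name table_to_dict and the shortened sample markup are assumptions made for illustration, and "html.parser" is used so the snippet runs without lxml (the "xml" parser from the question needs lxml).

```python
from bs4 import BeautifulSoup

# Shortened sample data, modelled on the tables in the question.
markup = """
<table name="dashboard">
  <column name="id">1</column>
  <column name="item1">0.00</column>
</table>
<table name="orders">
  <column name="id">1</column>
  <column name="payment">Check</column>
</table>
"""

soup = BeautifulSoup(markup, "html.parser")


def table_to_dict(soup, table_name):
    """Hypothetical helper: map column name -> text for one named table."""
    table = soup.find("table", {"name": table_name})
    if table is None:
        return {}
    # col["name"] reads the XML attribute; col.name would be the tag name.
    return {col["name"]: col.get_text() for col in table.find_all("column")}


orders = table_to_dict(soup, "orders")
print(orders)
```

Note that all values come back as strings; converting them to numbers or timestamps would be a separate step before inserting into the target database.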

Answer 1 (score: 0)

It seems obvious... iterate over the "table" tags first, then, for each "table" tag, iterate over the "column" tags inside it.
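
The nested iteration described above could look roughly like the sketch below, which groups rows by the table's name attribute so that values from "dashboard" and "orders" stay separate. The variable name rows_by_table and the shortened sample markup are assumptions for illustration; "html.parser" is used so the snippet runs without lxml.

```python
from collections import defaultdict

from bs4 import BeautifulSoup

# Shortened sample data, modelled on the tables in the question.
markup = """
<table name="dashboard">
  <column name="id">1</column>
  <column name="item1">0.00</column>
</table>
<table name="orders">
  <column name="id">1</column>
  <column name="payment">Check</column>
</table>
"""

soup = BeautifulSoup(markup, "html.parser")

# Each <table> element is one database row, so collect a list of rows
# under each table name.
rows_by_table = defaultdict(list)
for table in soup.find_all("table"):
    row = {col["name"]: col.get_text() for col in table.find_all("column")}
    rows_by_table[table["name"]].append(row)

for table_name, rows in rows_by_table.items():
    print(table_name, rows)
```

Reading the table name from table["name"] inside the outer loop is what distinguishes the two tables; the question's second attempt already had the right loop structure but printed only the column text.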