Question

I have an HTML structure like this:

<table class="info" id="stats">
 <tbody>
  <tr>
   <th> Brand </th>
   <td> 2 Guys Smoke Shop </td>
  </tr>
  <tr>
   <th> Blend Type </th>
   <td> Aromatic </td>
  </tr>
  <tr>
   <th> Contents </th>
   <td> Black Cavendish, Virginia </td>
  </tr>
  <tr>
   <th> Flavoring </th>
   <td> Other / Misc </td>
  </tr>
 </tbody>
</table>

These attributes are not always present: sometimes I only get Brand, other times Brand and Flavoring.

To scrape this, I wrote code like this:

BlendInfo = namedtuple('BlendInfo', ['brand', 'type', 'contents', 'flavoring'])
stats_rows =  soup.find('table', id='stats').find_all('tr')
bi = BlendInfo(brand      = stats_rows[1].td.get_text(),
               type       = stats_rows[2].td.get_text(),
               contents   = stats_rows[3].td.get_text(),
               flavoring  = stats_rows[4].td.get_text())

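But as expected, when the table is ordered differently (Type before Brand) or some rows are missing (no Contents), it fails with an index out of bounds error (or the values get badly mixed up).

Is there a better way to do this, something like:

"give me the data from the row whose header is the string 'Brand'"?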

Answer 1

It's definitely possible. Take a look at this:

from bs4 import BeautifulSoup

html_content='''
<table class="info" id="stats">
 <tbody>
  <tr>
   <th> Brand </th>
   <td> 2 Guys Smoke Shop </td>
  </tr>
  <tr>
   <th> Blend Type </th>
   <td> Aromatic </td>
  </tr>
  <tr>
   <th> Contents </th>
   <td> Black Cavendish, Virginia </td>
  </tr>
  <tr>
   <th> Flavoring </th>
   <td> Other / Misc </td>
  </tr>
 </tbody>
</table>
'''
soup = BeautifulSoup(html_content,"lxml")
for item in soup.find_all(class_='info')[0].find_all("th"):
    header = item.text
    rows = item.find_next_sibling().text
    print(header,rows)

Output:

 Brand   2 Guys Smoke Shop 
 Blend Type   Aromatic 
 Contents   Black Cavendish, Virginia 
 Flavoring   Other / Misc
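
To answer the "give me the row whose header is 'Brand'" part directly, the same th / next-sibling idea can be wrapped in a small lookup function. This is only a sketch: value_for_header is a hypothetical helper name, the shortened table is made up for illustration, and the match is assumed to be case-insensitive with surrounding whitespace ignored.

```python
from bs4 import BeautifulSoup


def value_for_header(table, header):
    """Return the stripped <td> text of the row whose <th> matches header, else None.

    Hypothetical helper (not part of BeautifulSoup); matching ignores case
    and surrounding whitespace.
    """
    wanted = header.strip().lower()
    for th in table.find_all("th"):
        if th.get_text(strip=True).lower() == wanted:
            td = th.find_next_sibling("td")
            return td.get_text(strip=True) if td is not None else None
    return None  # header not present in this table


# Shortened sample: only two of the four possible rows are present.
html = """<table class="info" id="stats">
 <tr><th> Brand </th><td> 2 Guys Smoke Shop </td></tr>
 <tr><th> Flavoring </th><td> Other / Misc </td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", id="stats")

brand = value_for_header(table, "Brand")
contents = value_for_header(table, "Contents")  # row is missing
print(brand, contents)
```

Because the lookup goes by header text rather than by position, a missing or reordered row gives None instead of an IndexError.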

Answer 2

This builds a dictionary for you:

from bs4 import BeautifulSoup

# Lower-cased header texts to keep (the table says "Blend Type", not "Type")
valid_headers = ['brand', 'blend type', 'contents', 'flavoring']

t = """<table class="info" id="stats">
 <tbody>
  <tr>
   <th> Brand </th>
   <td> 2 Guys Smoke Shop </td>
  </tr>
  <tr>
   <th> Blend Type </th>
   <td> Aromatic </td>
  </tr>
  <tr>
   <th> Contents </th>
   <td> Black Cavendish, Virginia </td>
  </tr>
  <tr>
   <th> Flavoring </th>
   <td> Other / Misc </td>
  </tr>
 </tbody>
</table>"""

bs = BeautifulSoup(t, "html.parser")

results = {}
for row in bs.find_all('tr'):
    th = row.find('th')
    td = row.find('td')
    if th is None or td is None:
        continue  # skip rows without a header/value pair
    header = th.get_text(strip=True)
    if header.lower() in valid_headers:
        results[header] = td.get_text(strip=True)

print(results)
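
If you want the BlendInfo namedtuple from the question rather than a plain dictionary, one option is to map header text to field names and default any missing field to None. This is a rough sketch, not a definitive implementation: HEADER_TO_FIELD and parse_blend_info are hypothetical names, and the sample table is made up to show the reordered and missing rows the question describes.

```python
from collections import namedtuple

from bs4 import BeautifulSoup

BlendInfo = namedtuple("BlendInfo", ["brand", "type", "contents", "flavoring"])

# Assumed mapping from lower-cased header text to namedtuple field name.
HEADER_TO_FIELD = {
    "brand": "brand",
    "blend type": "type",
    "contents": "contents",
    "flavoring": "flavoring",
}


def parse_blend_info(html):
    """Build a BlendInfo from the stats table; absent rows become None."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id="stats")
    fields = dict.fromkeys(BlendInfo._fields)  # every field starts as None
    if table is not None:
        for row in table.find_all("tr"):
            th, td = row.find("th"), row.find("td")
            if th is None or td is None:
                continue  # not a header/value row
            field = HEADER_TO_FIELD.get(th.get_text(strip=True).lower())
            if field is not None:
                fields[field] = td.get_text(strip=True)
    return BlendInfo(**fields)


# Rows in a different order, with Contents and Flavoring missing.
sample = """<table class="info" id="stats">
 <tr><th> Blend Type </th><td> Aromatic </td></tr>
 <tr><th> Brand </th><td> 2 Guys Smoke Shop </td></tr>
</table>"""

bi = parse_blend_info(sample)
print(bi)
```

Row order no longer matters here, and callers can check for None to tell that a row was absent.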

Get a row by its header using BeautifulSoup

2 answers: