Get rows by header using BeautifulSoup

Time: 2017-11-16 13:19:41

Tags: python web-scraping beautifulsoup

I have an HTML structure like this:

<table class="info" id="stats">
 <tbody>
  <tr>
   <th> Brand </th>
   <td> 2 Guys Smoke Shop </td>
  </tr>
  <tr>
   <th> Blend Type </th>
   <td> Aromatic </td>
  </tr>
  <tr>
   <th> Contents </th>
   <td> Black Cavendish, Virginia </td>
  </tr>
  <tr>
   <th> Flavoring </th>
   <td> Other / Misc </td>
  </tr>
 </tbody>
</table>

These attributes are not always present: sometimes I only have Brand, in other cases Brand and Flavoring.

To scrape this, I wrote code like this:

from collections import namedtuple

BlendInfo = namedtuple('BlendInfo', ['brand', 'type', 'contents', 'flavoring'])
stats_rows = soup.find('table', id='stats').find_all('tr')
# Positional lookup: assumes all four rows are present, in this order
bi = BlendInfo(brand      = stats_rows[0].td.get_text(),
               type       = stats_rows[1].td.get_text(),
               contents   = stats_rows[2].td.get_text(),
               flavoring  = stats_rows[3].td.get_text())

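But as expected, when the table is ordered differently (type before brand) or some rows are missing (no contents), it fails with an index-out-of-bounds error (or gets really messy).

Is there a better way to do this, something like:

"Give me the data from the row whose header string is 'Brand'"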

2 Answers:

Answer 0: (score: 2)

It's definitely possible. Take a look at this:

from bs4 import BeautifulSoup

html_content='''
<table class="info" id="stats">
 <tbody>
  <tr>
   <th> Brand </th>
   <td> 2 Guys Smoke Shop </td>
  </tr>
  <tr>
   <th> Blend Type </th>
   <td> Aromatic </td>
  </tr>
  <tr>
   <th> Contents </th>
   <td> Black Cavendish, Virginia </td>
  </tr>
  <tr>
   <th> Flavoring </th>
   <td> Other / Misc </td>
  </tr>
 </tbody>
</table>
'''
soup = BeautifulSoup(html_content,"lxml")
for item in soup.find_all(class_='info')[0].find_all("th"):
    header = item.text
    rows = item.find_next_sibling().text
    print(header,rows)

Output:

 Brand   2 Guys Smoke Shop 
 Blend Type   Aromatic 
 Contents   Black Cavendish, Virginia 
 Flavoring   Other / Misc
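
The loop above prints every header/value pair, whitespace included. To answer the original "give me the row whose header is 'Brand'" request directly, the same `find_next_sibling` idea can be wrapped in a small lookup function. This is only a minimal sketch: the helper name `value_for_header` is hypothetical, the sample HTML is shortened, and the built-in `html.parser` is assumed so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

html_content = """
<table class="info" id="stats">
 <tbody>
  <tr><th> Brand </th><td> 2 Guys Smoke Shop </td></tr>
  <tr><th> Flavoring </th><td> Other / Misc </td></tr>
 </tbody>
</table>
"""


def value_for_header(soup, header):
    """Return the stripped <td> text of the row whose <th> matches header, or None."""
    table = soup.find("table", id="stats")
    if table is None:
        return None
    wanted = header.strip().lower()
    for th in table.find_all("th"):
        # get_text(strip=True) drops the padding around " Brand "
        if th.get_text(strip=True).lower() == wanted:
            td = th.find_next_sibling("td")
            return td.get_text(strip=True) if td is not None else None
    return None  # header not present in this table


soup = BeautifulSoup(html_content, "html.parser")
print(value_for_header(soup, "Brand"))     # 2 Guys Smoke Shop
print(value_for_header(soup, "Contents"))  # None (row is missing)
```

Because the lookup is by header text rather than by position, row order and missing rows no longer matter.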

Answer 1: (score: 1)

This will build a dictionary for you:

from bs4 import BeautifulSoup

# Header names to keep, lowercased; the table says "Blend Type", not "Type"
valid_headers = ['brand', 'blend type', 'contents', 'flavoring']

t = """<table class="info" id="stats">
 <tbody>
  <tr>
   <th> Brand </th>
   <td> 2 Guys Smoke Shop </td>
  </tr>
  <tr>
   <th> Blend Type </th>
   <td> Aromatic </td>
  </tr>
  <tr>
   <th> Contents </th>
   <td> Black Cavendish, Virginia </td>
  </tr>
  <tr>
   <th> Flavoring </th>
   <td> Other / Misc </td>
  </tr>
 </tbody>
</table>"""

bs = BeautifulSoup(t, "html.parser")

results = {}
for row in bs.find_all('tr'):
    th = row.find('th')
    td = row.find('td')
    if th is None or td is None:
        continue  # skip rows that lack a header or a value cell
    header = th.get_text(strip=True)
    if header.lower() in valid_headers:
        results[header] = td.get_text(strip=True)

print(results)
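
If the goal is still the `BlendInfo` namedtuple from the question, the dictionary approach can feed it directly, with missing rows becoming `None`. The following is a rough sketch under stated assumptions: the `HEADER_TO_FIELD` mapping and the `parse_blend_info` function are hypothetical names, and the header spellings are taken from the sample table only.

```python
from collections import namedtuple

from bs4 import BeautifulSoup

BlendInfo = namedtuple("BlendInfo", ["brand", "type", "contents", "flavoring"])

# Assumed mapping from lowercased header text to namedtuple field name.
HEADER_TO_FIELD = {
    "brand": "brand",
    "blend type": "type",
    "contents": "contents",
    "flavoring": "flavoring",
}


def parse_blend_info(html):
    """Build a BlendInfo from the stats table; absent rows stay None."""
    soup = BeautifulSoup(html, "html.parser")
    fields = dict.fromkeys(BlendInfo._fields)  # every field starts as None
    table = soup.find("table", id="stats")
    rows = table.find_all("tr") if table is not None else []
    for row in rows:
        th, td = row.find("th"), row.find("td")
        if th is None or td is None:
            continue  # not a header/value row
        field = HEADER_TO_FIELD.get(th.get_text(strip=True).lower())
        if field is not None:
            fields[field] = td.get_text(strip=True)
    return BlendInfo(**fields)


# Rows out of order and two of them missing, as described in the question.
sample = """<table class="info" id="stats">
  <tr><th> Blend Type </th><td> Aromatic </td></tr>
  <tr><th> Brand </th><td> 2 Guys Smoke Shop </td></tr>
</table>"""
print(parse_blend_info(sample))
# BlendInfo(brand='2 Guys Smoke Shop', type='Aromatic', contents=None, flavoring=None)
```

Whether `None` is the right placeholder for a missing row depends on how the result is consumed downstream; an empty string would work the same way.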