Python web scraping a table with multiple header rows

Time: 2019-12-20 23:51:05

Tags: python web-scraping beautifulsoup

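I am working on a problem that involves scraping a web table with Python. I have been scraping what I call "standard" tables for some time and feel I understand them well. I define a standard table as one with the following structure: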

<table>
<tr class="row-class">
  <th>Bill</th>
  <td>1</td>
  <td>2</td>
  <td>3</td>
  <td>4</td>
</tr>
<tr class="row-class">
  <th>Ben</th>
  <td>2</td>
  <td>3</td>
  <td>4</td>
  <td>1</td>
</tr>
<tr class="row-class">
  <th>Barry</th>
  <td>3</td>
  <td>4</td>
  <td>1</td>
  <td>2</td>
</tr>
</table>

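I have now come across a table whose structure is slightly different, and I don't know how to get the data out of it in the format I need. The structure I am now trying to scrape is: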

<table>
<tr class="row-class">
  <th>Bill</th></tr>
  <tr><td>1</td>
  <td>2</td>
  <td>3</td>
  <td>4</td>
</tr>
<tr class="row-class">
  <th>Ben</th></tr>
  <tr>
  <td>2</td>
  <td>3</td>
  <td>4</td>
  <td>1</td>
</tr>
<tr class="row-class">
  <th>Barry</th></tr>
  <tr>
  <td>3</td>
  <td>4</td>
  <td>1</td>
  <td>2</td>
</tr>
</table>

The output I want to achieve is:

Bill,1,2,3,4
Ben,2,3,4,1
Barry,3,4,1,2

I assume the problem is that, because the headers are stored in separate tr rows, I only get the following output:

Bill
Ben
Barry

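I wonder whether the solution is to iterate over the rows and determine whether the next tag is a th or a td, then act accordingly? I would appreciate any suggestions on how to modify the code I am using to test this so that it produces the desired output. The code is: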

from bs4 import BeautifulSoup

t_obj = """<tr class="row-class">
  <th>Bill</th></tr>
  <tr><td>1</td>
  <td>2</td>
  <td>3</td>
  <td>4</td>
</tr>
<tr class="row-class">
  <th>Ben</th></tr>
  <tr>
  <td>2</td>
  <td>3</td>
  <td>4</td>
  <td>1</td>
</tr>
<tr class="row-class">
  <th>Barry</th></tr>
  <tr>
  <td>3</td>
  <td>4</td>
  <td>1</td>
  <td>2</td>
</tr>"""


soup = BeautifulSoup(t_obj)

trs = soup.find_all("tr", {"class":"row-class"})

for tr in trs:
    for th in tr.findAll('th'):
        print (th.get_text())
        for td in tr.findAll('td'):
            print(td.get_text())
            print(td.get_text())

3 answers:

Answer 0 (score: 3)

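Here are 3 ways to pair the two <tr> tags together:

  • The first method uses zip() and CSS selectors
  • The second method uses BeautifulSoup's find_next_sibling() method
  • The third method uses zip() with simple slicing and a custom step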

from bs4 import BeautifulSoup

t_obj = """<tr class="row-class">
  <th>Bill</th></tr>
  <tr><td>1</td>
  <td>2</td>
  <td>3</td>
  <td>4</td>
</tr>
<tr class="row-class">
  <th>Ben</th></tr>
  <tr>
  <td>2</td>
  <td>3</td>
  <td>4</td>
  <td>1</td>
</tr>
<tr class="row-class">
  <th>Barry</th></tr>
  <tr>
  <td>3</td>
  <td>4</td>
  <td>1</td>
  <td>2</td>
</tr>"""


soup = BeautifulSoup(t_obj, 'html.parser')

# Method 1: zip() the header rows with the data rows, both picked by CSS selectors
for tr1, tr2 in zip(soup.select('tr.row-class'), soup.select('tr.row-class ~ tr:not(.row-class)')):
    print(','.join(tag.get_text() for tag in tr1.select('th') + tr2.select('td')))

print()

# Method 2: from each header row, step to the <tr> that follows it
for tr in soup.select('tr.row-class'):
    print(','.join(tag.get_text() for tag in tr.select('th') + tr.find_next_sibling('tr').select('td')))

print()

# Method 3: slice the full row list into even (header) and odd (data) positions
trs = soup.select('tr')
for tr1, tr2 in zip(trs[::2], trs[1::2]):
    print(','.join(tag.get_text() for tag in tr1.select('th') + tr2.select('td')))

Prints:

Bill,1,2,3,4
Ben,2,3,4,1
Barry,3,4,1,2

Bill,1,2,3,4
Ben,2,3,4,1
Barry,3,4,1,2

Bill,1,2,3,4
Ben,2,3,4,1
Barry,3,4,1,2
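
The idea raised in the question, walking the rows and checking whether each one holds a th or a td, can also be sketched directly. This is only an illustration: the helper name collect_rows is made up here, the markup is a shortened copy of the question's sample, and it assumes every data row belongs to the nearest header row above it.

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question (two of the three groups).
t_obj = """<tr class="row-class">
  <th>Bill</th></tr>
  <tr><td>1</td>
  <td>2</td>
  <td>3</td>
  <td>4</td>
</tr>
<tr class="row-class">
  <th>Ben</th></tr>
  <tr>
  <td>2</td>
  <td>3</td>
  <td>4</td>
  <td>1</td>
</tr>"""


def collect_rows(markup):
    """Hypothetical helper: attach <td> cells to the latest <th> seen."""
    soup = BeautifulSoup(markup, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        header = tr.find("th")
        if header is not None:
            # A row containing a <th> starts a new output record.
            rows.append([header.get_text(strip=True)])
        if rows:
            # Any <td> cells in this row extend the most recent record.
            rows[-1].extend(td.get_text(strip=True) for td in tr.find_all("td"))
    return rows


rows = collect_rows(t_obj)
for row in rows:
    print(",".join(row))
```

Because the cells are appended to the most recent record rather than paired by position, the same loop should also cope with the "standard" layout, where the th and td cells share one row.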

Answer 1 (score: 0)

You can use indexing:

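from bs4 import BeautifulSoup as soup
# html is assumed to hold the markup string from the question (t_obj there)
d = soup(html, 'html.parser').find_all('tr')
# strip=True drops the newline and indentation around each header cell
result = [[d[i].get_text(strip=True)]+[c.text for c in d[i+1].find_all('td')] for i in range(0, len(d), 2)]

To print the result:

print('\n'.join(','.join(row) for row in result))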

Output:

Bill,1,2,3,4
Ben,2,3,4,1
Barry,3,4,1,2
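
If the rows are meant to end up in a file rather than on screen, the standard csv module may be a safer way to join them, since it quotes any cell that happens to contain a comma. A minimal sketch, assuming the rows are already plain lists of strings; the in-memory buffer and the out.csv name in the comment are placeholders, not part of the answer above.

```python
import csv
import io

# Rows in the shape the question asks for: header cell first, then data cells.
result = [
    ["Bill", "1", "2", "3", "4"],
    ["Ben", "2", "3", "4", "1"],
    ["Barry", "3", "4", "1", "2"],
]

# StringIO stands in for a real file such as open("out.csv", "w", newline="").
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerows(result)

csv_text = buffer.getvalue()
print(csv_text, end="")
```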

Answer 2 (score: 0)

Preprocess the HTML so that it fits the standard table structure:

from simplified_scrapy.simplified_doc import SimplifiedDoc 
t_obj = """<tr class="row-class">
  <th>Bill</th></tr>
  <tr><td>1</td>
  <td>2</td>
  <td>3</td>
  <td>4</td>
</tr>
<tr class="row-class">
  <th>Ben</th></tr>
  <tr>
  <td>2</td>
  <td>3</td>
  <td>4</td>
  <td>1</td>
</tr>
<tr class="row-class">
  <th>Barry</th></tr>
  <tr>
  <td>3</td>
  <td>4</td>
  <td>1</td>
  <td>2</td>
</tr>"""
doc = SimplifiedDoc()
doc.loadHtml(doc.replaceReg(t_obj, r"</tr>\s*<tr>", ''))  # merge each header row with the data row after it
trs = doc.trs  # get all tr
for tr in trs:
  tds = tr.children  # get td and th
  data = [td.text for td in tds]
  print(data)

Result:

['Bill', '1', '2', '3', '4']
['Ben', '2', '3', '4', '1']
['Barry', '3', '4', '1', '2']
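
The same merge-then-parse idea can be sketched with the standard re module and BeautifulSoup, for anyone who prefers not to add another dependency. This is an illustration only: the markup is a shortened copy of the question's sample, and it assumes the data rows are written exactly as a bare <tr> with no attributes, which is what the pattern relies on.

```python
import re

from bs4 import BeautifulSoup

# Shortened copy of the markup from the question (two of the three groups).
t_obj = """<tr class="row-class">
  <th>Bill</th></tr>
  <tr><td>1</td>
  <td>2</td>
  <td>3</td>
  <td>4</td>
</tr>
<tr class="row-class">
  <th>Ben</th></tr>
  <tr>
  <td>2</td>
  <td>3</td>
  <td>4</td>
  <td>1</td>
</tr>"""

# Remove each "</tr> <tr>" boundary so a header row and the bare row after it
# become one row. Rows that open with attributes (<tr class=...>) do not match.
merged = re.sub(r"</tr>\s*<tr>", "", t_obj)

soup = BeautifulSoup(merged, "html.parser")
data = [
    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in soup.find_all("tr")
]

for row in data:
    print(row)
```

Editing markup with a regular expression is brittle in general, so this is probably only reasonable when the page's row markup is as regular as in the sample.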