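I am working on a problem that involves scraping web tables with Python. For some time I have been scraping what I call "standard" tables, and I feel I understand them well. I define a standard table as one with the following structure: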
<table>
<tr class="row-class">
<th>Bill</th>
<td>1</td>
<td>2</td>
<td>3</td>
<td>4</td>
</tr>
<tr class="row-class">
<th>Ben</th>
<td>2</td>
<td>3</td>
<td>4</td>
<td>1</td>
</tr>
<tr class="row-class">
<th>Barry</th>
<td>3</td>
<td>4</td>
<td>1</td>
<td>2</td>
</tr>
</table>
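I have now come across a table whose structure is slightly different, and I don't know how to get the data out of it in the format I need. The format I am now trying to scrape is: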
<table>
<tr class="row-class">
<th>Bill</th></tr>
<tr><td>1</td>
<td>2</td>
<td>3</td>
<td>4</td>
</tr>
<tr class="row-class">
<th>Ben</th></tr>
<tr>
<td>2</td>
<td>3</td>
<td>4</td>
<td>1</td>
</tr>
<tr class="row-class">
<th>Barry</th></tr>
<tr>
<td>3</td>
<td>4</td>
<td>1</td>
<td>2</td>
</tr>
</table>
The output I want to achieve is:
Bill,1,2,3,4
Ben,2,3,4,1
Barry,3,4,1,2
I assume the problem is that, because the headers are stored in separate tr rows, I only get the following output:
Bill
Ben
Barry
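I wonder whether the solution is to iterate over the rows, determine whether the next tag is a th or a td, and then act accordingly? I would appreciate any suggestions on how to modify the code I am using to test this so that it produces the desired output. The code is: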
from bs4 import BeautifulSoup
t_obj = """<tr class="row-class">
<th>Bill</th></tr>
<tr><td>1</td>
<td>2</td>
<td>3</td>
<td>4</td>
</tr>
<tr class="row-class">
<th>Ben</th></tr>
<tr>
<td>2</td>
<td>3</td>
<td>4</td>
<td>1</td>
</tr>
<tr class="row-class">
<th>Barry</th></tr>
<tr>
<td>3</td>
<td>4</td>
<td>1</td>
<td>2</td>
</tr>"""
soup = BeautifulSoup(t_obj, 'html.parser')
# only the header rows carry the "row-class" class
trs = soup.find_all("tr", {"class": "row-class"})
for tr in trs:
    for th in tr.find_all('th'):
        print(th.get_text())
    for td in tr.find_all('td'):
        print(td.get_text())
Answer 0: (score: 3)
Here I use three methods to pair the two <tr> tags together:
1. zip() with CSS selectors
2. find_next_sibling()
3. zip() with simple slicing using a custom step
from bs4 import BeautifulSoup
t_obj = """<tr class="row-class">
<th>Bill</th></tr>
<tr><td>1</td>
<td>2</td>
<td>3</td>
<td>4</td>
</tr>
<tr class="row-class">
<th>Ben</th></tr>
<tr>
<td>2</td>
<td>3</td>
<td>4</td>
<td>1</td>
</tr>
<tr class="row-class">
<th>Barry</th></tr>
<tr>
<td>3</td>
<td>4</td>
<td>1</td>
<td>2</td>
</tr>"""
soup = BeautifulSoup(t_obj, 'html.parser')
# Method 1: zip() with CSS selectors
for tr1, tr2 in zip(soup.select('tr.row-class'), soup.select('tr.row-class ~ tr:not(.row-class)')):
    print(','.join(tag.get_text() for tag in tr1.select('th') + tr2.select('td')))
print()
# Method 2: find_next_sibling()
for tr in soup.select('tr.row-class'):
    print(','.join(tag.get_text() for tag in tr.select('th') + tr.find_next_sibling('tr').select('td')))
print()
# Method 3: zip() with slicing and a custom step
trs = soup.select('tr')
for tr1, tr2 in zip(trs[::2], trs[1::2]):
    print(','.join(tag.get_text() for tag in tr1.select('th') + tr2.select('td')))
Prints:
Bill,1,2,3,4
Ben,2,3,4,1
Barry,3,4,1,2

Bill,1,2,3,4
Ben,2,3,4,1
Barry,3,4,1,2

Bill,1,2,3,4
Ben,2,3,4,1
Barry,3,4,1,2
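The question also asks whether one could walk the rows and decide per row whether it holds a th or td cells. A minimal sketch of that idea follows; the helper name `rows_by_header` and the shortened `sample` markup are illustrative assumptions, not part of the original answer. It starts a new record at each row that contains a th and appends any td cells to the most recent record, so it would also cope with the "standard" layout where th and td share a row.

```python
from bs4 import BeautifulSoup


def rows_by_header(html):
    """Group <td> cells under the most recent row that contains a <th>.

    Hypothetical helper for illustration only.
    """
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        th = tr.find("th")
        if th is not None:
            # A header cell starts a new output record.
            rows.append([th.get_text(strip=True)])
        if rows:
            # Any data cells in this row extend the current record.
            # Data rows seen before the first header are skipped.
            rows[-1].extend(td.get_text(strip=True) for td in tr.find_all("td"))
    return rows


# Shortened version of the markup from the question (assumed sample data).
sample = """<table>
<tr class="row-class"><th>Bill</th></tr>
<tr><td>1</td><td>2</td><td>3</td><td>4</td></tr>
<tr class="row-class"><th>Ben</th></tr>
<tr><td>2</td><td>3</td><td>4</td><td>1</td></tr>
</table>"""

for row in rows_by_header(sample):
    print(",".join(row))
```

Unlike the pairing approaches, this sketch does not assume that header and data rows strictly alternate, which may matter if a header is followed by several data rows.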
Answer 1: (score: 0)
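You can use indexing:
from bs4 import BeautifulSoup as soup
# `html` is assumed to hold the table markup from the question (t_obj there)
d = soup(html, 'html.parser').find_all('tr')
# pair each header row (even index) with the data row that follows it
result = [[d[i].get_text(strip=True)] + [c.get_text(strip=True) for c in d[i+1].find_all('td')] for i in range(0, len(d), 2)]
To print the result:
print('\n'.join(','.join(row) for row in result))
Output: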
Bill,1,2,3,4
Ben,2,3,4,1
Barry,3,4,1,2
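Joining cells with "," works for this data, but it would produce ambiguous lines if a cell itself contained a comma. As a hedged sketch, the standard library's csv module can write the same list-of-lists structure with quoting applied where needed; the helper name `to_csv` and the hard-coded `result` list are assumptions made for the example.

```python
import csv
import io

# Assumed sample data in the same shape as `result` above.
result = [
    ["Bill", "1", "2", "3", "4"],
    ["Ben", "2", "3", "4", "1"],
    ["Barry", "3", "4", "1", "2"],
]


def to_csv(rows):
    """Serialize rows to CSV text, quoting cells only when required."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()


print(to_csv(result), end="")
```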
Answer 2: (score: 0)
Process the HTML so that it fits the standard structure:
from simplified_scrapy.simplified_doc import SimplifiedDoc
t_obj = """<tr class="row-class">
<th>Bill</th></tr>
<tr><td>1</td>
<td>2</td>
<td>3</td>
<td>4</td>
</tr>
<tr class="row-class">
<th>Ben</th></tr>
<tr>
<td>2</td>
<td>3</td>
<td>4</td>
<td>1</td>
</tr>
<tr class="row-class">
<th>Barry</th></tr>
<tr>
<td>3</td>
<td>4</td>
<td>1</td>
<td>2</td>
</tr>"""
doc = SimplifiedDoc()
doc.loadHtml(doc.replaceReg(t_obj, r"</tr>\s*<tr>", ''))  # merge each header row with the data row after it
trs = doc.trs  # get all tr
for tr in trs:
    tds = tr.children  # get td and th
    data = [td.text for td in tds]
    print(data)
Result:
['Bill', '1', '2', '3', '4']
['Ben', '2', '3', '4', '1']
['Barry', '3', '4', '1', '2']
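The same merge-the-rows idea can be sketched without simplified_scrapy, using the standard re module for the substitution and BeautifulSoup for parsing. This swaps in a different toolchain from the one the answer uses; the helper name `merged_rows` and the shortened `sample` markup are assumptions for illustration.

```python
import re

from bs4 import BeautifulSoup


def merged_rows(html):
    """Merge each header row with the bare <tr> after it, then read the cells.

    Hypothetical helper; relies on data rows opening with a plain "<tr>"
    (no attributes) while header rows carry a class.
    """
    merged = re.sub(r"</tr>\s*<tr>", "", html)
    soup = BeautifulSoup(merged, "html.parser")
    return [
        [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in soup.find_all("tr")
    ]


# Shortened version of the markup from the question (assumed sample data).
sample = """<tr class="row-class">
<th>Bill</th></tr>
<tr><td>1</td>
<td>2</td>
</tr>
<tr class="row-class">
<th>Ben</th></tr>
<tr>
<td>2</td>
<td>3</td>
</tr>"""

for row in merged_rows(sample):
    print(row)
```

Editing HTML with a regular expression is fragile: if the data rows carried attributes, or two plain rows followed each other, the pattern would merge the wrong rows or none at all.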