Question

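I am using BeautifulSoup to pull the elements of an HTML table into a Python dict. The problem is that when I create the dict, the first record in the table is loaded into it over and over. Printing the variable rows shows the expected number of distinct records returned in the response, but only the first record is printed when print(d) is called.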

import requests
from bs4 import BeautifulSoup as bs

url = 'http://host.com/user_activity?page=3'
r = requests.get(url)
#print(r.text)

soup = bs(r.text, 'lxml')
table = soup.find_all('table')[0]
rows = table.find_all('td')
#records = soup.find_all('td')


#print(table.prettify())

ct=0
for record in rows :
    if ct < 20:
        keys = [th.get_text(strip=True)for th in table.find_all('th')]
        values = [td.get_text(strip=True) for td in rows]
        d = dict(zip(keys, values))
        print(d)
        ct+=1

Answer 1

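I think you meant to get the header cells from the first row of the table (once, before the loop) and to iterate over the tr elements instead of the td elements.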

You can also use a plain find() instead of find_all()[0], and enumerate() to handle the loop counter more cleanly:

table = soup.find('table')
rows = table.find_all('tr')

headers = [th.get_text(strip=True) for th in rows[0].find_all('th')]

for ct, row in enumerate(rows[1:]):
    values = [td.get_text(strip=True) for td in row.find_all('td')]

    d = dict(zip(headers, values))
    print(d)
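
To see why the original loop kept printing the same record: zip() stops at the shortest of its inputs, so pairing the header names with the flat list of every td in the table only ever consumes the first few cells, no matter how many times the loop runs. Below is a minimal sketch with plain lists; the header names and cell values are made up for illustration and do not come from the real page.

```python
# Hypothetical header names and cell text, standing in for the scraped table.
headers = ["user", "action", "time"]
all_cells = ["alice", "login", "09:00",
             "bob", "logout", "09:05"]

# Pairing headers with the flat list of all cells: zip() stops after
# len(headers) items, so only the first row's cells are ever used.
first_only = dict(zip(headers, all_cells))

# Grouping the cells per row first gives one dict per record.
width = len(headers)
rows = [all_cells[i:i + width] for i in range(0, len(all_cells), width)]
records = [dict(zip(headers, row)) for row in rows]

print(first_only)
print(records)
```

Iterating over the tr elements, as in the code above, does this per-row grouping for you, because each row.find_all('td') call returns only that row's cells.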

Answer 2

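In addition to what alecxe has already shown, you can do this with selectors. Just make sure the table index is correct, whether that is the first table, the second, or whichever one you want to parse.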
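
The code from this answer is not preserved here, so the following is only a sketch of what a selector-based version might look like, not the answerer's original. It assumes a small made-up HTML snippet in place of the real page and uses the built-in html.parser so that it runs without lxml; the column names and cell values are hypothetical.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for the page in the question.
html = """
<table>
  <tr><th>user</th><th>action</th></tr>
  <tr><td>alice</td><td>login</td></tr>
  <tr><td>bob</td><td>logout</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# select() returns every match in document order; index 0 is the first table.
# Change the index if the table you want is not the first one on the page.
table = soup.select("table")[0]
rows = table.select("tr")

# Header names come from the first row, read once before the loop.
headers = [th.get_text(strip=True) for th in rows[0].select("th")]

# One dict per remaining row, built from that row's own td cells.
records = [
    dict(zip(headers, (td.get_text(strip=True) for td in row.select("td"))))
    for row in rows[1:]
]

for record in records:
    print(record)
```

If the page has only one table, soup.select_one("table") avoids the index altogether.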


Python web scraping script does not iterate over the HTML table correctly
