Question

I want to iterate over a BeautifulSoup result whose length changes depending on how many elements it finds matching an HTML tag.

driver.get('https://www.inspection.gc.ca/food-recall-warnings-and-allergy-alerts/2021-02-10/eng/1613010591343/1613010596418')
page_source = driver.page_source

soup = BeautifulSoup(page_source, 'html.parser')
recall_details = soup.find('table', class_ = 'table table-bordered table-condensed')

recalled_products = recall_details.find_all('td')
recalled_products

Output:

[<td>One Ocean</td>,
 <td>Sliced Smoked  Wild Sockeye Salmon</td>,
 <td>300 g</td>,
 <td>6 25984 00005 3</td>,
 <td>11253</td>]

I want to iterate over each td element and append its text to lists like this:

brands = []
products = []
sizes = []
upcs = []
codes = []

brand = recalled_products[0].text
product = recalled_products[1].text
size = recalled_products[2].text
upc = recalled_products[3].text
code = recalled_products[4].text
brands.append(brand)
products.append(product)
sizes.append(size)
upcs.append(upc)
codes.append(code)

print(brands)
print(products)
print(sizes)
print(upcs)
print(codes)

Output:

['One Ocean']
['Sliced Smoked  Wild Sockeye Salmon']
['300\xa0g']
['6\xa025984\xa000005\xa03']
['11253']

I tried the following code, but I didn't get the expected result. I think I need some kind of counter.

for i in range(len(recalled_products)):
    brand = recalled_products[i].text
    product = recalled_products[i].text
    size = recalled_products[i].text
    upc = recalled_products[i].text
    code = recalled_products[i].text
    brands.append(brand)
    products.append(product)
    sizes.append(size)
    upcs.append(upc)
    codes.append(code)

print(brands)
print(products)
print(sizes)
print(upcs)
print(codes)

Output:

['One Ocean', 'Sliced Smoked  Wild Sockeye Salmon', '300\xa0g', '6\xa025984\xa000005\xa03', '11253']
['One Ocean', 'Sliced Smoked  Wild Sockeye Salmon', '300\xa0g', '6\xa025984\xa000005\xa03', '11253']
['One Ocean', 'Sliced Smoked  Wild Sockeye Salmon', '300\xa0g', '6\xa025984\xa000005\xa03', '11253']
['One Ocean', 'Sliced Smoked  Wild Sockeye Salmon', '300\xa0g', '6\xa025984\xa000005\xa03', '11253']
['One Ocean', 'Sliced Smoked  Wild Sockeye Salmon', '300\xa0g', '6\xa025984\xa000005\xa03', '11253']

Here is sample HTML from the website.

Thanks in advance for any help.

Answer 1

The question is what shape of data is returned by

recalled_products = recall_details.find_all('td') 

A = [[<td>beef</td>,
     <td>250g</td>,
     <td>6 25984 00005 3</td>,
     <td>11253</td>],
     [<td>Salmon</td>,
     <td>300 g</td>,
     <td>6 25984 00005 3</td>,
     <td>11253</td>]]

or

B = [<td>beef</td>,
     <td>250g</td>,
     <td>6 25984 00005 3</td>,
     <td>11253</td>,
     <td>Salmon</td>,
     <td>300 g</td>,
     <td>6 25984 00005 3</td>,
     <td>11253</td>]

For A, index it as a two-dimensional array:

for i in range(len(recalled_products)):
    brand = recalled_products[i][0].text
    product = recalled_products[i][1].text

For B, use a step in the iteration:

for i in range(0, len(recalled_products), 4):
    brand = recalled_products[i].text
    product = recalled_products[i+1].text
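
Since find_all('td') returns a flat list of tags, case B is the one that applies to the question's code. The step-based loop can be sketched end to end as follows. This is only a sketch, assuming five cells per row as in the question's table (brand, product, size, UPC, code); the plain strings in `cells` stand in for the `.text` of each td, and the second row is invented for illustration.

```python
# Stand-in for [td.text for td in recalled_products]; the second row is invented.
cells = [
    "One Ocean", "Sliced Smoked Wild Sockeye Salmon", "300 g", "6 25984 00005 3", "11253",
    "Example Brand", "Example Product", "250 g", "0 00000 00000 0", "00000",
]

COLUMNS_PER_ROW = 5  # assumption: every table row has exactly five cells

brands, products, sizes, upcs, codes = [], [], [], [], []

# Step through the flat list one row (five cells) at a time.
for i in range(0, len(cells), COLUMNS_PER_ROW):
    brand, product, size, upc, code = cells[i:i + COLUMNS_PER_ROW]
    brands.append(brand)
    products.append(product)
    sizes.append(size)
    upcs.append(upc)
    codes.append(code)

print(brands)
print(codes)
```

If the number of cells is not a multiple of five, the unpacking on the last slice raises a ValueError, which is probably preferable to silently misaligned columns.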

Answer 2

This is how I fetch and parse the markup.

from bs4 import BeautifulSoup
import requests

URL = "https://www.inspection.gc.ca/food-recall-warnings-and-allergy-alerts/2021-02-10/eng/1613010591343/1613010596418"

brands = []
products = []
sizes = []
upcs = []
codes = []

page = requests.get(URL)

soup = BeautifulSoup(page.content, "html.parser")

recall_details = soup.find("table", class_="table table-bordered table-condensed")

body = recall_details.find("tbody")

rows = body.find_all("tr")

for row in rows:
    data = row.find_all("td")
    brands.append(data[0].text)
    products.append(data[1].text)
    sizes.append(data[2].text)
    upcs.append(data[3].text)
    codes.append(data[4].text)

Printing the five lists gives:

['One Ocean']
['Sliced Smoked  Wild Sockeye Salmon']
['300\xa0g']
['6\xa025984\xa000005\xa03']
['11253']
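
The five append calls can also be collapsed with zip, which transposes a list of rows into one tuple per column. A minimal sketch, assuming each row has already been reduced to a list of five strings; `rows_text` is a hypothetical stand-in for `[[td.text for td in row.find_all("td")] for row in rows]`, and its second row is invented.

```python
# Hypothetical stand-in for the text of each row's cells; the second row is invented.
rows_text = [
    ["One Ocean", "Sliced Smoked  Wild Sockeye Salmon", "300\xa0g", "6\xa025984\xa000005\xa03", "11253"],
    ["Example Brand", "Example Product", "250\xa0g", "0\xa000000\xa000000\xa00", "00000"],
]

# zip(*rows_text) yields one tuple per column; convert each to a list.
brands, products, sizes, upcs, codes = (list(column) for column in zip(*rows_text))

print(brands)
print(sizes)
```

This assumes at least one row and exactly five cells per row; with an empty table the unpacking fails because zip produces no columns.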

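I do think a dict is a better data structure than multiple lists, but of course that depends on your use case.

If you want to go that way, you can change the code like this: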


recalled = []

...

for row in rows:
    data = row.find_all("td")
    item = {
        "brand": data[0].text,
        "products": data[1].text,
        "sizes": data[2].text,
        "upcs": data[3].text,
        "codes": data[4].text,
    }
    recalled.append(item)

Printing recalled gives:

[{'brand': 'One Ocean', 'products': 'Sliced Smoked  Wild Sockeye Salmon', 'sizes': '300\xa0g', 'upcs': '6\xa025984\xa000005\xa03', 'codes': '11253'}]
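
The \xa0 characters in the output are non-breaking spaces from the page. They can be normalized while building each dict. This is a sketch under the same five-column assumption; `FIELDS`, `clean`, and `rows_text` are hypothetical names, and the keys simply mirror the ones used above.

```python
FIELDS = ("brand", "products", "sizes", "upcs", "codes")  # same keys as the dict above


def clean(text):
    """Turn non-breaking spaces into plain spaces and collapse repeated whitespace."""
    return " ".join(text.replace("\xa0", " ").split())


# Hypothetical stand-in for the text of each row's cells.
rows_text = [
    ["One Ocean", "Sliced Smoked  Wild Sockeye Salmon", "300\xa0g", "6\xa025984\xa000005\xa03", "11253"],
]

# Pair each field name with the cleaned cell text, one dict per row.
recalled = [dict(zip(FIELDS, map(clean, row))) for row in rows_text]

print(recalled)
```

Note that zip stops at the shorter input, so a row with fewer than five cells would produce a dict with missing keys rather than an error.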

Answer 3

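It seems to me that you need to build a spreadsheet to hold the data you want to store. You can do this with a library called openpyxl: create columns for brand, product, size, UPC, and code, then store the results from the BeautifulSoup object in the spreadsheet.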
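
As a dependency-free variant of the same idea, the rows can be written to a CSV file, which spreadsheet programs open directly. Note that this sketch swaps openpyxl for the standard library's csv module; the column names and the sample row come from the question, and `write_recalls` is a hypothetical helper.

```python
import csv
import io

HEADER = ["brand", "product", "size", "upc", "code"]


def write_recalls(rows, stream):
    """Write a header line followed by one CSV line per recalled product."""
    writer = csv.writer(stream)
    writer.writerow(HEADER)
    writer.writerows(rows)


# Sample row taken from the question (non-breaking spaces already replaced).
rows = [["One Ocean", "Sliced Smoked Wild Sockeye Salmon", "300 g", "6 25984 00005 3", "11253"]]

# Write to an in-memory buffer here; a real file object works the same way.
buffer = io.StringIO()
write_recalls(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write an actual file, the same helper could be called inside `with open("recalls.csv", "w", newline="", encoding="utf-8") as f:`, where the file name is just an example.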

Iterate over a list that changes length and append to another list
