I want to iterate over a BeautifulSoup result whose length changes depending on how many elements it finds matching an HTML tag.
driver.get('https://www.inspection.gc.ca/food-recall-warnings-and-allergy-alerts/2021-02-10/eng/1613010591343/1613010596418')
page_source = driver.page_source
soup = BeautifulSoup(page_source, 'html.parser')
recall_details = soup.find('table', class_ = 'table table-bordered table-condensed')
recalled_products = recall_details.find_all('td')
recalled_products
Output:
[<td>One Ocean</td>,
<td>Sliced Smoked Wild Sockeye Salmon</td>,
<td>300 g</td>,
<td>6 25984 00005 3</td>,
<td>11253</td>]
I want to iterate over each td element and append its text to lists like these:
brands = []
products = []
sizes = []
upcs = []
codes = []
brand = recalled_products[0].text
product = recalled_products[1].text
size = recalled_products[2].text
upc = recalled_products[3].text
code = recalled_products[4].text
brands.append(brand)
products.append(product)
sizes.append(size)
upcs.append(upc)
codes.append(code)
print(brands)
print(products)
print(sizes)
print(upcs)
print(codes)
Output:
['One Ocean']
['Sliced Smoked Wild Sockeye Salmon']
['300\xa0g']
['6\xa025984\xa000005\xa03']
['11253']
I tried the following code, but it did not give the result I expected. I think I need some kind of counter.
```
for i in range(len(recalled_products)):
    brand = recalled_products[i].text
    product = recalled_products[i].text
    size = recalled_products[i].text
    upc = recalled_products[i].text
    code = recalled_products[i].text
    brands.append(brand)
    products.append(product)
    sizes.append(size)
    upcs.append(upc)
    codes.append(code)
print(brands)
print(products)
print(sizes)
print(upcs)
print(codes)
```
Output:
```
['One Ocean', 'Sliced Smoked Wild Sockeye Salmon', '300\xa0g', '6\xa025984\xa000005\xa03', '11253']
['One Ocean', 'Sliced Smoked Wild Sockeye Salmon', '300\xa0g', '6\xa025984\xa000005\xa03', '11253']
['One Ocean', 'Sliced Smoked Wild Sockeye Salmon', '300\xa0g', '6\xa025984\xa000005\xa03', '11253']
['One Ocean', 'Sliced Smoked Wild Sockeye Salmon', '300\xa0g', '6\xa025984\xa000005\xa03', '11253']
['One Ocean', 'Sliced Smoked Wild Sockeye Salmon', '300\xa0g', '6\xa025984\xa000005\xa03', '11253']
```
Thanks in advance for any help.
Answer 0 (score: 2)
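The question is about the shape of the data returned by the call below: is it a nested list (the first form) or one flat list (the second form)?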
recalled_products = recall_details.find_all('td')
A = [[<td>beef</td>,
<td>250g</td>,
<td>6 25984 00005 3</td>,
<td>11253</td>],
[<td>Salmon</td>,
<td>300 g</td>,
<td>6 25984 00005 3</td>,
<td>11253</td>]]
or
B = [<td>beef</td>,
<td>250g</td>,
<td>6 25984 00005 3</td>,
<td>11253</td>,
<td>Salmon</td>,
<td>300 g</td>,
<td>6 25984 00005 3</td>,
<td>11253</td>]
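For A, index it as a two-dimensional list:
for i in range(len(recalled_products)):
    brand = recalled_products[i][0].text
    product = recalled_products[i][1].text
For B, iterate with a step equal to the number of cells per row (4 in this example; the table in the question has 5 per row):
for i in range(0, len(recalled_products), 4):
    brand = recalled_products[i].text
    product = recalled_products[i + 1].text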
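The stepping approach for the flat list can be sketched end to end as below. This is only an illustration under stated assumptions: `cell_texts` stands in for `[td.text for td in recalled_products]`, the second row is invented purely to show more than one product, and `COLUMNS_PER_ROW` is a hypothetical name for the assumed five cells per row.

```python
# Stand-in for [td.text for td in recalled_products]; the second row is made up
# only to show what happens when the table lists more than one product.
cell_texts = [
    "One Ocean", "Sliced Smoked Wild Sockeye Salmon", "300 g", "6 25984 00005 3", "11253",
    "Example Brand", "Example Product", "250 g", "0 00000 00000 0", "00000",
]

COLUMNS_PER_ROW = 5  # assumption: every table row has exactly five cells

brands, products, sizes, upcs, codes = [], [], [], [], []

# Jump through the flat list one row at a time and unpack that row's cells.
for start in range(0, len(cell_texts), COLUMNS_PER_ROW):
    brand, product, size, upc, code = cell_texts[start:start + COLUMNS_PER_ROW]
    brands.append(brand)
    products.append(product)
    sizes.append(size)
    upcs.append(upc)
    codes.append(code)

print(brands)
print(codes)
```

If a row ever has fewer than five cells, the unpacking raises a `ValueError`, which at least makes a malformed table visible instead of silently shifting columns.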
Answer 1 (score: 1)
This is how I would get at the markup.
from bs4 import BeautifulSoup
import requests
URL = "https://www.inspection.gc.ca/food-recall-warnings-and-allergy-alerts/2021-02-10/eng/1613010591343/1613010596418"
brands = []
products = []
sizes = []
upcs = []
codes = []
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
recall_details = soup.find("table", class_="table table-bordered table-condensed")
body = recall_details.find("tbody")
rows = body.find_all("tr")
for row in rows:
    data = row.find_all("td")
    brands.append(data[0].text)
    products.append(data[1].text)
    sizes.append(data[2].text)
    upcs.append(data[3].text)
    codes.append(data[4].text)
This prints:
['One Ocean']
['Sliced Smoked Wild Sockeye Salmon']
['300\xa0g']
['6\xa025984\xa000005\xa03']
['11253']
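I do think a dict is a better data structure here than several parallel lists, but of course that depends on your use case.
If you want to go that way, you can change the code like this: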
recalled = []
...
for row in rows:
    data = row.find_all("td")
    item = {
        "brand": data[0].text,
        "products": data[1].text,
        "sizes": data[2].text,
        "upcs": data[3].text,
        "codes": data[4].text,
    }
    recalled.append(item)
This prints:
[{'brand': 'One Ocean', 'products': 'Sliced Smoked Wild Sockeye Salmon', 'sizes': '300\xa0g', 'upcs': '6\xa025984\xa000005\xa03', 'codes': '11253'}]
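As a variation on the dict approach, the sketch below pairs a list of key names with each row's cells using `zip`, and swaps the non-breaking spaces (`\xa0`) seen in the output for ordinary spaces. It runs offline against a trimmed-down HTML string that is only a hypothetical stand-in for the real page, and `FIELDS` is an illustrative name, not something from the original code.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down stand-in for the recall table so the sketch runs offline.
html = """
<table class="table table-bordered table-condensed">
  <thead>
    <tr><th>Brand</th><th>Product</th><th>Size</th><th>UPC</th><th>Codes</th></tr>
  </thead>
  <tbody>
    <tr><td>One Ocean</td><td>Sliced Smoked Wild Sockeye Salmon</td>
        <td>300&nbsp;g</td><td>6&nbsp;25984&nbsp;00005&nbsp;3</td><td>11253</td></tr>
  </tbody>
</table>
"""

# Same keys as the dict in the answer above, in column order.
FIELDS = ["brand", "products", "sizes", "upcs", "codes"]

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="table table-bordered table-condensed")

recalled = []
for row in table.find("tbody").find_all("tr"):
    # Read each cell's text, replacing non-breaking spaces with ordinary ones.
    cells = [td.get_text().replace("\xa0", " ").strip() for td in row.find_all("td")]
    recalled.append(dict(zip(FIELDS, cells)))

print(recalled)
```

Because `zip` stops at the shorter input, a row with missing cells produces a dict with fewer keys rather than an `IndexError`; whether that is preferable depends on how strict you want the scraper to be.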
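Answer 2 (score: 0)
It seems to me that you need to build a spreadsheet to hold the data you want to store. You can do this with a library called openpyxl: create columns for brand, product, size, UPC, and code, then store the results from the BeautifulSoup object in the spreadsheet.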
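A minimal sketch of that idea with openpyxl might look like the following. The `rows` list stands in for whatever the scraping step produced, and the sheet title, column headers, and output path are all hypothetical choices made for illustration.

```python
import os
import tempfile

from openpyxl import Workbook, load_workbook

# Rows as they might come out of the scraping step (one tuple per recalled product).
rows = [
    ("One Ocean", "Sliced Smoked Wild Sockeye Salmon", "300 g", "6 25984 00005 3", "11253"),
]

workbook = Workbook()
sheet = workbook.active
sheet.title = "Recalls"  # hypothetical sheet name

# The first row holds the column headers; each later row is one product.
sheet.append(["Brand", "Product", "Size", "UPC", "Code"])
for row in rows:
    sheet.append(row)

# Hypothetical output location; a temp directory keeps the sketch side-effect free.
path = os.path.join(tempfile.mkdtemp(), "recalls.xlsx")
workbook.save(path)

# Read the file back to confirm what was written.
saved = list(load_workbook(path).active.iter_rows(values_only=True))
print(saved)
```

The UPC and code are kept as strings, so the spreadsheet stores them as text and preserves any spacing or leading zeros.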