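I am trying to scrape data from a website with Beautiful Soup. When I check the result for a single product, the values are all correct, but when I run the code in a loop, my kernel crashes and keeps restarting.
I suspect the problem is in the last steps, where the loop runs.
Since I don't have access to my own machine, I am running the code in the Jupyter Notebook on IBM's Developer Skills Network. I also tried it through another online Jupyter link, but that didn't help.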
#importing relevant packages & installing seaborn & beautiful soup (bs4)
!pip install seaborn
!pip install bs4
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = "https://www.galeria-kaufhof.de/zuhause/bad-schlafzimmer/bettwaesche-bettlaken/"
html = urlopen(url)
soup = BeautifulSoup(html, 'html')
type(soup)
This is where the problem starts:
#find all div classes in the website - or number of products displayed on one page:
product_container = soup.find_all('div', class_ = 'gk-article gk-article--big')
print(type(product_container))
print(len(product_container))
# Lists to store the scraped data in
brand_name = []
mini_description =[]
description = []
old_price = []
new_price = []
colour_variants = []
#Extract data from each product container:
for container in product_container:
    #extract the brand_name:
    brand_name = soup.find(class_="gk-article__brand")
    Brand = brand_name.text
    brand_name.append(Brand)
    #extract the mini_description:
    name = soup.find(class_="gk-article__name")
    item_name = name.text
    mini_description.append(item_name)
    #extract the full description:
    description = soup.find(class_="gk-article__description")
    item_desc = description.meta
    item_description = item_desc.get("content")
    description.append(item_description)
    #extract original price:
    Old_Price = soup.find(class_="gk-article__offers")
    oprice = Old_Price.meta
    o_price = oprice.get("content")
    old_price.append(o_price)
    #extract new price (if discounted):
    new_price = soup.find(class_="gk-article__offers")
    nprice = new_price.meta
    n_price = nprice.get("content")
    new_price.append(n_price)
    #extract all variants:
    variants = soup.find(class_='gk-article__variants').find_all("img")
    for img in variants:
        x = (img.get("alt"))
        colour_variants.append(x)
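Could someone take a look at why this code isn't working? I also suspect there is a problem with the lengths of the arrays. Each product has a different number of colour variants, so an entry does not necessarily fit in a single row. I don't know how to solve this.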
#converting into a Dataframe:
import pandas as pd
test_df = pd.DataFrame({'brand': brand_name,
                        'headline': mini_description,
                        'description': description,
                        'oldprice': old_price,
                        'newprice': new_price,
                        'colour_option': colour_variants,
                        })
print(test_df.info())
test_df
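This error pops up:
"Kernel Restarting. The kernel for web scraping galeria kaufhof.ipynb appears to have died. It will restart automatically."
Please help me solve this problem.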
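A minimal sketch of one possible fix, offered under assumptions rather than as a confirmed diagnosis. In the loop above, names such as brand_name, description and new_price start out as lists but are rebound to the Tag objects returned by soup.find, so the later .append calls act on the parsed tree instead of on a list. The lookups also go through soup rather than container, so every iteration reads the first product on the page. Either of these may be what trips up the DataFrame step. The sketch below searches inside each container, keeps separate names for tags and lists, and stores the colour variants as one list per product so every product yields exactly one record. SAMPLE_HTML, text_of, meta_content_of and parse_products are hypothetical names made up for illustration; only the CSS class names come from the code above, and the real page markup may differ (for example in how a discounted price is marked up, which is why only a single price field is read here).

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page. Only the class names are taken
# from the question; the structure and values are invented for illustration.
SAMPLE_HTML = """
<div class="gk-article gk-article--big">
  <div class="gk-article__brand">BrandA</div>
  <div class="gk-article__name">Duvet cover</div>
  <div class="gk-article__description"><meta content="Soft cotton duvet cover"></div>
  <div class="gk-article__offers"><meta content="29.99"></div>
  <div class="gk-article__variants"><img alt="white"><img alt="blue"></div>
</div>
<div class="gk-article gk-article--big">
  <div class="gk-article__brand">BrandB</div>
  <div class="gk-article__name">Fitted sheet</div>
  <div class="gk-article__description"><meta content="Jersey fitted sheet"></div>
  <div class="gk-article__offers"><meta content="14.99"></div>
</div>
"""


def text_of(container, class_name):
    """Return the stripped text of the first match inside container, or None."""
    tag = container.find(class_=class_name)
    return tag.get_text(strip=True) if tag else None


def meta_content_of(container, class_name):
    """Return the content attribute of the first <meta> inside the match, or None."""
    tag = container.find(class_=class_name)
    meta = tag.find("meta") if tag else None
    return meta.get("content") if meta else None


def parse_products(html):
    """Build one dict per product container, searching only inside that container."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for container in soup.find_all("div", class_="gk-article--big"):
        variants_tag = container.find(class_="gk-article__variants")
        # A product without a variants block gets an empty list, not an error.
        colours = (
            [img.get("alt") for img in variants_tag.find_all("img")]
            if variants_tag
            else []
        )
        records.append({
            "brand": text_of(container, "gk-article__brand"),
            "headline": text_of(container, "gk-article__name"),
            "description": meta_content_of(container, "gk-article__description"),
            "price": meta_content_of(container, "gk-article__offers"),
            "colour_option": colours,
        })
    return records


records = parse_products(SAMPLE_HTML)
for record in records:
    print(record)
```

Because each product produces exactly one record, pd.DataFrame(records) should give one row per product with colour_option held as a list, which sidesteps the unequal-length problem. If one row per colour variant is preferred, calling explode("colour_option") on the resulting DataFrame is one option.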