Kernel keeps restarting when web scraping with BeautifulSoup

Date: 2019-08-05 11:04:09

Tags: python web-scraping beautifulsoup

I am trying to scrape data from a website with Beautiful Soup. When I check the results for a single product everything comes back correctly, but when I run the same extraction in a loop my kernel crashes and keeps restarting.

I suspect the problem is in the last step, where the loop runs.

Since I cannot access my own machine, I am running the code in a Jupyter Notebook on IBM's Developer Skills Network. I also tried another online Jupyter service, but that did not help.

#importing relevant packages & installing seaborn & beautiful soup (bs4)
!pip install seaborn
!pip install bs4
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = "https://www.galeria-kaufhof.de/zuhause/bad-schlafzimmer/bettwaesche-bettlaken/"
html = urlopen(url)

soup = BeautifulSoup(html, 'html')
type(soup)
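A side note on the parser argument: passing `'html'` lets BeautifulSoup pick whichever HTML parser happens to be installed (lxml, html5lib, or the stdlib one), so behaviour can differ between environments. Naming `"html.parser"` explicitly is more reproducible. A minimal sketch with a made-up snippet in place of the live page:

```python
from bs4 import BeautifulSoup

# "html" asks bs4 for any installed HTML parser; naming the stdlib
# parser explicitly makes the behaviour the same on every machine.
doc = "<div class='gk-article__brand'>Example</div>"
soup = BeautifulSoup(doc, "html.parser")
print(soup.find(class_="gk-article__brand").text)
```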

This is where the problem starts:

# Find all product containers on the page (one per displayed product):
product_container = soup.find_all('div', class_ = 'gk-article gk-article--big')
print(type(product_container))
print(len(product_container))

# Lists to store the scraped data in
brand_name = []
mini_description =[]
description = []
old_price = []
new_price = []
colour_variants = []

#Extract data from each product container:

for container in product_container:

#extract the brand_name:

    brand_name= soup.find(class_="gk-article__brand")
    Brand=brand_name.text
    brand_name.append(Brand)

#extract the mini_description:

    name= soup.find(class_="gk-article__name")
    item_name=name.text
    mini_description.append(item_name)

#extract the full description:

    description= soup.find(class_="gk-article__description")
    item_desc= description.meta
    item_description=item_desc.get("content")
    description.append(item_description)

#extract original price:

    Old_Price= soup.find(class_="gk-article__offers")
    oprice= Old_Price.meta
    o_price=oprice.get("content")
    old_price.append(o_price)

#extract new price (if discounted):

    new_price= soup.find(class_="gk-article__offers")
    nprice = new_price.meta
    n_price= nprice.get("content")
    new_price.append(n_price)

#extract all variants:
    variants= soup.find(class_='gk-article__variants').find_all("img")
    for img in variants:
        x = (img.get("alt"))
        colour_variants.append(x) 

Can someone take a look at why this code is not working? I also suspect there is a problem with the lengths of the arrays: each product has a different number of colour variants, so the entries do not necessarily fit into one row each. I don't know how to solve that.
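A likely cause of the kernel death, stated here as an assumption: inside the loop, `brand_name = soup.find(...)` rebinds the list name to a bs4 `Tag`, so `brand_name.append(Brand)` calls `Tag.append`, which inserts the text back into the parse tree on every iteration (the same shadowing happens with `description` and `new_price`). In addition, `soup.find(...)` always returns the first match on the whole page instead of data from the current `container`. A sketch of a corrected loop, using the class names from the code above and a made-up HTML snippet in place of the live page:

```python
from bs4 import BeautifulSoup

# Minimal stand-in for the real page, assuming the class structure
# used in the question (the snippet itself is invented).
html = """
<div class="gk-article gk-article--big">
  <span class="gk-article__brand">BrandA</span>
  <span class="gk-article__name">Duvet Cover</span>
  <div class="gk-article__variants"><img alt="blue"><img alt="red"></div>
</div>
<div class="gk-article gk-article--big">
  <span class="gk-article__brand">BrandB</span>
  <span class="gk-article__name">Bed Sheet</span>
  <div class="gk-article__variants"><img alt="white"></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

brands, names, variants = [], [], []
for container in soup.find_all("div", class_="gk-article gk-article--big"):
    # Search *within* the current container, and never reuse a list
    # name for the Tag returned by find().
    brand_tag = container.find(class_="gk-article__brand")
    brands.append(brand_tag.text if brand_tag else None)

    name_tag = container.find(class_="gk-article__name")
    names.append(name_tag.text if name_tag else None)

    # Join the colour variants into one string so every list keeps
    # exactly one entry per product.
    imgs = container.find(class_="gk-article__variants")
    colours = [img.get("alt") for img in imgs.find_all("img")] if imgs else []
    variants.append(", ".join(colours))

print(brands)
```

Joining the variants per product also sidesteps the unequal-length problem mentioned above, since all lists end up with one entry per product.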

# Convert the lists into a DataFrame:
import pandas as pd
test_df = pd.DataFrame({'brand': brand_name,
                        'headline': mini_description,
                        'description': description,
                        'oldprice': old_price,
                        'newprice': new_price,
                        'colour_option': colour_variants,
                        })
print(test_df.info())
test_df
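The length mismatch suspected above would surface exactly at this step: pandas refuses to build a DataFrame from lists of different lengths. A small illustration with made-up values, plus one possible workaround (padding with None; joining the colour variants into a single string per product, as suggested earlier, avoids the mismatch entirely):

```python
import pandas as pd

brand = ["A", "B"]                  # one entry per product
colours = ["blue", "red", "white"]  # appended per image, so it can be longer

# Unequal lengths make the constructor raise ValueError:
try:
    pd.DataFrame({"brand": brand, "colour": colours})
except ValueError as exc:
    print("mismatch:", exc)

# One workaround: pad the shorter lists to a common length.
n = max(len(brand), len(colours))
df = pd.DataFrame({
    "brand": brand + [None] * (n - len(brand)),
    "colour": colours + [None] * (n - len(colours)),
})
print(len(df))
</imports>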


This error pops up:

"Kernel Restarting: The kernel for web scraping galeria kaufhof.ipynb appears to have died. It will restart automatically."

Please help me fix this issue.

0 Answers:

There are no answers.