Problem/error scraping URLs from a pandas DataFrame with BeautifulSoup

Date: 2019-06-13 16:29:52

Tags: python pandas dataframe beautifulsoup screen-scraping

I am working with this CSV (https://www.kaggle.com/jtrofe/beer-recipes) and I want to scrape every URL in the DataFrame, but I can't: I run into a problem/error and can't scrape them all. If I try with a single URL it works fine, but with the function there is a problem... Can anyone help me?

Here is my code:

import requests
from bs4 import BeautifulSoup
from time import sleep 


headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36'
}

base = 'https://www.brewersfriend.com'
links = [f'{base}{r}' for r in df['URL']]

while True:
    try:
        r = requests.get(links, headers=headers, stream=False, timeout=8).text
        break
    except:
        if r.status_code == 404:
            print("Client error")
            r.raise_for_status()
        sleep(1)


soup = BeautifulSoup(r, 'html5lib')

rating = soup.find('span', {'itemprop': 'ratingValue'})

DEFAULT_VALUE = 'NaN'

if rating is None:
    rating = DEFAULT_VALUE
    
print(rating.text)

I already know that some pages have no rating, which is why I created DEFAULT_VALUE instead of a number, but maybe that is also a mistake.

There is a DataFrame before this code, but I have not included it here either.

I hope someone can help me!

Thanks a lot.

2 Answers:

Answer 0 (score: 0)

There is a variety of messy stuff going on here. I am not going to address all of it, but one thing I see is that you are trying to print(rating.text). If your rating is the string 'NaN', one error is that you cannot call rating.text on it.
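
A tiny illustration of that failure, separate from your scraper (the empty HTML below just stands in for a page that has no rating span):

from bs4 import BeautifulSoup

# a page without the rating span, like the recipes you mentioned
soup = BeautifulSoup('<html><body></body></html>', 'html5lib')

rating = soup.find('span', {'itemprop': 'ratingValue'})  # -> None
if rating is None:
    rating = 'NaN'
# print(rating.text)  # AttributeError: 'str' object has no attribute 'text'

# instead, take .text while the result is still a Tag and fall back to a plain string
rating = soup.find('span', {'itemprop': 'ratingValue'})
rating = rating.text if rating is not None else 'NaN'
print(rating)  # NaN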

This is not how I would write it, but starting from your initial code:

import pandas as pd
import requests
from bs4 import BeautifulSoup
from time import sleep 


df = pd.read_csv('C:/recipeData/recipeData.csv', encoding = 'ISO-8859-1')
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36'}
base = 'https://www.brewersfriend.com'

links = [f'{base}{r}' for r in df['URL']]
for link in links:
    try:
        r = requests.get(link, headers=headers, stream=False, timeout=8)

        if r.status_code == 404:
            print("Client error")
            r.raise_for_status()
            continue
        else:
            r = r.text     
    except:
        continue


    soup = BeautifulSoup(r, 'html5lib')
    rating = soup.find('span', {'itemprop': 'ratingValue'})
    DEFAULT_VALUE = 'NaN'

    if rating is None:
        rating = DEFAULT_VALUE  # page has no rating span
    else:
        rating = rating.text

    print('%s: %s' %(link,rating))
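
If you would rather keep the results than just print them, a minimal sketch along the same lines could collect everything into a DataFrame (the CSV path is the one from above; the column names and the 'NaN' fallback are just placeholders):

import pandas as pd
import requests
from bs4 import BeautifulSoup

df = pd.read_csv('C:/recipeData/recipeData.csv', encoding='ISO-8859-1')
headers = {'user-agent': 'Mozilla/5.0'}
base = 'https://www.brewersfriend.com'

results = []
for link in (f'{base}{u}' for u in df['URL']):
    try:
        r = requests.get(link, headers=headers, timeout=8)
        r.raise_for_status()
    except requests.RequestException:
        results.append((link, 'NaN'))  # request failed (404, timeout, ...), record a placeholder
        continue

    soup = BeautifulSoup(r.text, 'html5lib')
    tag = soup.find('span', {'itemprop': 'ratingValue'})
    results.append((link, tag.text if tag is not None else 'NaN'))

ratings = pd.DataFrame(results, columns=['URL', 'Rating'])
print(ratings.head())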

Answer 1 (score: 0)

Here is one way to do the whole thing:

import requests, re
import pandas as pd
from bs4 import BeautifulSoup as bs

p = re.compile(r'dataviewToken":"(.*?)"')  # token embedded in the Kaggle page, needed for the data viewer request
p1 = re.compile(r'"rowCount":(\d+)')  # total number of rows in the dataset
results = []
i = 0

with requests.Session() as s:
    r = s.get('https://www.kaggle.com/jtrofe/beer-recipes')   
    token = p.findall(r.text)[0]
    rows = int(p1.findall(r.text)[0])
    data = {"jwe":{"encryptedToken": token},"source":{"type":3,"dataset":{"url":"jtrofe/beer-recipes","tableType":1,"csv":{"fileName":"recipeData.csv","delimiter":",","headerRows":1}}},"select":["BeerID","Name","URL","Style","StyleID","Size(L)","OG","FG","ABV","IBU","Color","BoilSize","BoilTime","BoilGravity","Efficiency","MashThickness","SugarScale","BrewMethod","PitchRate","PrimaryTemp"],"skip":0,"take": rows}
    base = 'https://www.brewersfriend.com'
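    # fetch the full table via Kaggle's data viewer endpoint, then pull out the Name and URL columns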
    r = s.post('https://www.kaggleusercontent.com/services/datasets/kaggle.dataview.v1.DataViewer/GetDataView', json = data).json()
    names, links = zip(*[(row['text'][1], base + row['text'][2]) for row in r['dataView']['rows']])

    for link in links:
        r = s.get(link, headers = {'User-Agent' : 'Mozilla/5.0'})
        if r.status_code == 403:
            rating = 'N/A'
        else:
            soup = bs(r.content, 'lxml')
            rating = soup.select_one('[itemprop=ratingValue]')
            if rating is None:
                rating = 'N/A'
            else:
                rating = rating.text
        row = [names[i], rating]
        results.append(row)
        i+=1

df = pd.DataFrame(results, columns = ['Name', 'Rating'])
print(df.head())
df.to_csv(r'C:\Users\User\Desktop\Data.csv', sep=',', encoding='utf-8-sig',index = False )
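
If you then want the scraped ratings next to the original recipe columns, a sketch of a join (assuming the recipeData.csv path from the other answer, the Data.csv written above, and that recipe names are distinct enough to merge on) could be:

import pandas as pd

recipes = pd.read_csv('C:/recipeData/recipeData.csv', encoding='ISO-8859-1')
ratings = pd.read_csv(r'C:\Users\User\Desktop\Data.csv')

# left join so recipes without a scraped rating keep an empty Rating
merged = recipes.merge(ratings, on='Name', how='left')
print(merged[['Name', 'Style', 'ABV', 'Rating']].head())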