I am building a dataset of IMDB ratings and reviews.
Link
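I want to scrape all the ratings and reviews on this page. Some reviews have no rating, so I end up with different counts for reviews and ratings.
I have tried various ways of handling the missing values, but none of them worked.
My code: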
    import requests
    from bs4 import BeautifulSoup
    import re
    import pandas as pd
    import numpy as np
    import itertools
    import string

    url = (
        "https://www.imdb.com/title/tt6320628/reviews/_ajax?ref_=undefined&paginationKey={}"
    )
    key = ""
    data = {"user_id": [], "rating": [], "title": [], "review": []}

    while True:
        response = requests.get(url.format(key))
        soup = BeautifulSoup(response.content, "html.parser")
        # Find the pagination key
        pagination_key = soup.find("div", class_="load-more-data")
        if not pagination_key:
            break

        for user in (
            [tag.attrs['href'] for tag in soup.find_all('a', attrs={'class': None})
             if tag.attrs['href'].startswith('/user') & tag.attrs['href'].endswith('/')]
        ):
            data["user_id"].append(user[6:-1])

        for rate in (
            [tag.previous_element for tag in soup.find_all('span', attrs={'class': 'point-scale'})]
        ):
            if (rate.__eq__(None)):
                data["rating"].append(None)
            else:
                data["rating"].append(rate)

        ## Update the 'key' variable in-order to scrape more reviews
        key = pagination_key["data-key"]
        for title, review in zip(
            soup.find_all(class_="title"), soup.find_all(class_="text show-more__control")
        ):
            data["title"].append(title.get_text(strip=True))
            data["review"].append(review.get_text())

    df = pd.DataFrame(data)
    print(df)
    >>> len(data['rating'])
    2107
    >>> len(data['review'])
    2150
Error:
    ValueError                                Traceback (most recent call last)
    <ipython-input-28-0064f972ba2a> in <module>()
         41
         42
    ---> 43 df = pd.DataFrame(data)
         44 print(df)

    3 frames
    /usr/local/lib/python3.7/dist-packages/pandas/core/internals/construction.py in extract_index(data)
        395             lengths = list(set(raw_lengths))
        396             if len(lengths) > 1:
    --> 397                 raise ValueError("arrays must all be same length")
        398
        399             if have_dicts:

    ValueError: arrays must all be same length
I want ratings that are unavailable to appear as blank values in the DataFrame.
Answer 0 (score: 1)
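Unfortunately, a rating is not always present, so the logic here fails:

    for rate in (
        [tag.previous_element for tag in soup.find_all('span', attrs={'class': 'point-scale'})]
    ):
        if (rate.__eq__(None)):
            data["rating"].append(None)
        else:
            data["rating"].append(rate)

The list comprehension only collects the point-scale spans that actually exist on the page, so a review without a rating contributes nothing to the loop and the None branch is never reached for it. Whatever you append, you end up iterating over fewer items than there are reviews.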
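To make the mismatch concrete without touching the network, here is a minimal sketch using plain Python lists. The reviews list is a hypothetical stand-in for three scraped reviews, not real IMDB data; it only illustrates why collecting the ratings that exist gives a shorter list than looking up the rating once per review.

```python
# Hypothetical stand-in for three scraped reviews; the second one has no rating.
reviews = [
    {"title": "Great", "rating": "8"},
    {"title": "No score"},
    {"title": "Poor", "rating": "3"},
]

# Analogous to find_all() on the rating span: only ratings that exist are collected,
# so the unrated review silently disappears.
found_only = [r["rating"] for r in reviews if "rating" in r]

# Analogous to looking inside every review: one entry per review, None when absent.
per_review = [r.get("rating") for r in reviews]

print(len(found_only), len(per_review))  # prints: 2 3
print(per_review)                        # prints: ['8', None, '3']
```

The second list has the same length as the list of reviews, which is what pd.DataFrame needs when every column is built as a separate list.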
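One possible solution:
Make sure the loop runs over the same number of items as the other lists, for example by iterating over every review container and looking up the rating inside it: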
    for rate in (
        [tag.select_one('.point-scale').previous_element if tag.select_one('.point-scale') is not None else None
         for tag in soup.select('.lister-item-content')]
    ):
        data["rating"].append(rate)
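Taking the same idea one step further, the rating, title and review text can all be read from within each review container, so every row is complete by construction. The following is only a sketch under assumptions: SAMPLE_HTML is hypothetical, heavily simplified markup that reuses the class names from the code above, and parse_reviews is an illustrative helper name, not part of any library. The real page structure may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup reusing the class names from the question.
SAMPLE_HTML = """
<div class="lister-item-content">
  <span class="rating-other-user-rating"><span>8</span><span class="point-scale">/10</span></span>
  <a class="title" href="/review/rw1/">Great film</a>
  <div class="text show-more__control">Loved it.</div>
</div>
<div class="lister-item-content">
  <a class="title" href="/review/rw2/">No score given</a>
  <div class="text show-more__control">Forgot to rate.</div>
</div>
"""


def parse_reviews(html):
    """Return one dict per review container; missing fields become None."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.select(".lister-item-content"):
        scale = item.select_one(".point-scale")
        title = item.select_one(".title")
        text = item.select_one(".text.show-more__control")
        rows.append({
            # Same trick as above: the score is the text node just before the scale span.
            "rating": str(scale.previous_element) if scale is not None else None,
            "title": title.get_text(strip=True) if title is not None else None,
            "review": text.get_text() if text is not None else None,
        })
    return rows


rows = parse_reviews(SAMPLE_HTML)
print(rows)
```

A list of dicts like this can be passed straight to pd.DataFrame(rows); because each row is built from a single container, the columns cannot drift out of step, and a missing rating shows up as an empty value.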
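Side note:
You can debug this by adding, right after:

    if not pagination_key:
        break

the following:

    if len(soup.select('.lister-item-content, .point-scale')) % 2:
        print(url.format(key))
        break

Then open the printed URL in a browser, type .lister-item-content, .point-scale into the find box of the Elements tab, and press Return. If the number of matches is odd, a rating is missing, and you can cycle through the reviews to see where.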
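One caveat with the parity check: a page on which an even number of reviews lack a rating still produces an even match count. As a hedged alternative, the sketch below lists the titles of the unrated reviews directly. Both titles_missing_rating and the SAMPLE markup are hypothetical and exist only for illustration.

```python
from bs4 import BeautifulSoup


def titles_missing_rating(html):
    """Return the titles of review containers that have no point-scale span."""
    soup = BeautifulSoup(html, "html.parser")
    missing = []
    for item in soup.select(".lister-item-content"):
        if item.select_one(".point-scale") is None:
            title = item.select_one(".title")
            missing.append(title.get_text(strip=True) if title is not None else "")
    return missing


# Hypothetical page fragment: three reviews, two of them unrated.
SAMPLE = (
    '<div class="lister-item-content"><a class="title">First</a></div>'
    '<div class="lister-item-content"><span>7</span>'
    '<span class="point-scale">/10</span><a class="title">Second</a></div>'
    '<div class="lister-item-content"><a class="title">Third</a></div>'
)

# The combined selector matches 3 containers + 1 rating = 4 elements (even),
# so the parity check would not flag this page.
match_count = len(BeautifulSoup(SAMPLE, "html.parser").select(".lister-item-content, .point-scale"))
print(match_count)

print(titles_missing_rating(SAMPLE))
```

Calling the helper on response.content inside the scraping loop would show which reviews on each page have no rating.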