Question

I am building a dataset of IMDb ratings and reviews.
Link
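I want to scrape every rating and review on this page. Some reviews have no rating, so my counts of reviews and ratings differ.
I have tried several ways to handle the missing values, but none of them worked.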

My code:

import requests
from bs4 import BeautifulSoup
import re
import pandas as pd
import numpy as np
import itertools
import string

url = (
    "https://www.imdb.com/title/tt6320628/reviews/_ajax?ref_=undefined&paginationKey={}"
)
key = ""
data = {"user_id": [], "rating":[], "title": [], "review": []}

while True:
    response = requests.get(url.format(key))
    soup = BeautifulSoup(response.content, "html.parser")
    # Find the pagination key
    pagination_key = soup.find("div", class_="load-more-data")
    if not pagination_key:
        break

    for user in (
        [tag.attrs['href'] for tag in soup.find_all('a', attrs={'class': None})
                if tag.attrs['href'].startswith('/user') & tag.attrs['href'].endswith('/')]
    ):
        data["user_id"].append(user[6:-1])

    for rate in (
        [tag.previous_element for tag in soup.find_all('span', attrs={'class': 'point-scale'})]
    ):
      if (rate.__eq__(None)):
        data["rating"].append(None)
      else:
        data["rating"].append(rate)
    
    ## Update the 'key' variable in-order to scrape more reviews
    key = pagination_key["data-key"]
    for title, review in zip(
        soup.find_all(class_="title"), soup.find_all(class_="text show-more__control")
    ):
        data["title"].append(title.get_text(strip=True))
        data["review"].append(review.get_text())
  

df = pd.DataFrame(data)
print(df)

len(data['rating'])
>>>2107

len(data['review'])
>>>2150

Error:

ValueError                                Traceback (most recent call last)
<ipython-input-28-0064f972ba2a> in <module>()
     41 
     42 
---> 43 df = pd.DataFrame(data)
     44 print(df)

3 frames
/usr/local/lib/python3.7/dist-packages/pandas/core/internals/construction.py in extract_index(data)
    395             lengths = list(set(raw_lengths))
    396             if len(lengths) > 1:
--> 397                 raise ValueError("arrays must all be same length")
    398 
    399             if have_dicts:

ValueError: arrays must all be same length

I want ratings that are unavailable to appear as blank values in the data frame.

Answer 1

Unfortunately, a review does not always have a rating, so the logic fails here:

for rate in (
        [tag.previous_element for tag in soup.find_all('span', attrs={'class': 'point-scale'})]
    ):
      if (rate.__eq__(None)):
        data["rating"].append(None)
      else:
        data["rating"].append(rate)

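Whatever you append inside the loop, you end up iterating over fewer items than expected, because find_all only returns the point-scale spans that actually exist on the page. The None branch is never reached.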
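The mismatch can be reproduced without touching the network. The sketch below uses a small, hypothetical snippet of markup (SAMPLE_HTML is invented for illustration and only imitates the class names used in this thread, not the full IMDb page) with three reviews, one of which has no rating:

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: three reviews, the second has no rating.
SAMPLE_HTML = (
    '<div class="lister-item-content">'
    '<span><span>8</span><span class="point-scale">/10</span></span>'
    '<a class="title">Great</a></div>'
    '<div class="lister-item-content">'
    '<a class="title">No score given</a></div>'
    '<div class="lister-item-content">'
    '<span><span>3</span><span class="point-scale">/10</span></span>'
    '<a class="title">Not for me</a></div>'
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Same lookup as in the question: only spans that exist are returned,
# so an unrated review contributes nothing to this list.
ratings = [
    str(tag.previous_element)
    for tag in soup.find_all("span", attrs={"class": "point-scale"})
]
reviews = soup.find_all("div", class_="lister-item-content")

print(len(ratings), len(reviews))  # 2 3
```

The two lists already differ in length after a single page, which is what later makes pd.DataFrame raise the ValueError.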

One possible solution:

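Change the loop so that it iterates over the same number of items as the other lists, for example by looping over each review container and looking for the rating inside it: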

for rate in (
    [tag.select_one('.point-scale').previous_element if tag.select_one('.point-scale') is not None else None 
     for tag in soup.select('.lister-item-content')] 
):
    data["rating"].append(rate)
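The same idea can be taken one step further by reading every field from the review container, so that each review yields exactly one row. This is only a sketch under assumptions: extract_reviews and SAMPLE_HTML are hypothetical names, the markup is a simplified imitation of the page, and the .title and .text selectors are assumed to match one element per review.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: three reviews, the second has no rating.
SAMPLE_HTML = (
    '<div class="lister-item-content">'
    '<span><span>8</span><span class="point-scale">/10</span></span>'
    '<a class="title">Great</a>'
    '<div class="text show-more__control">Loved it.</div></div>'
    '<div class="lister-item-content">'
    '<a class="title">No score given</a>'
    '<div class="text show-more__control">Forgot to rate.</div></div>'
    '<div class="lister-item-content">'
    '<span><span>3</span><span class="point-scale">/10</span></span>'
    '<a class="title">Not for me</a>'
    '<div class="text show-more__control">Too long.</div></div>'
)


def extract_reviews(html):
    """Return one dict per review container; missing fields become None."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.select(".lister-item-content"):
        scale = item.select_one(".point-scale")
        title = item.select_one(".title")
        text = item.select_one(".text")
        rows.append({
            # The score is the text node parsed just before the "/10" span.
            "rating": str(scale.previous_element) if scale is not None else None,
            "title": title.get_text(strip=True) if title is not None else None,
            "review": text.get_text(strip=True) if text is not None else None,
        })
    return rows


rows = extract_reviews(SAMPLE_HTML)
for row in rows:
    print(row)
```

Because every row is built from a single container, the columns cannot drift out of step. Passing the list of dicts to pd.DataFrame(rows) then gives one line per review with an empty rating where none was given.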

Side note:

You can debug this by adding, after:

if not pagination_key:
    break

the following:

if len(soup.select('.lister-item-content, .point-scale')) % 2:
    print(url.format(key))
    break

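Then open the printed URL in your browser, type .lister-item-content, .point-scale into the search box of the Elements tab, and press Enter. If the number of matches is odd, a rating is missing, and you can step through the reviews to see where. Note that an even count does not rule out missing ratings: two unrated reviews on the same page also give an even total.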
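Instead of counting matches by hand, the positions of the unrated reviews can be listed directly. A minimal sketch, assuming the same container and rating classes as above; reviews_missing_rating and SAMPLE_HTML are hypothetical names introduced for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: three reviews, the second has no rating.
SAMPLE_HTML = (
    '<div class="lister-item-content">'
    '<span><span>8</span><span class="point-scale">/10</span></span>'
    '<a class="title">Great</a></div>'
    '<div class="lister-item-content">'
    '<a class="title">No score given</a></div>'
    '<div class="lister-item-content">'
    '<span><span>3</span><span class="point-scale">/10</span></span>'
    '<a class="title">Not for me</a></div>'
)


def reviews_missing_rating(html):
    """Return the zero-based positions of review containers without a rating."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        index
        for index, item in enumerate(soup.select(".lister-item-content"))
        if item.select_one(".point-scale") is None
    ]


missing = reviews_missing_rating(SAMPLE_HTML)
print(missing)  # [1]
```

Unlike the odd/even check, this also catches pages where an even number of ratings is missing.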

Array length error when creating a DataFrame from data scraped with BeautifulSoup