Unable to scrape all the reviews from a webpage

Asked: 2019-07-29 20:29:34

Tags: python python-3.x selenium selenium-webdriver web-scraping

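I've been trying to grab all the reviews from a certain page in Google Maps, but my script below only parses some of them. When I scroll down manually, I can see a spinner while more reviews load, and that spinner is what I used in the script.

Usually I can reach the bottom of a webpage with driver.execute_script("window.scrollTo(0, document.body.scrollHeight);").

However, the content here sits in a pane on the left side of the page rather than in the main window, which is probably why the command above has no effect.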

Webpage address

I've tried the following (it only parses the first few reviews):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

link = "https://www.google.com/maps/place/Pizzeria+Di+Matteo/@40.8512552,14.255779,17z/data=!4m7!3m6!1s0x133b0841ef6e38e5:0xece6ea09987e9baf!8m2!3d40.8512512!4d14.2579677!9m1!1b1"
driver = webdriver.Chrome()
driver.get(link)
wait = WebDriverWait(driver, 10)

while True:  #this block is not working at all
    try:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        wait.until_not(EC.presence_of_element_located((By.CSS_SELECTOR, "[class='section-loading-spinner']")))
    except Exception:
        break


for item in wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, ".section-review-content"))):
    name = WebDriverWait(item,10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "[class='section-review-title'] > span"))).text
    review = WebDriverWait(item,10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "[class='section-review-text']"))).text
    print(name,review)

How can I scrape all the reviews from that page?

3 answers:

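Answer 0 (score: 1)

You can use ActionChains and TouchActions. Note that TouchActions is a Selenium 3 API that was deprecated in Selenium 4, so the script below assumes Selenium 3.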

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver import TouchActions
from selenium.webdriver.common.action_chains import ActionChains

link = "https://www.google.com/maps/place/Pizzeria+Di+Matteo/@40.8512552,14.255779,17z/data=!4m7!3m6!1s0x133b0841ef6e38e5:0xece6ea09987e9baf!8m2!3d40.8512512!4d14.2579677!9m1!1b1"
driver = webdriver.Chrome()
driver.get(link)
wait = WebDriverWait(driver, 10)

item = wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, ".section-review-content")))[-1]
ActionChains(driver).move_to_element(item).perform()
touch_actions = TouchActions(driver)
touch_actions.scroll(0, 8000).perform()
wait = WebDriverWait(driver, 10)

for item in wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, ".section-review-content"))):
    name = WebDriverWait(item, 10).until(
        EC.visibility_of_element_located((By.CSS_SELECTOR, "[class='section-review-title'] > span"))).text
    review = WebDriverWait(item, 10).until(
        EC.visibility_of_element_located((By.CSS_SELECTOR, "[class='section-review-text']"))).text
    print(name, review)

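Answer 1 (score: 0)

Try the following script to get all the reviews on that page. In short, whenever the script finds the spinner, the next line, driver.execute_script("arguments[0].scrollIntoView();",elem), scrolls that spinner into the viewport, which triggers loading of more reviews. It keeps doing this until the spinner no longer appears, meaning there is nothing left to load; the wait then times out and the loop exits.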

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

link = "https://www.google.com/maps/place/Pizzeria+Di+Matteo/@40.8512552,14.255779,17z/data=!4m7!3m6!1s0x133b0841ef6e38e5:0xece6ea09987e9baf!8m2!3d40.8512512!4d14.2579677!9m1!1b1"
driver = webdriver.Chrome()
driver.get(link)
wait = WebDriverWait(driver,10)

while True:
    try:
        elem = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "[class='section-loading-spinner']")))
        driver.execute_script("arguments[0].scrollIntoView();",elem)
    except Exception:
        break


for item in wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, ".section-review-content"))):
    name = WebDriverWait(item,10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "[class='section-review-title'] > span"))).text
    review = WebDriverWait(item,10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "[class='section-review-text']"))).text
    print(name,review)
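
As an alternative to waiting on the spinner, here is a minimal sketch that scrolls the review pane itself and stops once the number of reviews stops growing. This is a different technique from the script above, not part of the original answer. The helper name scroll_until_stable and its parameters are hypothetical; the pane argument is assumed to be the scrollable element that wraps the reviews, which you would have to locate yourself on the live page.

```python
import time


def scroll_until_stable(driver, pane, item_selector, pause=1.0, max_rounds=200):
    """Scroll `pane` to its bottom until the item count stops growing.

    `driver` is a Selenium WebDriver, `pane` is the scrollable container
    element (assumption: you have already located it), and `item_selector`
    is a CSS selector matching one review, e.g. ".section-review-content".
    Returns the last observed number of matching items.
    """
    previous = -1
    for _ in range(max_rounds):
        # Scroll the container, not the window, since the reviews live in a side pane.
        driver.execute_script(
            "arguments[0].scrollTop = arguments[0].scrollHeight;", pane
        )
        time.sleep(pause)  # crude wait for the next batch to load
        # "css selector" is the string value of By.CSS_SELECTOR.
        count = len(driver.find_elements("css selector", item_selector))
        if count == previous:
            return count  # nothing new appeared after the last scroll
        previous = count
    return previous
```

A fixed pause is simple but can stop too early on a slow connection, so a longer pause or an explicit wait on the spinner may be more reliable in practice.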

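Answer 2 (score: 0)

You can also get all the reviews without browser automation.

All you need is the data_id (it looks like this: 0x133b0841ef6e38e5:0xece6ea09987e9baf, and you can take it from the Maps URL you posted).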
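
A minimal sketch of pulling the data_id out of the Maps URL with a regular expression. It assumes the id always follows the "!1s" marker, as it does in the URL from the question; the function name extract_data_id is hypothetical, and other URL layouts may need a different pattern.

```python
import re

# Assumption: the data_id is the "0x...:0x..." pair right after "!1s" in the URL.
DATA_ID_RE = re.compile(r"!1s(0x[0-9a-fA-F]+:0x[0-9a-fA-F]+)")


def extract_data_id(maps_url):
    """Return the data_id found in a Google Maps place URL, or None."""
    match = DATA_ID_RE.search(maps_url)
    return match.group(1) if match else None


link = (
    "https://www.google.com/maps/place/Pizzeria+Di+Matteo/@40.8512552,14.255779,17z/"
    "data=!4m7!3m6!1s0x133b0841ef6e38e5:0xece6ea09987e9baf!8m2!3d40.8512512!4d14.2579677!9m1!1b1"
)
print(extract_data_id(link))
```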

After that, you just need to make a request to the following address: https://www.google.com/async/reviewDialog?hl=en&async=feature_id:0x133b0841ef6e38e5:0xece6ea09987e9baf,sort_by:,next_page_token:,associated_topic:,_fmt:pc

There you will find all the review data along with the next_page_token, which lets you query the next 10 reviews.

In this case, the next_page_token is: EgIICg

So the request URL for the next 10 reviews would be: https://www.google.com/async/reviewDialog?hl=en&async=feature_id:0x133b0841ef6e38e5:0xece6ea09987e9baf,sort_by:,next_page_token:EgIICg,associated_topic:,_fmt:pc
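
The two URLs above differ only in the next_page_token field, so they can be built by one small helper. This is a sketch: the function name build_reviews_url is hypothetical, and it only assembles the URL string. It does not send the request or parse the response, whose format is not described here.

```python
def build_reviews_url(data_id, next_page_token="", hl="en"):
    """Build the reviewDialog URL for one page of reviews.

    An empty next_page_token requests the first page; pass the token found
    in the previous response to request the following page.
    """
    return (
        "https://www.google.com/async/reviewDialog"
        f"?hl={hl}&async=feature_id:{data_id},sort_by:,"
        f"next_page_token:{next_page_token},associated_topic:,_fmt:pc"
    )


data_id = "0x133b0841ef6e38e5:0xece6ea09987e9baf"
first_page = build_reviews_url(data_id)
second_page = build_reviews_url(data_id, next_page_token="EgIICg")
```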

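You can also use a third-party solution such as SerpApi. It is a paid API with a free trial. We handle proxies, solve CAPTCHAs, and parse all the rich structured data for you.

Example Python code (libraries for other languages are also available):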

from serpapi import GoogleSearch

params = {
  "api_key": "secret_api_key",
  "engine": "google_maps_reviews",
  "hl": "en",
  "data_id": "0x133b0841ef6e38e5:0xece6ea09987e9baf",
}

search = GoogleSearch(params)
results = search.get_dict()

Sample JSON output:

"place_info": {
  "title": "Pizzeria Di Matteo",
  "address": "Via dei Tribunali, 94, Napoli NA, Italy",
  "rating": 4.4,
  "reviews": 7082
},
"reviews": [
  {
    "user": {
      "name": "Thomas Bichler",
      "link": "https://www.google.com/maps/contrib/117092186939269423235?hl=en-US&sa=X&ved=2ahUKEwifo9yvmODxAhWHY98KHSDtAakQvvQBegQIARAw",
      "thumbnail": "https://lh3.googleusercontent.com/a-/AOh14GjDBVeLxUSBqv4WKvPuVqMbpZ5cdDfjyTlcSxTgQKw=s40-c-c0x00000000-cc-rp-mo-ba4-br100",
      "local_guide": true,
      "reviews": 164,
      "photos": 88
    },
    "rating": 5,
    "date": "a week ago",
    "snippet": "Great Pizza Fritta! Although the place looks only like a take-away from the outside there is plenty of seating places in the back and upstairs. Don't be afraid to pass by the deep-fryer and make your way to the back. The Pizza Fritta we had easily serves two for lunch. We ordered one per person and couldn't finish it.Nothing fancy here and that's good. Everything is focuessed on the food!",
    "images": [
      "https://lh5.googleusercontent.com/p/AF1QipO0xuP-Fbq1R88lr4ecedPcV6I34uIxqW6tro_m=w100-h100-p-n-k-no"
    ]
  },
  {
    "user": {
      "name": "Andrea Caruso",
      "link": "https://www.google.com/maps/contrib/103428787808835823312?hl=en-US&sa=X&ved=2ahUKEwifo9yvmODxAhWHY98KHSDtAakQvvQBegQIARA_",
      "thumbnail": "https://lh3.googleusercontent.com/a-/AOh14GgYfhxg1E5DZJJ8YJtjTS5lbhYrQ5ekpbak4VKAtg4=s40-c-c0x00000000-cc-rp-mo-ba5-br100",
      "local_guide": true,
      "reviews": 290,
      "photos": 15
    },
    "rating": 5,
    "date": "6 days ago",
    "snippet": "Classic Italian pizza and one of the best ones!I really liked the Margherita and the melanzane.Slightly disappointed by the sausage in the salsiccia and friarielli.Friendly staff and fast service",
    "images": [
      "https://lh5.googleusercontent.com/p/AF1QipOw0-2R5bGQ6NVTimGHOOJjUtBvf5phfA0PYtpe=w100-h100-p-n-k-no",
      "https://lh5.googleusercontent.com/p/AF1QipM2UUdwNMZnogt2DMptnQsfiFmP_b5mgQPhXlUn=w100-h100-p-n-k-no",
      "https://lh5.googleusercontent.com/p/AF1QipPaFQYhx5Z2nM0AxDWus52R0xK-5KJP6ZQuEHJU=w100-h100-p-n-k-no"
    ]
  },
  {
    "user": {
      "name": "asia rizzoli",
      "link": "https://www.google.com/maps/contrib/106734226225183718730?hl=en-US&sa=X&ved=2ahUKEwifo9yvmODxAhWHY98KHSDtAakQvvQBegQIARBQ",
      "thumbnail": "https://lh3.googleusercontent.com/a-/AOh14GhiFJr7dFNeo6Y85fCdmMYyaOLl-2YZSPzVTQj83A=s40-c-c0x00000000-cc-rp-mo-ba4-br100",
      "local_guide": true,
      "reviews": 69,
      "photos": 111
    },
    "rating": 4,
    "date": "3 weeks ago",
    "snippet": "We tried the ‘frittura’ ‘fried street food’ outside the restaurant entrance. The ‘frittata di pasta’ (fried pasta) was Amazing! I totally fell in love with it. The ‘crocché’ (fried mashed patatos) was also good, although nothing special. We did not like the arancini a lot.. but on a balance it is worth a try!",
    "images": [
      "https://lh5.googleusercontent.com/p/AF1QipMamXnfHbILMA5y_6v5qHkXjoDM3_i9hwllAwn8=w100-h100-p-n-k-no",
      "https://lh5.googleusercontent.com/p/AF1QipMwpZ2dqLYnwg2KePI46Qfnk46vDzhXJwCkCjn3=w100-h100-p-n-k-no",
      "https://lh5.googleusercontent.com/p/AF1QipOwfcwtQIDMUq1M4Qcg54mqrIg1fUWnDZkXfDhr=w100-h100-p-n-k-no",
      "https://lh5.googleusercontent.com/p/AF1QipO6I7EgACHnh8UJbfi6H_43m1bPJ4hefqGBn7Zz=w100-h100-p-n-k-no",
      "https://lh5.googleusercontent.com/p/AF1QipMlLbFU0AaGKkvlLymTX9eTJEDlS881TMvE6Lvk=w100-h100-p-n-k-no",
      "https://lh5.googleusercontent.com/p/AF1QipNhhzjpiaxCFE4LCAhXGYAT-jz8dsn-S1SDtsWy=w100-h100-p-n-k-no"
    ]
  },
  ...
]
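
Given a result dictionary shaped like the JSON above, the reviewer name, rating, and text can be read out as follows. This is a sketch that assumes only the keys visible in the sample output ("reviews", "user", "name", "rating", "snippet"); the helper name iter_reviews is hypothetical, and the sample dictionary is an abbreviated copy of the output above rather than a live response.

```python
def iter_reviews(results):
    """Yield (name, rating, snippet) for each review in a results dict."""
    for review in results.get("reviews", []):
        user = review.get("user", {})
        yield user.get("name"), review.get("rating"), review.get("snippet", "")


# Abbreviated stand-in for the dict returned by search.get_dict().
sample = {
    "reviews": [
        {"user": {"name": "Thomas Bichler"}, "rating": 5,
         "snippet": "Great Pizza Fritta!"},
        {"user": {"name": "Andrea Caruso"}, "rating": 5,
         "snippet": "Classic Italian pizza and one of the best ones!"},
    ]
}

for name, rating, snippet in iter_reviews(sample):
    print(name, rating, snippet)
```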

See the documentation for more details.

Disclaimer: I work at SerpApi.