Question

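I am scraping restaurant information (e.g. name, address) from different websites. I usually use two main approaches: functions from the rvest package (R) and functions from the BeautifulSoup module (Python). Most of the time I manage to collect the information (when R fails, Python usually works), but sometimes, after reading the web page, I cannot find the information in the nodes I select. Here are two websites that show this behaviour: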

Australia: https://www.yellowpages.com.au/search/listings?clue=restaurant&locationClue=All+States&pageNumber=2&referredBy=www.yellowpages.com.au&&eventType=pagination

Tunisia: http://www.pagesjaunes.com.tn/List?page=2

This is the R code I use to get the restaurant names (the example covers a single page only):

library(rvest)
library(xml2)

# AUSTRALIA
webpage <- read_html(x = "https://www.yellowpages.com.au/search/listings?clue=restaurant&locationClue=All+States&pageNumber=2&referredBy=www.yellowpages.com.au&&eventType=pagination")

webpage_name <- webpage %>%
  html_nodes("a.listing-name") %>% 
  html_text(trim = TRUE)

webpage_name

# TUNISIA
webpage <- read_html(x = "http://www.pagesjaunes.com.tn/List?page=2")

webpage_name <- webpage %>%
  html_nodes("div.result-info-block") %>% 
  html_nodes("a") %>% 
  html_text(trim = TRUE)

webpage_name

As you can see, the webpage_name object is empty. Using Python does not change the result:

from bs4 import BeautifulSoup
import requests

# AUSTRALIA
webpage = requests.get("https://www.yellowpages.com.au/search/listings?clue=restaurant&locationClue=All+States&pageNumber=2&referredBy=www.yellowpages.com.au&&eventType=pagination")
soup = BeautifulSoup(webpage.content, "html.parser")
soup_names = soup.find_all("a", {"class": "listing-name"})

print(soup_names)

# TUNISIA
# requests needs an explicit scheme (http://); without it the call raises MissingSchema
webpage = requests.get("http://www.pagesjaunes.com.tn/List?page=2")
soup = BeautifulSoup(webpage.content, "html.parser")
soup_names = soup.find_all("div", {"class": "result-info-block"})

print(soup_names)

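In the Tunisian case, I think the problem is that the URL does not change with the category I select: every category just shows "List?page=2" after the main URL. But then why can I see the list of results on the website?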

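For the Australian page, I really don't know what is happening. These are the methods I generally use for every website, but sometimes I run into these problems.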
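One way to narrow this down is to inspect the raw response before parsing it. The sketch below is only a diagnostic aid, not a fix: it uses the standard-library html.parser rather than BeautifulSoup, and the function name diagnose and its return labels are hypothetical names chosen for illustration. It assumes the usual causes of an empty result (a non-200 response, or markup that is only created later by JavaScript), which may or may not apply to these two sites.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags that match a tag name and carry a given CSS class."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.matches = 0   # elements matching tag + class
        self.scripts = 0   # <script> tags seen in the document

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.scripts += 1
        # The class attribute may be missing or valueless, hence the "or".
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.css_class in classes:
            self.matches += 1


def diagnose(status_code, html, tag, css_class):
    """Return a rough (hypothetical) label for why a selector may be empty."""
    if status_code != 200:
        # The server did not return the normal page (e.g. blocked or error).
        return "blocked-or-error"
    counter = ClassCounter(tag, css_class)
    counter.feed(html)
    if counter.matches:
        return "present"
    if counter.scripts:
        # No matching node in the raw HTML, but scripts exist that may
        # build the listing in the browser after the page loads.
        return "absent-maybe-javascript"
    return "absent"
```

With requests, the call would look like diagnose(webpage.status_code, webpage.text, "a", "listing-name"). A result other than "present" suggests the names are simply not in the HTML that requests or read_html receives, so no selector can find them.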

Do you know how to proceed? Thanks!

R/python - Scraping web page information with rvest and/or BeautifulSoup (sometimes) returns no results

0 answers: