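Scraping: unable to access information on a web page

Time: 2016-05-20 10:01:17

Tags: python web-scraping beautifulsoup html-parsing

I am scraping some information from this URL: https://www.rockethub.com/projects/34210-lunar-lion-the-first-ever-university-led-mission-to-the-moon#description-tab

Everything works fine until I try to scrape the description. I have tried repeatedly, but so far I have failed; I can't seem to reach that content. Here is my code: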

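from urllib.request import urlopen  # Python 3; the original post used Python 2's urllib.urlopen
from bs4 import BeautifulSoup

html = urlopen("https://www.rockethub.com/projects/34210-lunar-lion-the-first-ever-university-led-mission-to-the-moon")
tree = BeautifulSoup(html, "lxml")
description = tree.find('div', {'id': 'description_section', 'class': 'description-section'})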

Do you have any suggestions?

3 answers:

Answer 0: (score: 1)

You need to make an additional request to get the description. Here is a complete working example using requests + BeautifulSoup:

import requests
from bs4 import BeautifulSoup

url = "https://www.rockethub.com/projects/34210-lunar-lion-the-first-ever-university-led-mission-to-the-moon/"
with requests.Session() as session:
    session.headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36"
    }

    # get the token
    response = session.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    token = soup.find("meta", {"name": "csrf-token"})["content"]

    # get the description
    description_url = url + "description"
    response = session.get(description_url, headers={"X-CSRF-Token": token, "X-Requested-With": "XMLHttpRequest"})

    soup = BeautifulSoup(response.content, "html.parser")
    description = soup.find('div', {'id':'description_section', 'class': 'description-section'})
    print(description.get_text(strip=True))
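
The two parsing steps in this answer can be tried offline, without contacting the site. The following is a minimal sketch that runs the same lookups (the csrf-token meta tag and the description_section div) against small inline HTML strings. The HTML snippets and the helper names extract_csrf_token and extract_description are made up for illustration; the real responses from rockethub.com may be shaped differently.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-ins for the two responses; not the site's real markup.
PAGE_HTML = """
<html><head><meta name="csrf-token" content="abc123"></head>
<body><p>Project page</p></body></html>
"""

FRAGMENT_HTML = """
<div id="description_section" class="description-section">
  <p>Placeholder description text.</p>
</div>
"""


def extract_csrf_token(html):
    """Return the content of <meta name="csrf-token">, or None if it is absent."""
    soup = BeautifulSoup(html, "html.parser")
    meta = soup.find("meta", {"name": "csrf-token"})
    return meta.get("content") if meta else None


def extract_description(html):
    """Return the text of the description div, or None if the div is absent."""
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div", {"id": "description_section"})
    return div.get_text(strip=True) if div else None


print(extract_csrf_token(PAGE_HTML))
print(extract_description(FRAGMENT_HTML))
```

Returning None instead of indexing directly avoids a TypeError or AttributeError when the tag is missing, which is the situation described in the question.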

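Answer 1: (score: 0)

I use the XML package for web scraping, and I can't get the description section either, just as you described with BeautifulSoup.

However, if you only want to scrape this one page, you can download the page. Then: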

library(XML)
page <- htmlTreeParse("Lunar Lion - the first ever university-led mission to the Moon_RocketHub.html", useInternalNodes = TRUE, encoding = "utf8")

unlist(xpathApply(page, '//div[@id="description_section"]', xmlValue))

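I tried downloading the page with R code, but I couldn't find description_section that way either.

URL <- "https://www.rockethub.com/projects/34210-lunar-lion-the-first-ever-university-led-mission-to-the-moon"

download.file(URL, "page.html", mode = "w")

Maybe we have to add some options to the download.file function. I hope some HTML experts can help.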
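
Before adjusting download options, it may help to check whether the downloaded copy contains the element at all. Here is a minimal Python sketch using only the standard library's html.parser (not BeautifulSoup or the XML package). The class name IdFinder, the helper has_element_with_id, and the sample strings are hypothetical; with a real file you would read its contents and pass them in.

```python
from html.parser import HTMLParser


class IdFinder(HTMLParser):
    """Records whether any start tag carries the given id attribute."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if dict(attrs).get("id") == self.target_id:
            self.found = True


def has_element_with_id(html_text, target_id):
    """Return True if some element in html_text has id == target_id."""
    finder = IdFinder(target_id)
    finder.feed(html_text)
    return finder.found


# Made-up sample documents, one with the div and one without
with_div = '<html><body><div id="description_section">text</div></body></html>'
without_div = '<html><body><div id="other">text</div></body></html>'

print(has_element_with_id(with_div, "description_section"))
print(has_element_with_id(without_div, "description_section"))
```

If this returns False for the saved file, the description was not in the HTML the server sent, which would be consistent with the first answer's point that it has to be fetched with a separate request.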

Answer 2: (score: 0)

I found out how to scrape it with R:

library("rvest")

url <- "https://www.rockethub.com/projects/34210-lunar-lion-the-first-ever-university-led-mission-to-the-moon/description"

# read_html() replaces the deprecated html(); html_nodes() takes only the selector
url %>%
  read_html() %>%
  html_nodes(xpath = '//div[@id="description_section"]') %>%
  html_text()
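
Both this answer and the first one rely on the /description path under the project URL. For readers following the Python code in the question, here is a small sketch for deriving that URL from the one the question starts with, which carries a #description-tab fragment. The helper name description_url is hypothetical, and the sketch only builds the string; it does not contact the site.

```python
from urllib.parse import urlsplit, urlunsplit


def description_url(project_url):
    """Drop any query and #fragment, then append /description to the path."""
    parts = urlsplit(project_url)
    path = parts.path.rstrip("/") + "/description"
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


question_url = (
    "https://www.rockethub.com/projects/"
    "34210-lunar-lion-the-first-ever-university-led-mission-to-the-moon"
    "#description-tab"
)
print(description_url(question_url))
```

Stripping the trailing slash before appending means the helper gives the same result whether or not the project URL ends in "/".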