Scraping a web page that contains ::before

Time: 2017-11-29 20:19:30

Tags: python css web-scraping beautifulsoup

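My problem is that when I scrape HTML with bs4, I cannot capture the content that contains ::before.

I want to know which SDGs a company contributes to on this page: https://www.unglobalcompact.org/participation/report/cop/create-and-submit/active/395091 However, the check marks are not visible in the source code.

What should I do? Or what can I use to scrape this from the website?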

1 answer:

Answer 0: (score: 0)

You do not need the ::before part at all. Selected and unselected elements have different classes: selected_question for selected ones and advanced_question for unselected ones.
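To illustrate why the class is enough, here is a minimal sketch on an inline snippet. The markup in SAMPLE_HTML is hypothetical and only modelled on the class names mentioned above; the real page may be structured differently. The built-in html.parser is used here only because the snippet is small and well-formed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the class names mentioned in the answer;
# the structure of the real page may differ.
SAMPLE_HTML = """
<ul class="questionnaire">
  <li class="question_group">Which SDGs do the activities address?</li>
  <li class="advanced_question">SDG 1: End poverty</li>
  <li class="selected_question">SDG 3: Ensure healthy lives</li>
  <li class="advanced_question">SDG 4: Quality education</li>
</ul>
"""

# html.parser ships with Python and is sufficient for this small snippet.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# The check mark drawn by ::before is generated by CSS and never appears in
# the HTML, but the class that triggers it does, so select on the class.
selected = [li.get_text(strip=True) for li in soup.select("li.selected_question")]
print(selected)
```

The point of the sketch is that a ::before pseudo-element is rendered by the browser from a stylesheet rule, so a scraper only needs the class that the rule is attached to.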

You can parse it with something like the following:

from bs4 import BeautifulSoup
import requests


url = "https://www.unglobalcompact.org/participation/report/cop/create-and-submit/active/395091"
response = requests.get(url)

soup = BeautifulSoup(response.content, "lxml")

# Each question is an <li class="question_group"> directly under the questionnaire list.
questions = soup.select("ul.questionnaire > li.question_group")
for question in questions:
    question_text = question.get_text(strip=True)
    print(question_text)

    # The answers are the <li> elements that follow the question in the same list.
    answers = question.find_next_siblings("li")
    for answer in answers:
        answer_text = answer.get_text(strip=True)
        # A selected answer carries the selected_question class.
        is_selected = "selected_question" in answer.get("class", [])

        print(answer_text, is_selected)
    print("-----")

This prints:

Which of the following Sustainable Development Goals (SDGs) do the activities described in your COP address? [Select all that apply]
SDG 1: End poverty in all its forms everywhere False
SDG 2: End hunger, achieve food security and improved nutrition and promote sustainable agriculture False
SDG 3: Ensure healthy lives and promote well-being for all at all ages True
SDG 4: Ensure inclusive and equitable quality education and promote lifelong learning opportunities for all False
...

Note the True printed for the selected answers.

I have also noticed that this code does not work properly if a different parser is chosen in place of lxml.
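If only the selected answers are of interest, the loop could be adapted roughly as follows. This is a sketch under assumptions, not the answer's own code: selected_by_question is a hypothetical helper, the sample markup is invented for the demonstration, and stopping at the next question_group element assumes that questions and answers are siblings in one list. The demonstration passes html.parser only because the inline snippet is small and well-formed; the default stays lxml, as in the answer.

```python
from bs4 import BeautifulSoup


def selected_by_question(html, parser="lxml"):
    """Map each question's text to the list of its selected answers.

    Hypothetical helper: assumes questions and answers are sibling <li>
    elements, as the selectors in the answer suggest.
    """
    soup = BeautifulSoup(html, parser)
    result = {}
    for question in soup.select("ul.questionnaire > li.question_group"):
        picked = []
        for sibling in question.find_next_siblings("li"):
            classes = sibling.get("class", [])
            if "question_group" in classes:
                break  # the next question starts here
            if "selected_question" in classes:
                picked.append(sibling.get_text(strip=True))
        result[question.get_text(strip=True)] = picked
    return result


# Invented two-question snippet, used only to exercise the helper offline.
sample = """
<ul class="questionnaire">
  <li class="question_group">Q1</li>
  <li class="selected_question">A</li>
  <li class="advanced_question">B</li>
  <li class="question_group">Q2</li>
  <li class="advanced_question">C</li>
  <li class="selected_question">D</li>
</ul>
"""
print(selected_by_question(sample, parser="html.parser"))
```

For the real page, the same helper could be called with response.content from the request shown in the answer; whether the grouping matches that page's markup would need to be checked against its actual HTML.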