Question

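I've written a script using Scrapy to fetch the answers to different questions from a web page. The problem is that the answers are outside the elements I'm currently targeting. I know I could grab them with .next_sibling if I were using BeautifulSoup, but with Scrapy I can't figure out how.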

The HTML elements look like this:

  <p>
   <b>
    <span class="blue">
     Q:1-The NIST Information Security and Privacy Advisory Board (ISPAB) paper "Perspectives on Cloud Computing and Standards" specifies potential advantages and disdvantages of virtualization. Which of the following disadvantages does it include?
    </span>
    <br/>
    Mark one answer:
   </b>
   <br/>
   <input name="quest1" type="checkbox" value="1"/>
   It initiates the risk that malicious software is targeting the VM environment.
   <br/>
   <input name="quest1" type="checkbox" value="2"/>
   It increases overall security risk shared resources.
   <br/>
   <input name="quest1" type="checkbox" value="3"/>
   It creates the possibility that remote attestation may not work.
   <br/>
   <input name="quest1" type="checkbox" value="4"/>
   All of the above
  </p>

This is what I've tried so far:

import requests
from scrapy import Selector

url = "https://www.test-questions.com/csslp-exam-questions-01.php"

res = requests.get(url,headers={"User-Agent":"Mozilla/5.0"})
sel = Selector(res)
for item in sel.css("[name^='quest']::text").getall():
    print(item)

The script above prints nothing when executed, and it doesn't raise any error either.

One of the expected outputs for the HTML elements pasted above is:

It initiates the risk that malicious software is targeting the VM environment.

I'm only after a solution that uses CSS selectors.

How can I get the answers to the different questions from that site?

Answer 1

The following combination of simple CSS selectors and Python list operations solves this task:

import scrapy
from scrapy.crawler import CrawlerProcess

class QuestionsSpider(scrapy.Spider):
    name = "TestSpider"
    start_urls = ["https://www.test-questions.com/csslp-exam-questions-01.php"]

    def parse(self, response):
        # select <p> tag elements with questions/answers
        questions_p_tags = [p for p in response.css("form p")
                            if '<span class="blue"' in p.extract()]
        for p in questions_p_tags:
            # select question and answer variants inside every <p> tag
            item = dict()
            item["question"] = p.css("span.blue::text").extract_first()
            # following list comprehension - select all text, filter empty text elements
            # and select last 4 text elements as answer variants
            item["variants"] = [variant.strip() for variant in p.css("::text").extract() if variant.strip()][-4:]
            yield item

if __name__ == "__main__":
    c = CrawlerProcess({'USER_AGENT':'Mozilla/5.0'})
    c.crawl(QuestionsSpider)
    c.start()

Answer 2

You can try getting the text that follows the tag with following-sibling::text(). Check this example:

>>> sel.css("[name^='quest']").xpath('./following-sibling::text()').extract()
[u'\n   It initiates the risk that malicious software is targeting the VM environment.\n   ', u'\n   ', u'\n   It increases overall security risk shared resources.\n   ', u'\n   ', u'\n   It creates the possibility that remote attestation may not work.\n   ', u'\n   ', u'\n   All of the above\n  ', u'\n   It increases overall security risk shared resources.\n   ', u'\n   ', u'\n   It creates the possibility that remote attestation may not work.\n   ', u'\n   ', u'\n   All of the above\n  ', u'\n   It creates the possibility that remote attestation may not work.\n   ', u'\n   ', u'\n   All of the above\n  ', u'\n   All of the above\n  ']

Answer 3

You currently can't do this with CSS alone.

cssselect, the underlying library behind response.css(), does not support selecting sibling text.

At most, you can select the first following element:

>>> selector.css('[name^="quest"] + *').get()
'<br>'

Unable to reach certain text outside the targeted elements
