Scrapy - Selecting an item in a form and extracting the table that appears

Time: 2019-07-17 15:53:42

Tags: python scrapy

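I am trying to extract information from a web page that requires me to pick a value from a drop-down list; a table with various information then appears based on the selection. I have a list of the form's selection values that I want to iterate over, extracting the table information for each one.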

Web page: https://www.mcafee.com/enterprise/en-us/support/product-eol.html

import scrapy
from scrapy.spiders import Spider

product_names = ['Host Intrusion Prevention','McAfee Agent','Active Response','Database Security']

class McAfee_Spider(scrapy.Spider):
    name = 'McAfee'
    allowed_domains = ['mcafee.com']
    start_urls = ['https://www.mcafee.com/enterprise/en-us/support/product-eol.html']

    for product in product_names:
        def parse(self, response):
            scrapy.FormRequest.from_response(
            response,
            formxpath="//form[@id='selectProductArea']",
            formdata={
                "SelectProductArea" : product },
            clickdata = { "type": "select" },
            )

        def parse_table(self, response):
            product = response.xpath("//table[@class='general eoldynamicContent']//tbody//tr//td[1]").extract()
            version = response.xpath("//table[@class='general eoldynamicContent']//tbody//tr//td[2]").extract()
            eos_notif = response.xpath("//table[@class='general eoldynamicContent']//tbody//tr//td[3]").extract()
            eol_date = response.xpath("//table[@class='general eoldynamicContent']//tbody//tr//td[4]").extract()

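I am stuck on how to form the XPath for the extraction. All the examples I have studied have classes I can access, but that is not the case here. In addition, the site requires me to click an entry in the form/list before the selection-based table appears. I am using the "FormRequest.from_response" method, but I am not sure whether I have set it up correctly.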

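The information I want to extract is the product name, the version/model, the end-of-support notification, and the end-of-life/end-of-support date. I want to store the results in a data frame first, because I need to combine them with information from other sources, and then export to Excel/CSV.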

Expected results for the first product in the list, "Host Intrusion Prevention", at https://www.mcafee.com/enterprise/en-us/support/product-eol.html:

import pandas as pd
results = {'product':['McAfee Host Intrusion Prevention', 'McAfee Host Prevention for Linux'],
          'version':['8.0','8.0 Patch 6'],
          'eos_notif':['',''],
          'eol_date':['','']}
pd.DataFrame(results)

1 answer:

Answer 0 (score: 0)

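You are searching in the wrong place. The website above does not send any form request after you select something in the list. Instead, it loads everything from https://www.mcafee.com/enterprise/admin/support/eol.xml and only displays the data: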

import scrapy


class McAfee_Spider(scrapy.Spider):
    name = 'McAfee'
    allowed_domains = ['mcafee.com']
    start_urls = ['https://www.mcafee.com/enterprise/admin/support/eol.xml']

    def parse(self, response):
        for product in response.xpath('//product'):
            product_title = product.xpath('./@title').get()
            for element in product.xpath('./element'):
                element_title = element.xpath('./@title').get()
                element_version = element.xpath('./@version').get()
                element_eos = element.xpath('./@eos').get()
                element_eos_notification = element.xpath('./@eos_notification').get()
                element_comment = element.xpath('./comment/text()').get()

                # Emit one item per <element>, tagged with its parent product.
                yield {
                    'product_title': product_title,
                    'element_title': element_title,
                    'element_version': element_version,
                    'element_eos': element_eos,
                    'element_eos_notification': element_eos_notification,
                    'element_comment': element_comment,
                }
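
Since the question asks for tabular output that can be combined with other sources and exported to CSV, here is a minimal offline sketch of the same flattening step using only the standard library. The sample XML is hypothetical: it is shaped after the XPaths used in the spider above (product/@title, element/@title, @version, @eos, @eos_notification, comment), and the real eol.xml may differ. The root tag name, the sample comment text, and the helper names flatten and to_csv are assumptions made for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample shaped after the XPaths in the spider above.
# The real eol.xml may use different attributes or nesting.
SAMPLE_XML = """\
<products>
  <product title="Host Intrusion Prevention">
    <element title="McAfee Host Intrusion Prevention" version="8.0"
             eos="" eos_notification="">
      <comment>Example comment</comment>
    </element>
    <element title="McAfee Host Prevention for Linux" version="8.0 Patch 6"
             eos="" eos_notification="" />
  </product>
</products>
"""

FIELDS = [
    "product_title", "element_title", "element_version",
    "element_eos", "element_eos_notification", "element_comment",
]


def flatten(xml_text):
    """Return one dict per <element>, tagged with its parent product title."""
    root = ET.fromstring(xml_text)
    rows = []
    for product in root.iter("product"):
        for element in product.findall("element"):
            rows.append({
                "product_title": product.get("title"),
                "element_title": element.get("title"),
                "element_version": element.get("version"),
                "element_eos": element.get("eos"),
                "element_eos_notification": element.get("eos_notification"),
                # findtext returns None when there is no <comment> child.
                "element_comment": element.findtext("comment"),
            })
    return rows


def to_csv(rows):
    """Serialize the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = flatten(SAMPLE_XML)
csv_text = to_csv(rows)
```

If pandas is installed, pd.DataFrame(rows) builds a data frame from the same list of dicts, which may be more convenient for merging with data from other sources before exporting.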