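I am trying to extract information from a web page that requires me to make a selection from a drop-down list; a table with various details then appears based on that selection. The page has a list of option values for the form/list, and I want to iterate over them and extract the table information for each one.
Web page: https://www.mcafee.com/enterprise/en-us/support/product-eol.html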
import scrapy
from scrapy.spiders import Spider

product_names = ['Host Intrusion Prevention', 'McAfee Agent', 'Active Response', 'Database Security']

class McAfee_Spider(scrapy.Spider):
    name = 'McAfee'
    allowed_domains = 'mcafee.com'
    start_urls = 'https://www.mcafee.com/enterprise/en-us/support/product-eol.html'

    for product in product_names:
        def parse(self, response):
            scrapy.FormRequest.from_response(
                response,
                formxpath="//form[@id='selectProductArea']",
                formdata={"SelectProductArea": product},
                clickdata={"type": "select"},
            )

    def parse_table(self, response):
        product = response.xpath('//table[@class="general eoldynamicContent"]//tbody//tr//td[1]').extract()
        version = response.xpath('//table[@class="general eoldynamicContent"]//tbody//tr//td[2]').extract()
        eos_notif = response.xpath('//table[@class="general eoldynamicContent"]//tbody//tr//td[3]').extract()
        eol_date = response.xpath('//table[@class="general eoldynamicContent"]//tbody//tr//td[4]').extract()
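I'm stuck on how to form the XPath for the extraction. All the examples I've studied have a class I can access, but that is not the case here. Also, the site requires me to click an entry in the form/list before the selection-based table appears. I'm using the "FormRequest.from_response" method, but I'm not sure it is set up correctly.
The information I want to extract is the product name, version/model, end-of-support notification, and end-of-life/end-of-support date. I'd like to store the results in a DataFrame first, because I need to combine them with information from other sources, and then export to Excel/CSV.
Expected result for the first product in the list, "Host Intrusion Prevention", on https://www.mcafee.com/enterprise/en-us/support/product-eol.html: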
import pandas as pd

results = {'product': ['McAfee Host Intrusion Prevention', 'McAfee Host Prevention for Linux'],
           'version': ['8.0', '8.0 Patch 6'],
           'eos_notif': ['', ''],
           'eol_date': ['', '']}
pd.DataFrame(results)
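Answer 0 (score: 0)
You are looking in the wrong place. The site does not send any FormRequest when you select something from the list. Instead, it loads everything from https://www.mcafee.com/enterprise/admin/support/eol.xml and only filters what is displayed: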
import scrapy

class McAfee_Spider(scrapy.Spider):
    name = 'McAfee'
    # allowed_domains and start_urls are expected to be lists
    allowed_domains = ['mcafee.com']
    start_urls = ['https://www.mcafee.com/enterprise/admin/support/eol.xml']

    def parse(self, response):
        # Each <product> groups the <element> entries shown for one list choice
        for product in response.xpath('//product'):
            product_title = product.xpath('./@title').get()
            for element in product.xpath('./element'):
                element_title = element.xpath('./@title').get()
                element_version = element.xpath('./@version').get()
                element_eos = element.xpath('./@eos').get()
                element_eos_notification = element.xpath('./@eos_notification').get()
                element_comment = element.xpath('./comment/text()').get()
                yield {
                    'product_title': product_title,
                    'element_title': element_title,
                    'element_version': element_version,
                    'element_eos': element_eos,
                    'element_eos_notification': element_eos_notification,
                    'element_comment': element_comment,
                }
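
Since the goal is a table that can be merged with other sources and exported, here is a minimal sketch of the same extraction using only the standard library. It is not part of the spider above. SAMPLE_XML is a made-up fragment that only mirrors the structure the XPath expressions assume (product/@title, element/@title, @version, @eos, @eos_notification and an optional comment child); the real eol.xml may differ. The helper names extract_rows and rows_to_csv are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Made-up sample that mirrors the structure assumed by the spider's XPath
# expressions; the real eol.xml may be laid out differently.
SAMPLE_XML = """\
<products>
  <product title="Host Intrusion Prevention">
    <element title="McAfee Host Intrusion Prevention" version="8.0"
             eos="" eos_notification="">
      <comment>placeholder comment</comment>
    </element>
    <element title="McAfee Host Prevention for Linux" version="8.0 Patch 6"
             eos="" eos_notification=""/>
  </product>
</products>
"""

FIELDS = [
    "product_title",
    "element_title",
    "element_version",
    "element_eos",
    "element_eos_notification",
    "element_comment",
]


def extract_rows(xml_text):
    """Return one dict per <element>, tagged with its parent product title."""
    root = ET.fromstring(xml_text)
    rows = []
    for product in root.iter("product"):
        for element in product.findall("element"):
            rows.append({
                "product_title": product.get("title"),
                "element_title": element.get("title"),
                "element_version": element.get("version"),
                "element_eos": element.get("eos"),
                "element_eos_notification": element.get("eos_notification"),
                # findtext returns None when there is no <comment> child
                "element_comment": element.findtext("comment"),
            })
    return rows


def rows_to_csv(rows):
    """Serialize the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = extract_rows(SAMPLE_XML)
csv_text = rows_to_csv(rows)
```

The resulting list of dicts can also be passed straight to pd.DataFrame(rows) if you prefer to do the merging and export in pandas.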