I am trying to scrape the email ID from this page using Scrapy, Python and RegEx: https://allevents.in/bangalore/project-based-summer-training-program/1851553244864163.
To do this, I wrote the following commands, each of which returns an empty list:
response.xpath('//a/*[@href = "#"]/text()').extract()
response.xpath('//a/@onclick').extract()
response.xpath('//a/@onclick/text()').extract()
response.xpath('//span/*[@class = ""]/a/text()').extract()
response.xpath('//a/@onclick/text()').extract()
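Apart from this, I also planned to use RegEx to pull the email ID out of the description. For that, I wrote a command that scrapes the description, but it returns everything except the email ID at the end of the description: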
response.xpath('//*[@property = "schema:description"]/text()').extract()
The output of the above command is:
[u'\n\t\t\t\t\t\t\t "Your Future is created by what you do today Let\'s shape it With Summer Training Program \u2026\u2026\u2026 ."', u'\n', u'\nWith ever changing technologies & methodologies, the competition today is much greater than ever before. The industrial scenario needs constant technical enhancements to cater to the rapid demands.', u'\nHT India Labs is presenting Summer Training Program to acquire and clear your concepts about your respective fields. ', u'\nEnroll on ', u' and avail Early bird Discounts.', u'\n', u'\nFor Registration or Enquiry call 9911330807, 7065657373 or write us at ', u'\t\t\t\t\t\t']
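The gaps in this output (for example between 'Enroll on ' and ' and avail Early bird Discounts.') are consistent with how XPath works: `/text()` returns only the text nodes that are direct children of the matched element, so anything wrapped in a child element is skipped. The sketch below illustrates this on a made-up fragment. The markup and the placeholder address are assumptions, not copied from the page, and the standard-library ElementTree is used only so the example runs without Scrapy. In Scrapy, the descendant text would be selected with `//text()` instead of `/text()`.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the address sits inside a child <span>,
# not directly inside the description <div>.
fragment = (
    '<div property="schema:description">'
    'write us at <span>someone | example ! com</span> for details'
    '</div>'
)
div = ET.fromstring(fragment)

# Equivalent of XPath "/text()": only the div's own text nodes
# (the text before the first child plus the tail after each child).
direct_text_nodes = [div.text] + [child.tail for child in div]

# Equivalent of XPath "//text()": every text node below the div.
all_text = "".join(div.itertext())

# Text of the child element that the direct-children query skipped.
span_text = div.find("span").text

print(direct_text_nodes)
print(all_text)
print(span_text)
```

The first list has a hole exactly where the `<span>` was, which matches the shape of the output above.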
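Answer 0 (score: 1)
I don't know much about the onclick event attribute. My guess is that when it is set to return false, a plain HTTP request usually skips that part. However, if you try the approach shown below, you may get a result very close to what you want.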
import requests
from scrapy import Selector

res = requests.get("https://allevents.in/bangalore/project-based-summer-training-program/1851553244864163")
# Build the selector from the response body text
sel = Selector(text=res.text)
for items in sel.css("div[property='schema:description']"):
    # The first <span> inside the description holds the obfuscated address
    emailid = items.css("span::text").extract_first()
    print(emailid)
Output:
htindialabsworkshops | gmail ! com
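
The scraped value is still obfuscated. A minimal sketch of turning it into a usable address with RegEx follows, assuming the page replaces "@" with " | " and "." with " ! " (an inference from the output above, not something the site documents). The helper name `deobfuscate_email` is hypothetical.

```python
import re


def deobfuscate_email(raw):
    """Turn 'name | domain ! tld' into 'name@domain.tld'.

    Assumes '|' stands for '@' and '!' stands for '.', with optional
    whitespace around each marker.
    """
    text = raw.strip()
    text = re.sub(r"\s*\|\s*", "@", text)  # ' | ' -> '@'
    text = re.sub(r"\s*!\s*", ".", text)   # ' ! ' -> '.'
    return text


email = deobfuscate_email("htindialabsworkshops | gmail ! com")
print(email)  # htindialabsworkshops@gmail.com
```

If the site uses different marker characters elsewhere, the two patterns would need to be adjusted accordingly.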