I have a list of URLs in a txt file, and also a contact_page_patterns list. I only need to check those specific pages when scraping a site for email addresses.
Please suggest some workable approaches. I am new to Python and Scrapy. Thanks in advance.
import re
import scrapy

from myproject.items import FinaltestItem  # adjust to your project's items module


class FinalspiderSpider(scrapy.Spider):
    name = "finalspider"
    # read the URL list in text mode so the start URLs are strings, not bytes
    with open("/Users/NiveRam/Documents/urllist.txt") as source_urls:
        start_urls = [url.strip() for url in source_urls.readlines()]
    contact_page_pattern = ['help', 'office', 'global', 'feedback', 'branch', 'contact', 'about']

    def parse(self, response):
        # response.text is the decoded page body (response.body is raw bytes)
        emails = re.findall(r'[\w\.-]+@[\w\.-]+', response.text)
        story = FinaltestItem()
        story["url"] = response.url
        story["title"] = response.xpath("//title/text()").extract()
        story["email"] = emails
        return story
This retrieves emails from the entire body of the page and outputs emails such as: [info@abc.com, infor@abc.com, yourname@abc.com]
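For context, here is a minimal standalone sketch (the sample string is made up) showing what that regex captures from a chunk of page text:

import re

# hypothetical sample text standing in for a page body
sample = "Contact us at info@abc.com or yourname@abc.com for details."
emails = re.findall(r'[\w\.-]+@[\w\.-]+', sample)
print(emails)  # ['info@abc.com', 'yourname@abc.com']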
Answer 0 (score: -1)
You can access the current URL through the url attribute of the response object:
# assuming the same imports (re, scrapy, FinaltestItem) as in the spider above
class MySpider(scrapy.Spider):
    url_keywords = ['stackoverflow', 'tea']

    def parse(self, response):
        story = FinaltestItem()
        # check if any of the defined keywords can be found in response.url
        get_email = any(k in response.url for k in self.url_keywords)
        if get_email:  # if yes, extract and add the emails
            emails = re.findall(r'[\w\.-]+@[\w\.-]+', response.text)
            story["email"] = emails
        story["url"] = response.url
        story["title"] = response.xpath("//title/text()").extract()
        return story
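Another option, not part of the answer above, is to only request the contact-like pages in the first place by following links whose URLs match the patterns. Below is a minimal sketch assuming Scrapy's CrawlSpider and LinkExtractor; the class name ContactSpider is made up, and the file path and item class are carried over from the question:

import re
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from myproject.items import FinaltestItem  # adjust to your project's items module


class ContactSpider(CrawlSpider):
    name = "contactspider"
    with open("/Users/NiveRam/Documents/urllist.txt") as f:
        start_urls = [url.strip() for url in f.readlines()]

    contact_page_pattern = ['help', 'office', 'global', 'feedback', 'branch', 'contact', 'about']

    # follow only links whose URL matches one of the contact keywords,
    # and parse those pages with parse_contact (CrawlSpider reserves parse())
    rules = (
        Rule(LinkExtractor(allow=contact_page_pattern), callback='parse_contact'),
    )

    def parse_contact(self, response):
        story = FinaltestItem()
        story["url"] = response.url
        story["title"] = response.xpath("//title/text()").extract()
        story["email"] = re.findall(r'[\w\.-]+@[\w\.-]+', response.text)
        return story

One caveat with this approach: only pages whose link URL contains one of the keywords get requested at all, so contact pages reachable only through non-matching URLs would be missed.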