Scrapy - unexpected suffix "%0A" in links

Time: 2017-09-26 21:26:42

Tags: python scrapy

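I am crawling websites to collect email addresses. I have a simple Scrapy spider that reads domains from a .txt file and then crawls them looking for email addresses.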

Unfortunately, Scrapy is appending the suffix "%0A" to the links. You can see it in the log.

Here is my code:

import scrapy


class EmailsearcherSpider(scrapy.Spider):
    name = 'emailsearcher'
    allowed_domains = []
    start_urls = []
    unique_data = set()

    def __init__(self):
        for line in open('/home/*****/domains',
                     'r').readlines():
            self.allowed_domains.append(line)
            self.start_urls.append('http://{}'.format(line))


    def parse(self, response):
        emails = response.xpath('//body').re('([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)')
        for email in emails:
            print(email)
            print('\n')
            if email and (email not in self.unique_data):
                self.unique_data.add(email)
                yield {'emails': email}

domains.txt:

link4.pl/kontakt
danone.pl/Kontakt
axadirect.pl/kontakt/dane-axa-direct.html
andrzejtucholski.pl/kontakt
premier.gov.pl/kontakt.html

Here is the log from the console:

2017-09-26 22:27:02 [scrapy.core.engine] INFO: Spider opened
2017-09-26 22:27:02 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-09-26 22:27:02 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6026
2017-09-26 22:27:03 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://www.premier.gov.pl/kontakt.html> from <GET http://premier.gov.pl/kontakt.html>
2017-09-26 22:27:03 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://andrzejtucholski.pl/kontakt> from <GET http://andrzejtucholski.pl/kontakt%0A>
2017-09-26 22:27:05 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://axadirect.pl/kontakt/dane-axa-direct.html%0A> from <GET http://axadirect.pl/kontakt/dane-axa-direct.html%0A>
2017-09-26 22:27:05 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.link4.pl/kontakt> from <GET http://link4.pl/kontakt%0A>
2017-09-26 22:27:05 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://danone.pl/Kontakt%0a> from <GET http://danone.pl/Kontakt%0A>

2 answers:

Answer 0 (score: 0)

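%0A is the URL-encoded newline character. readlines() keeps the trailing newline on every line it returns, so the newline ends up inside the URL. To get rid of it, strip each line before building the URL, like this:

            self.start_urls.append('http://{}'.format(line.strip()))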
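To see where the suffix comes from, here is a minimal sketch that runs without Scrapy. The in-memory file and the `build_url` helper are made up for illustration, and `urllib.parse.quote` is only a stand-in for percent-encoding in general; it does not reproduce Scrapy's own URL handling.

```python
import io
from urllib.parse import quote

# Hypothetical stand-in for the domains file; io.StringIO behaves like an open text file.
domains_file = io.StringIO("link4.pl/kontakt\ndanone.pl/Kontakt\n")

# readlines() keeps the line terminator on each element.
raw_lines = domains_file.readlines()
print(repr(raw_lines[0]))  # 'link4.pl/kontakt\n'

# Formatting the raw line into a URL carries the newline along;
# percent-encoding then turns it into %0A.
dirty = 'http://{}'.format(raw_lines[0])
print(quote(dirty, safe=':/'))  # http://link4.pl/kontakt%0A


def build_url(line):
    """Hypothetical helper: strip surrounding whitespace before formatting."""
    return 'http://{}'.format(line.strip())


clean_urls = [build_url(line) for line in raw_lines]
print(clean_urls)
```

With the lines stripped first, the resulting URLs no longer contain a newline to encode.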

Answer 1 (score: 0)

I found the correct solution. I had to use the rstrip method.

 self.start_urls.append('http://{}'.format(line.rstrip()))
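A slightly fuller sketch of the same idea, again without Scrapy so it can be run on its own. The function name `load_targets` and the temporary file are assumptions made for illustration. The sketch also skips blank lines and keeps only the host part for `allowed_domains`, because Scrapy documents that attribute as a list of domain names rather than domain-plus-path strings; that part goes beyond what the answer itself states.

```python
import os
import tempfile
from urllib.parse import urlparse


def load_targets(path):
    """Hypothetical helper: return (allowed_domains, start_urls) from a file of 'host/path' lines."""
    allowed_domains = []
    start_urls = []
    with open(path, 'r') as handle:
        for line in handle:
            entry = line.rstrip()  # the fix from this answer: drop the trailing newline
            if not entry:
                continue  # skip blank lines, such as an empty last line
            url = 'http://{}'.format(entry)
            host = urlparse(url).hostname  # e.g. 'link4.pl' for 'http://link4.pl/kontakt'
            if host not in allowed_domains:
                allowed_domains.append(host)
            start_urls.append(url)
    return allowed_domains, start_urls


# Usage with a throwaway file standing in for the real domains file.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('link4.pl/kontakt\ndanone.pl/Kontakt\n\n')
    tmp_path = tmp.name

domains, urls = load_targets(tmp_path)
os.remove(tmp_path)
print(domains)
print(urls)
```

In a spider, the two returned lists could be assigned to `allowed_domains` and `start_urls` inside `__init__`, in place of the loop from the question.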