
Time: 2017-02-02 05:09:52

Tags: python web-scraping spyder

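I have a Python spider script that just dumps URLs, but it only accepts a single URL as input. I have a large txt file listing the input domains, and I want to process all of them and save the output to a txt file.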


1 answer:

Answer 0 (score: 0)


scrapy crawl google_parser  > output.txt


google_parser.py

import sys
from urllib.parse import urlparse

from scrapy import Spider, Request
# Assumes a Scrapy release in which OffsiteMiddleware is a spider middleware.
from scrapy.spidermiddlewares.offsite import OffsiteMiddleware


class MySpider(Spider):
    name = 'google_parser'
    allowed_domains = []

    def start_requests(self):
        # Read one URL per line from standard input, skipping blank lines.
        urls = [line.strip() for line in sys.stdin if line.strip()]
        self.allowed_domains = [urlparse(url).hostname for url in urls]
        # Refresh the regex cache for `allowed_domains`
        # thx to - http://stackoverflow.com/questions/5161815/dynamically-add-to-allowed-domains-in-a-scrapy-spider
        for mw in self.crawler.engine.scraper.spidermw.middlewares:
            if isinstance(mw, OffsiteMiddleware):
                # Rebuild the host regex from the updated allowed_domains.
                mw.spider_opened(self)
        for url in urls:
            yield Request(url)

    def parse(self, response):
        # Follow every link on the page, resolved against the page URL.
        for url in response.xpath('//a/@href').extract():
            new_url = response.urljoin(url)
            yield Request(new_url)
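
The input handling in start_requests can be tried on its own, without Scrapy. The following is only a minimal sketch: the helper names load_urls and hostnames are hypothetical, and an in-memory stream stands in for sys.stdin. It also shows one thing worth checking in the input file: a bare domain without a scheme (for example www.com) yields no hostname from urlparse, so each line should probably carry http:// or https://.

```python
import io
from urllib.parse import urlparse


def load_urls(stream):
    """Return the non-empty, stripped lines of a text stream."""
    return [line.strip() for line in stream if line.strip()]


def hostnames(urls):
    """Return the hostname of each URL, skipping entries without one."""
    hosts = []
    for url in urls:
        host = urlparse(url).hostname
        if host:  # bare domains without a scheme parse to hostname None
            hosts.append(host)
    return hosts


# Stand-in for sys.stdin so the sketch runs without piping a file in.
sample = io.StringIO("http://www.com\n\nhttp://www.me\n")
urls = load_urls(sample)
hosts = hostnames(urls)
print(urls)
print(hosts)
```

In the spider, the same two steps feed the request list and allowed_domains respectively.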


cat urls.txt | scrapy crawl google_parser


['http://www.com', 'http://www.me',]


scrapy crawl google_parser < urls.txt 


scrapy crawl google_parser < urls.txt > output.txt


cat urls.txt | grep '/script.php?' | head -5 | scrapy crawl google_parser > output.txt
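
If the filtering is easier to keep in Python than in the shell, the grep and head steps above can be approximated as follows. This is a rough sketch, not part of the original answer: filter_urls is a hypothetical helper, the example.com URLs are made up, and the match is a plain substring test rather than a grep regular expression.

```python
import io
from itertools import islice


def filter_urls(stream, needle, limit):
    """Return an iterator over at most `limit` stripped lines containing `needle`."""
    matches = (line.strip() for line in stream if needle in line)
    return islice(matches, limit)


# Hypothetical input; in practice this would be open('urls.txt') or sys.stdin.
sample = io.StringIO(
    "http://example.com/script.php?id=1\n"
    "http://example.com/index.html\n"
    "http://example.com/script.php?id=2\n"
)
selected = list(filter_urls(sample, '/script.php?', 5))
for url in selected:
    print(url)
```

Because islice stops after the limit is reached, the rest of a large input file is never read.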