URL generator for Scrapy start URLs only reads the first URL. Why?

Time: 2014-02-09 12:03:43

Tags: python class url scrapy

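I am using Scrapy as my web scraping framework to crawl many different domains for a set of companies. I wrote a URL generator class that reads a file of companies and generates the start URLs for those companies on different web pages (only one example company is shown). The scraper runs fine for the first record, but it does not run for the other URLs. I tested the URL generator and it returns all the URLs, but for some reason this does not work: start_urls = [start_url.company-site()]. Any ideas?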

URL generator file:

# -*- coding: utf-8 -*-
import os 
import os.path

class URL(object):
    P=[]

    def read(self, filename):
        with open(filename) as f:
            for line in f:
                field = line.split(',')
                company = field[1].replace(" ", '+')
                adress="{0}+{1}".format(field[5],field[11])
                self.P.append("http://www.companywebpage.com/market-search?q={0}".format(company))

    def company-site(self):
        for i in self.P:
            return i

Spider file:

root = os.getcwd()
start_url = URL()
p = os.path.join(root, 'Company_Lists', 'Test_of_company.csv')
start_url.read(p)

class company-spider(BaseSpider):
    name = "Company-page"
    allowed_domains = ["CompanyDomain.se"]
    start_urls = [start_url.company-site()]

1 answer:

Answer 0 (score: 1)

Replace

def company-site(self):
    for i in self.P:
        return i

with

def urls(self):
    for url in self.P:
        yield url
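To see why the original method only ever produces one URL, here is a minimal sketch. The constructor, the method name first_only, and the example.com URLs are made up for illustration (the hyphen in company-site is not valid in a Python identifier, so a different name is used here). A return inside a for loop exits the function on the first iteration, while yield turns the method into a generator that hands back every item.

```python
class URL(object):
    def __init__(self, urls):
        # Hypothetical constructor for this sketch; the class in the
        # question fills P by reading a CSV file instead.
        self.P = list(urls)

    def first_only(self):
        # Same shape as the method in the question: `return` leaves the
        # function during the first pass through the loop.
        for url in self.P:
            return url

    def urls(self):
        # `yield` makes this a generator, so every URL is produced.
        for url in self.P:
            yield url


# Placeholder URLs, not taken from the question.
gen = URL([
    "http://example.com/a",
    "http://example.com/b",
    "http://example.com/c",
])

first = gen.first_only()      # a single string, the first URL
all_urls = list(gen.urls())   # all three URLs
print(first)
print(all_urls)
```

Wrapping the single returned string in a list, as the question does, therefore gives a start list with exactly one entry.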

And replace

start_urls = [start_url.company-site()]

with

start_urls = start_url.urls()

or

start_urls = start_url.P

This works because Spider.start_requests simply iterates over start_urls. It looks like this:

def start_requests(self):
    for url in self.start_urls:
        yield self.make_requests_from_url(url)
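The effect of that loop can be sketched without Scrapy installed. SketchSpider below is a stand-in written for this illustration, not the real scrapy class, and its make_requests_from_url returns a plain tuple instead of a Request object. The URLs are placeholders. The sketch also shows one difference between the two suggested fixes: a generator can only be consumed once, whereas a list can be iterated again.

```python
class SketchSpider(object):
    """Stand-in for a spider; only mimics the loop shown above."""
    start_urls = []

    def make_requests_from_url(self, url):
        # Assumption: the real method builds a Request object;
        # a tuple is enough to count what would be scheduled.
        return ("GET", url)

    def start_requests(self):
        for url in self.start_urls:
            yield self.make_requests_from_url(url)


# Placeholder URLs for the sketch.
urls = ["http://example.com/a", "http://example.com/b"]


def url_gen():
    # Generator version, like the urls() method in the answer.
    for u in urls:
        yield u


single = SketchSpider()
single.start_urls = [urls[0]]   # what [start_url.company-site()] amounts to
single_count = len(list(single.start_requests()))   # 1

listed = SketchSpider()
listed.start_urls = urls        # like start_urls = start_url.P
listed_count = len(list(listed.start_requests()))   # 2

lazy = SketchSpider()
lazy.start_urls = url_gen()     # like start_urls = start_url.urls()
first_pass = len(list(lazy.start_requests()))       # 2
second_pass = len(list(lazy.start_requests()))      # 0, generator is exhausted
print(single_count, listed_count, first_pass, second_pass)
```

If start_urls might be read more than once, the list form (start_url.P) is the safer of the two options.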