Scrapy item with dynamic fields

Time: 2018-04-15 13:16:57

Tags: python scrapy

The item field needs to change depending on the index of the URL set in start_urls.

For example:

location = input("Location:")
second_location = input("Second Location:")
start_urls = [
    "https://www.yellowpages.com/search?search_terms=" + search_item + "&geo_location_terms=" + location,
    "https://www.yellowpages.com/search?search_terms=" + search_item + "&geo_location_terms=" + second_location
    # "https://www.yellowpages.com/search?search_terms=" + search_item + "&geo_location_terms=" + third_location,
    # "https://www.yellowpages.com/search?search_terms=" + search_item + "&geo_location_terms=" + fourth_location
]

if self.start_urls[0]:
    item['location'] = location

if self.start_urls[1]:
    item['location'] = second_location

What happens is that item['location'] is fixed and does not change dynamically, so every item is output with the same location value, even when the URL is self.start_urls[1].

This is what I have done so far.

items.py

import scrapy


class Item(scrapy.Item):
    business_name = scrapy.Field()
    website = scrapy.Field()
    phonenumber = scrapy.Field()
    email = scrapy.Field()
    location = scrapy.Field()
    # third_location = scrapy.Field()
    # fourth_location = scrapy.Field()
    visit_id = scrapy.Field()
    visit_status = scrapy.Field()

myspider.py

1 Answer:

Answer 0 (score: 0)

Your code does not make sense.

if self.start_urls[0]:
    item['location'] = location

if self.start_urls[1]:
    item['location'] = second_location

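Both of these blocks are executed as long as the elements of start_urls are not empty strings (or some other falsy value), so the second assignment always overwrites the first.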
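To make this concrete, here is a minimal sketch in plain Python, outside of Scrapy. The search term and the two city names are made-up values standing in for the user input in the question.

```python
# Hypothetical values standing in for the input() calls in the question.
location = "Boston"
second_location = "Chicago"

start_urls = [
    "https://www.yellowpages.com/search?search_terms=pizza&geo_location_terms=" + location,
    "https://www.yellowpages.com/search?search_terms=pizza&geo_location_terms=" + second_location,
]

item = {}

# A non-empty string is truthy, so this branch always runs.
if start_urls[0]:
    item["location"] = location

# This branch always runs too, and overwrites the value set above.
if start_urls[1]:
    item["location"] = second_location

print(item["location"])  # Chicago
```

The conditions test whether each URL string is non-empty. They do not test which URL produced the response being parsed.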

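If I understand your problem correctly, you want item['location'] to be the same location that was used in the start URL. The simplest way to do that is to have your requests carry this information.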

You should build custom requests in start_requests() and pass the location as request metadata, using the method described at https://doc.scrapy.org/en/latest/topics/request-response.html#topics-request-response-ref-request-callback-arguments.

After that, simply pass it along to any subsequent requests.