This is a continuation of the question: Extract from dynamic JSON response with Scrapy
I have a Scrapy spider that extracts values from a JSON response. It works well and extracts the right values, but somehow it gets into a loop and returns more results than expected (duplicate results).
For example, for the 17 values provided in the test.txt file it returns 289 results, which is 17 times more than expected.
Spider content below:
    import scrapy
    import json
    from whois.items import WhoisItem

    class whoislistSpider(scrapy.Spider):
        name = "whois_list"
        start_urls = []
        f = open('test.txt', 'r')
        global lines
        lines = f.read().splitlines()
        f.close()

        def __init__(self):
            for line in lines:
                self.start_urls.append('http://www.example.com/api/domain/check/%s/com' % line)

        def parse(self, response):
            for line in lines:
                jsonresponse = json.loads(response.body_as_unicode())
                item = WhoisItem()
                domain_name = list(jsonresponse['domains'].keys())[0]
                item["avail"] = jsonresponse["domains"][domain_name]["avail"]
                item["domain"] = domain_name
                yield item
items.py content below
    import scrapy

    class WhoisItem(scrapy.Item):
        avail = scrapy.Field()
        domain = scrapy.Field()
pipelines.py below
    class WhoisPipeline(object):
        def process_item(self, item, spider):
            return item
Thank you in advance for all the replies.
Answer 0 (score: 1)
The parse function should be like this:
    def parse(self, response):
        jsonresponse = json.loads(response.body_as_unicode())
        item = WhoisItem()
        domain_name = list(jsonresponse['domains'].keys())[0]
        item["avail"] = jsonresponse["domains"][domain_name]["avail"]
        item["domain"] = domain_name
        yield item
Notice that I removed the for loop.
What was happening: Scrapy calls parse once per response, and for every single response you looped over all 17 lines and yielded the same parsed item 17 times. With 17 responses, that gives 17 * 17 = 289 records.
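If you want to see the arithmetic without running the spider, here is a minimal sketch that reproduces the count with plain Python and no Scrapy at all. The helper names (`make_body`, `extract`, `parse_with_loop`, `parse_fixed`) and the sample domain names are made up for illustration; the JSON shape is only inferred from the keys the spider reads (`domains`, then the domain name, then `avail`), so treat it as an assumption about the real API.

```python
import json


def make_body(name, avail):
    # Hypothetical response body, shaped like the JSON the spider reads:
    # {"domains": {"<name>": {"avail": <bool>}}}
    return json.dumps({"domains": {name: {"avail": avail}}})


def extract(body):
    # Build one record from one response body (same logic as in parse()).
    data = json.loads(body)
    domain_name = list(data["domains"].keys())[0]
    return {"domain": domain_name, "avail": data["domains"][domain_name]["avail"]}


def parse_with_loop(body, lines):
    # Mirrors the original parse(): one item per entry in `lines`,
    # all of them built from the same response.
    for _ in lines:
        yield extract(body)


def parse_fixed(body):
    # Mirrors the corrected parse(): exactly one item per response.
    yield extract(body)


# Stand-in for the 17 lines of test.txt (made-up names).
lines = ["domain%d" % i for i in range(17)]
# One simulated response per line, as the spider gets one per start URL.
bodies = [make_body(name + ".com", i % 2 == 0) for i, name in enumerate(lines)]

buggy = [item for body in bodies for item in parse_with_loop(body, lines)]
fixed = [item for body in bodies for item in parse_fixed(body)]

print(len(buggy))  # 289
print(len(fixed))  # 17
```

The buggy version still contains only 17 distinct domains; each one is just repeated 17 times. As a side note, newer Scrapy releases deprecate `body_as_unicode()` in favor of `response.text`, so depending on your version you may want to swap that call as well.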