I am currently building my first scraper with Scrapy, and this is also my first time using yield. I am quite confused about how yield actually works.
The Scraper:
On those pages I then want to scrape every listing and extract its data. Each individual listing also has 4 tabs that need to be scraped.
The scraper currently works to some extent. My main concerns right now are:
ERROR: Spider must return Request, BaseItem, dict or None, got 'str' in
Also, some of the scraped URLs are duplicated. I would like to know (a) what is causing the error above and (b) whether the yields are formatted correctly?
Answer 0 (score: 0)
You are returning a string from one of your parse methods:
def parse_individual_tabs(self, response):
    data = {}
    rows = response.xpath('//div[@id="tabContent"]').extract_first()
    soup = BeautifulSoup(rows, 'lxml')
    fields = soup.find_all('div', {'id': 'fieldset_data'})
    for field in fields:
        data[field.label.text.strip()] = field.p.text.strip()
    yield json.dumps(data)  # <---- here
As the error message says:
ERROR: Spider must return Request, BaseItem, dict or None, got 'str'
So just yield data instead, since data is a dict.
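For reference, a minimal corrected sketch of that callback (same selectors and field names as in the snippet above) would be:

def parse_individual_tabs(self, response):
    data = {}
    rows = response.xpath('//div[@id="tabContent"]').extract_first()
    soup = BeautifulSoup(rows, 'lxml')
    fields = soup.find_all('div', {'id': 'fieldset_data'})
    for field in fields:
        data[field.label.text.strip()] = field.p.text.strip()
    yield data  # a dict, which Scrapy accepts as a scraped item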
Edit:
Regarding your second question: there is a problem in your last two parse methods:
def parse_individual_listings(self, response):
    # <..>
    data = {}
    for field in fields:
        data[field.label.text.strip()] = field.p.text.strip()
    # <..>
    for link in links:
        yield scrapy.Request(
            urlparse.urljoin(response.url, link['href']),
            callback=self.parse_individual_tabs,
            meta={'data': data}  # <-- you carry data here to the callback below
        )

def parse_individual_tabs(self, response):
    data = {}  # <--- here's the issue
    # instead you should retrieve the data carried over from above:
    data = response.meta['data']
    # <..>
    for field in fields:
        data[field.label.text.strip()] = field.p.text.strip()
    return data  # also, since there's only 1 item you can just return it instead of yielding; it makes no difference
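Put together, a sketch of the fixed parse_individual_tabs (reusing the selectors from the question and the 'data' meta key set in parse_individual_listings above) looks like this:

def parse_individual_tabs(self, response):
    data = response.meta['data']  # dict carried over from parse_individual_listings
    rows = response.xpath('//div[@id="tabContent"]').extract_first()
    soup = BeautifulSoup(rows, 'lxml')
    fields = soup.find_all('div', {'id': 'fieldset_data'})
    for field in fields:
        data[field.label.text.strip()] = field.p.text.strip()
    return data  # a single dict, so returning works just as well as yielding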
Answer 1 (score: 0)
You are already yielding requests in parse_individual_listings, so there is no need to yield the data in parse_individual_tabs. Because generators are lazy, you need to use return for the value to actually be produced.
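To see what "lazy" means here, a plain-Python sketch (not part of the spider): a function containing yield does not run its body when called; it only returns a generator object whose body runs once it is iterated, whereas return produces the value immediately.

def with_yield():
    print("body runs")   # not printed until the generator is iterated
    yield {"a": 1}

def with_return():
    print("body runs")   # printed as soon as the function is called
    return {"a": 1}

gen = with_yield()       # nothing printed yet - gen is just a generator object
items = list(gen)        # now "body runs" is printed and items == [{'a': 1}]
value = with_return()    # "body runs" is printed immediately and value == {'a': 1}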
Corrected code:
import json
import scrapy
from bs4 import BeautifulSoup
from urllib.parse import urljoin  # needed for urljoin() below (on Python 2 this would be urlparse.urljoin)


class MyScraper(scrapy.Spider):
    name = "myscraper"

    start_urls = [
    ]
    def parse(self, response):
        rows = response.css('table.apas_tbl tr').extract()
        for row in rows[1:]:
            soup = BeautifulSoup(row, 'lxml')
            url = soup.find_all("a")[1]['href']
            yield scrapy.Request(url, callback=self.parse_page_contents)

    def parse_page_contents(self, response):
        rows = response.xpath('//div[@id="apas_form"]').extract_first()
        soup = BeautifulSoup(rows, 'lxml')
        pages = soup.find(id='apas_form_text')
        for link in pages.find_all('a'):
            url = link['href']
            yield scrapy.Request(url, callback=self.parse_page_listings)

    def parse_page_listings(self, response):
        rows = response.xpath('//div[@id="apas_form"]').extract_first()
        soup = BeautifulSoup(rows, 'lxml')
        resultTable = soup.find("table", {"class": "apas_tbl"})
        for row in resultTable.find_all('a'):
            url = row['href']
            yield scrapy.Request(url, callback=self.parse_individual_listings)

    def parse_individual_listings(self, response):
        rows = response.xpath('//div[@id="apas_form"]').extract_first()
        soup = BeautifulSoup(rows, 'lxml')
        fields = soup.find_all('div', {'id': 'fieldset_data'})
        data = {}
        for field in fields:
            data[field.label.text.strip()] = field.p.text.strip()
        tabs = response.xpath('//div[@id="tabheader"]').extract_first()
        soup = BeautifulSoup(tabs, 'lxml')
        links = soup.find_all("a")
        for link in links:
            yield scrapy.Request(
                urljoin(response.url, link['href']),
                callback=self.parse_individual_tabs,
                meta={'data': data}
            )
        print(data)  # debug output of the fields collected so far

    def parse_individual_tabs(self, response):
        data = {}
        rows = response.xpath('//div[@id="tabContent"]').extract_first()
        soup = BeautifulSoup(rows, 'lxml')
        fields = soup.find_all('div', {'id': 'fieldset_data'})
        for field in fields:
            data[field.label.text.strip()] = field.p.text.strip()
        return json.dumps(data)