I'm currently building my first scraper with Scrapy, and it's also my first time working with yield. I'm still trying to wrap my head around yield.
The Scraper:
I'm having trouble understanding how to combine the JSON from parse_individual_tabs and parse_individual_listings into a single JSON string. There would be one per individual listing, and it would be sent to an API. Even just printing it for now doesn't work.
import json
import urlparse

import scrapy
from bs4 import BeautifulSoup


class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = [
        '',  # URL redacted
    ]

    def parse(self, response):
        # each row of the results table leads to a page of listings
        rows = response.css('table.apas_tbl tr').extract()
        for row in rows[1:]:
            soup = BeautifulSoup(row, 'lxml')
            dates = soup.find_all('input')
            url = ""  # URL construction redacted
            yield scrapy.Request(url, callback=self.parse_page_contents)

    def parse_page_contents(self, response):
        # collect the pagination links plus the current page
        rows = response.xpath('//div[@id="apas_form"]').extract_first()
        soup = BeautifulSoup(rows, 'lxml')
        pages = soup.find(id='apas_form_text')
        urls = [response.url]
        for link in pages.find_all('a'):
            urls.append('/'.format(link['href']))  # URL construction redacted
        for url in urls:
            yield scrapy.Request(url, callback=self.parse_page_listings)

    def parse_page_listings(self, response):
        rows = response.xpath('//div[@id="apas_form"]').extract_first()
        soup = BeautifulSoup(rows, 'lxml')
        resultTable = soup.find("table", {"class": "apas_tbl"})
        for row in resultTable.find_all('a'):
            url = ""  # URL construction redacted
            yield scrapy.Request(url, callback=self.parse_individual_listings)

    def parse_individual_listings(self, response):
        rows = response.xpath('//div[@id="apas_form"]').extract_first()
        soup = BeautifulSoup(rows, 'lxml')
        fields = soup.find_all('div', {'id': 'fieldset_data'})
        for field in fields:
            print field.label.text.strip()
            print field.p.text.strip()
        # each tab holds more data belonging to the same listing
        tabs = response.xpath('//div[@id="tabheader"]').extract_first()
        soup = BeautifulSoup(tabs, 'lxml')
        links = soup.find_all("a")
        for link in links:
            yield scrapy.Request(
                urlparse.urljoin(response.url, link['href']),
                callback=self.parse_individual_tabs)
I tried changing parse_individual_listings to:
def parse_individual_listings(self, response):
    rows = response.xpath('//div[@id="apas_form"]').extract_first()
    soup = BeautifulSoup(rows, 'lxml')
    fields = soup.find_all('div', {'id': 'fieldset_data'})
    data = {}
    for field in fields:
        data[field.label.text.strip()] = field.p.text.strip()
    tabs = response.xpath('//div[@id="tabheader"]').extract_first()
    soup = BeautifulSoup(tabs, 'lxml')
    links = soup.find_all("a")
    for link in links:
        yield scrapy.Request(
            urlparse.urljoin(response.url, link['href']),
            callback=self.parse_individual_tabs,
            meta={'data': data}
        )
    print data
...
def parse_individual_tabs(self, response):
    data = {}
    rows = response.xpath('//div[@id="tabContent"]').extract_first()
    soup = BeautifulSoup(rows, 'lxml')
    fields = soup.find_all('div', {'id': 'fieldset_data'})
    for field in fields:
        data[field.label.text.strip()] = field.p.text.strip()
    print json.dumps(data)
to:
def parse_individual_tabs(self, response):
    data = {}
    rows = response.xpath('//div[@id="tabContent"]').extract_first()
    soup = BeautifulSoup(rows, 'lxml')
    fields = soup.find_all('div', {'id': 'fieldset_data'})
    for field in fields:
        data[field.label.text.strip()] = field.p.text.strip()
    yield json.dumps(data)
Answer 0 (score: 2):
Usually when collecting data you would use Scrapy Items, but they can also be replaced with plain dictionaries (which would be the JSON objects you are referring to), so that's what we'll use here.
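For reference, a Scrapy Item version would look roughly like this (a minimal sketch; the field names are hypothetical, not taken from the question):

import scrapy

class ListingItem(scrapy.Item):
    # declare one Field per attribute you plan to collect
    address = scrapy.Field()
    date = scrapy.Field()

An Item instance behaves like a dict, so the code below works the same if you swap the dictionary for a ListingItem().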
First, start creating the item (or dictionary) in the parse_individual_listings method, just as you do with data in parse_individual_tabs. Then pass it along to the next request, where parse_individual_tabs will catch it via the meta argument, so it should look like this:
def parse_individual_listings(self, response):
    ...
    data = {}
    data[field1] = 'data1'
    data[field2] = 'data2'
    ...
    yield scrapy.Request(
        urlparse.urljoin(response.url, link['href']),
        callback=self.parse_individual_tabs,
        meta={'data': data}
    )
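As a side note, on newer Scrapy releases (1.7+) cb_kwargs is generally preferred over meta for handing values to a callback, since meta is also used by Scrapy's own extensions. A sketch under that assumption:

    yield scrapy.Request(
        urlparse.urljoin(response.url, link['href']),
        callback=self.parse_individual_tabs,
        cb_kwargs={'data': data},  # delivered to the callback as a keyword argument
    )

def parse_individual_tabs(self, response, data):
    ...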
Then you can fetch that data in parse_individual_tabs:
def parse_individual_tabs(self, response):
    data = response.meta['data']
    ...
    # keep populating `data`
    yield data
Now data in parse_individual_tabs contains all the information you asked for, and you can do the same thing between any callback requests.
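To make the whole chain concrete, here is a minimal self-contained sketch of the technique (the URLs, selectors and field names are placeholders, not the ones from the question):

import scrapy

class ChainSpider(scrapy.Spider):
    name = "chain"
    start_urls = ['']  # placeholder URL

    def parse(self, response):
        # start the item on the first page...
        data = {'title': response.css('h1::text').extract_first()}
        # ...and carry it to the detail page via meta
        detail_href = response.css('a.detail::attr(href)').extract_first()
        yield scrapy.Request(
            response.urljoin(detail_href),
            callback=self.parse_detail,
            meta={'data': data},
        )

    def parse_detail(self, response):
        # pick the item back up and keep populating it
        data = response.meta['data']
        data['detail'] = response.css('p::text').extract_first()
        yield data  # one finished dict per listing

Note that the callbacks yield the dict itself rather than json.dumps(data): once the finished dict is yielded, Scrapy's feed exports (e.g. scrapy crawl chain -o items.json) or an item pipeline can take care of the JSON serialization and the eventual POST to your API.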