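I'm scraping the following page: http://graphics.stltoday.com/apps/payrolls/salaries/teachers/detail/25074/ and trying to get every value from the table (salary, position, years in district, etc.). When I access these from the scrapy shell with response.xpath('//th[@scope="row"]/following-sibling::td[1]/text()').extract(), they all show up. However, when I do the same thing in the spider, only the first element (district) shows up. Any suggestions?
Spider code (ideally, each element would go into its own variable for cleaner output):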
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class Spider2(CrawlSpider):
    #name of the spider
    name = 'stlteacher'
    #list of allowed domains
    allowed_domains = ['graphics.stltoday.com']
    #starting url for scraping
    start_urls = ['http://graphics.stltoday.com/apps/payrolls/salaries/teachers/']
    rules = [
        Rule(LinkExtractor(
            allow=['/apps/payrolls/salaries/teachers/[0-9]+/$']),
            follow=True),
        Rule(LinkExtractor(
            allow=['/apps/payrolls/salaries/teachers/[0-9]+/position/[0-9]+/$']),
            follow=True),
        Rule(LinkExtractor(
            allow=['/apps/payrolls/salaries/teachers/detail/[0-9]+/$']),
            callback='parse_item',
            follow=True),
    ]
    #setting the location of the output csv file
    custom_settings = {
        'FEED_FORMAT' : "csv",
        'FEED_URI' : 'tmp/stlteachers3.csv'
    }

    def parse_item(self, response):
        #Remove XML namespaces
        response.selector.remove_namespaces()
        #Extract article information
        url = response.url
        name = response.xpath('//p[@class="table__title"]/text()').extract()
        district = response.xpath('//th[@scope="row"]/following-sibling::td[1]/text()').extract()
        for item in zip(name, district):
            scraped_info = {
                'url' : url,
                'name' : item[0],
                'district' : item[1],
            }
            yield scraped_info
Answer 0 (score: 3)
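Your zip is a bit confusing. If you want to crawl the whole table, you need to iterate over the table rows and look up each row's name and value.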
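The reason only the first value appears is that zip stops as soon as its shortest input runs out. A minimal sketch, using hard-coded lists as stand-ins for the two XPath results (the values are copied from the output of this answer; the assumption is that the title selector returns one string and the table selector returns several):

```python
# Stand-ins for the two .extract() results (assumed shapes, not real responses).
name = ['Bracht, Nathan']                       # one title on the page
district = ['Affton 101', 'Central Office',     # one entry per table row
            'Central Office Admin.']

# zip() pairs items positionally and stops at the shortest input,
# so a one-element list yields exactly one pair.
pairs = list(zip(name, district))
print(pairs)  # [('Bracht, Nathan', 'Affton 101')]
```

Every table value after the first is silently dropped, which matches what the spider shows.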
I got pretty good results with this code:
def parse_item(self, response):
    name = response.xpath('//p[@class="table__title"]/text()').extract_first()
    item = {
        'name': name,
        'url': response.url
    }
    for row in response.xpath('//th[@scope="row"]'):
        row_name = row.xpath('text()').extract_first('').lower().strip(':')
        row_value = row.xpath('following-sibling::td[1]/text()').extract_first()
        item[row_name] = row_value
    yield item
It returns:
{
    'name': 'Bracht, Nathan',
    'url': 'http://graphics.stltoday.com/apps/payrolls/salaries/teachers/detail/25074/',
    'district': 'Affton 101',
    'school': 'Central Office',
    'position': 'Central Office Admin.',
    'degree earned': 'Doct',
    'salary': '$152,000.00',
    'extended contract pay': None,
    'extra duty pay': None,
    'total pay (all combined)': '$152,000.00',
    'years in district': '5',
    'years in mo schools': '19',
    'multiple position detail': None
}
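The row-by-row idea can also be tried without running a crawl. What follows is only a sketch: it uses the standard library's xml.etree.ElementTree instead of Scrapy selectors, and the SAMPLE markup and the rows_to_dict helper are made up for illustration (the real page's markup may differ). ElementTree does not support the following-sibling axis, so the sketch walks the tr elements and reads the th/td pair inside each one.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the salary table on the detail page.
SAMPLE = """
<table>
  <tr><th scope="row">District:</th><td>Affton 101</td></tr>
  <tr><th scope="row">Salary:</th><td>$152,000.00</td></tr>
  <tr><th scope="row">Extra duty pay:</th><td></td></tr>
</table>
"""

def rows_to_dict(markup):
    """Build one dict per table: header text becomes the key, cell text the value."""
    root = ET.fromstring(markup.strip())
    item = {}
    for tr in root.iter('tr'):
        th = tr.find("th[@scope='row']")
        td = tr.find('td')
        if th is None or td is None:
            continue  # skip rows that lack a header/value pair
        key = (th.text or '').lower().strip(':')  # same cleanup as row_name above
        item[key] = td.text or None               # empty cells become None
    return item

result = rows_to_dict(SAMPLE)
print(result)
```

As in the spider version, all rows end up in a single dict, so each field gets its own CSV column rather than its own output row.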