I want to extract text data from forum posts. Here is my working spider:
import scrapy
import csv  # not used yet


class QuotesSpider(scrapy.Spider):
    name = "quotes2"
    start_urls = [
        'https://www.motor-talk.de/faq/mercedes-e-klasse-w210-q89.html#Q3512477',
    ]

    def parse(self, response):
        xString = ' '
        xStringLink = ' '
        for i in range(4, 6):  # start, stop
            xString = '//*[@id="questions"]/div[2]/div[' + str(i) + ']/div[2]/div[1]/table/tr/td/div/text()'
            xStringLink = '//*[@id="questions"]/div[2]/div[' + str(i) + ']/a/@name'
            scraped_info = {
                'forum post': response.xpath(xString).extract(),
                'link': response.xpath(xStringLink).extract()
            }
            yield scraped_info
However, I only get the following output. As you can see, the forum posts are truncated :(
(...)
2017-11-19 15:52:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.motor-talk.de/faq/mercedes-e-klasse-w210-q89.html#Q3512477> (referer: None)
2017-11-19 15:52:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.motor-talk.de/faq/mercedes-e-klasse-w210-q89.html>
{'forum post': ['BR210, alle Modelle mit Hersteller-Schlüssel-Nr., Typ-Schlüssel-Nr., Stückzahlen CDI Common-Rail...'], 'link': ['Q3594587']}
2017-11-19 15:52:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.motor-talk.de/faq/mercedes-e-klasse-w210-q89.html>
{'forum post': ['Als Neujahrs-Gruß nachfolgend eine Aufstellung der Fahrgestell-Indent-Nummern, sortiert nach Prod...'], 'link': ['Q5160969']}
2017-11-19 15:52:30 [scrapy.core.engine] INFO: Closing spider (finished)
(...)
The actual posts are much longer, but the strings are simply cut off.
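
One thing that may be worth checking, offered as a sketch and not as a diagnosis: both strings in the log end in a literal "...", and extract() returns the selected text nodes as they are, so the cut might already be present in the HTML that the XPath selects (for example a preview element). The helper below, looks_like_teaser, is a hypothetical name and not part of Scrapy; it only flags values that end in an ellipsis so they can be compared against the raw response body.

```python
def looks_like_teaser(text):
    """Return True if the text ends with an ellipsis (ASCII "..." or Unicode)."""
    stripped = text.rstrip()
    return stripped.endswith("...") or stripped.endswith("\u2026")


# Values copied from the 'forum post' fields in the log output above.
scraped = [
    "BR210, alle Modelle mit Hersteller-Schlüssel-Nr., Typ-Schlüssel-Nr., Stückzahlen CDI Common-Rail...",
    "Als Neujahrs-Gruß nachfolgend eine Aufstellung der Fahrgestell-Indent-Nummern, sortiert nach Prod...",
]

# One flag per scraped value; True means the value ends in an ellipsis.
flags = [looks_like_teaser(s) for s in scraped]
```

If every value is flagged, searching response.text for the truncated string would show whether the "..." is part of the page source; if it is, the full post presumably lives in a different element or on a different page than the one this XPath targets.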