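I'm having trouble building a CSV file from scraped data. I've managed to scrape the data from the table, but I've been stuck for days on writing it out. I'm using items and trying to write them into a pandas DataFrame. I'm using lists of items.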
import scrapy
from wiki.items import WikiItem
import pandas as pd

class Spider(scrapy.Spider):
    name = "wiki"
    start_urls = ['https://datatables.net/']

    def parse(self, response):
        items = {'Name':[], 'Position':[], 'Office':[], 'Age':[],
                 'Start_Date':[],'Salary':[]}
        trs = response.xpath('//table[@id="example"]//tr')
        name = WikiItem()
        pos = WikiItem()
        office = WikiItem()
        age = WikiItem()
        start_data = WikiItem()
        salary = WikiItem()
        name['name'] = trs.xpath('//td[1]//text()').extract()
        pos['position'] = trs.xpath('//td[2]//text()').extract()
        office['office'] = trs.xpath('//td[3]//text()').extract()
        age['age'] = trs.xpath('//td[4]//text()').extract()
        start_data['start_data'] = trs.xpath('//td[5]//text()').extract()
        salary['salary'] = trs.xpath('td[6]//text()').extract()
        items['Name'].append(name)
        items['Position'].append(pos)
        items['Office'].append(office)
        items['Age'].append(age)
        items['Start_Date'].append(start_data)
        items['Salary'].append(salary)
        x = pd.DataFrame(items, columns=['Name','Position','Office','Age',
                                         'Start_Date','Salary'])
        yield x.to_csv("r",sep=",")
What I get from this code looks like this:
,Name,Position,Office,Age,Start_Date,Salary
0,"{'name': [u'Tiger Nixon',
u'Garrett Winters',
u'Ashton Cox',
u'Cedric Kelly',
u'Airi Satou',
u'Brielle Williamson',
u'Herrod Chandler',
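I get the Name column, but I get it 59 times. For example, in the first row I have 'Tiger Nixon' 59 times. I also get the Position column 59 times, and so on. The scraped data isn't well structured either. I'm new to Scrapy and open to any help or advice. Thanks in advance!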
Edit: my items.py looks like this:
import scrapy

class WikiItem(scrapy.Item):
    name = scrapy.Field()
    position = scrapy.Field()
    office = scrapy.Field()
    age = scrapy.Field()
    start_data = scrapy.Field()
    salary = scrapy.Field()
Answer 0 (score: 3)
OK, I can't comment, and I can't test your code because I don't have the definition of WikiItem. But let's iterate on this answer, OK? Can you check with this code?
import scrapy
import pandas as pd

class Spider(scrapy.Spider):
    name = "wiki"
    start_urls = ['https://datatables.net/']

    def parse(self, response):
        trs = response.xpath('//table[@id="example"]//tr')
        if trs:
            items = []
            for tr in trs:
                # Debug output: the Position cell of the current row
                print(tr.xpath('td[2]//text()').extract())
                # XPaths without a leading '//' are evaluated relative to this <tr>
                item = {
                    "Name": tr.xpath('td[1]//text()').extract(),
                    "Position": tr.xpath('td[2]//text()').extract(),
                    "Office": tr.xpath('td[3]//text()').extract(),
                    "Age": tr.xpath('td[4]//text()').extract(),
                    "Start_Date": tr.xpath('td[5]//text()').extract(),
                    "Salary": tr.xpath('td[6]//text()').extract()
                }
                items.append(item)
            x = pd.DataFrame(items, columns=['Name','Position','Office','Age',
                                             'Start_Date','Salary'])
            # With a path argument, to_csv() writes the file "r" and returns None
            yield x.to_csv("r",sep=",")