Using Scrapy

Time: 2017-07-06 02:34:57

Tags: python web-scraping scrapy

I'm having trouble building a CSV-style data file from scraped data. I've managed to scrape the data from the table, but I haven't been able to write it out properly for several days. I'm using Items and trying to write them into a pandas DataFrame, using a list of items.

import scrapy
from wiki.items import WikiItem
import pandas as pd

class Spider(scrapy.Spider):

    name = "wiki"
    start_urls = ['https://datatables.net/']

    def parse(self, response):

        items = {'Name': [], 'Position': [], 'Office': [], 'Age': [],
                 'Start_Date': [], 'Salary': []}

        trs = response.xpath('//table[@id="example"]//tr')
        name = WikiItem()
        pos = WikiItem()
        office = WikiItem()
        age = WikiItem()
        start_data = WikiItem()
        salary = WikiItem()

        name['name'] = trs.xpath('//td[1]//text()').extract()
        pos['position'] = trs.xpath('//td[2]//text()').extract()
        office['office'] = trs.xpath('//td[3]//text()').extract()
        age['age'] = trs.xpath('//td[4]//text()').extract()
        start_data['start_data'] = trs.xpath('//td[5]//text()').extract()
        salary['salary'] = trs.xpath('td[6]//text()').extract()

        items['Name'].append(name)
        items['Position'].append(pos)
        items['Office'].append(office)
        items['Age'].append(age)
        items['Start_Date'].append(start_data)
        items['Salary'].append(salary)

        x = pd.DataFrame(items, columns=['Name', 'Position', 'Office', 'Age',
                                         'Start_Date', 'Salary'])

        yield x.to_csv("r", sep=",")

From this code, what I get looks like this:

,Name,Position,Office,Age,Start_Date,Salary
0,"{'name': [u'Tiger Nixon',
      u'Garrett Winters',
      u'Ashton Cox',
      u'Cedric Kelly',
      u'Airi Satou',
      u'Brielle Williamson',
      u'Herrod Chandler',

I get the Name column, but I get it 59 times. For example, in the first row I have 'Tiger Nixon' 59 times. I also get the Position column 59 times, and so on. The scraped data isn't clean either. I'm new to Scrapy and open to any help or suggestions. Thanks in advance!
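(Aside: the repetition is very likely caused by the leading `//` in the row-level XPaths. `trs.xpath('//td[1]//text()')` matches every first-column cell in the whole document once per row, instead of one cell per row. The following stdlib sketch, which uses `xml.etree.ElementTree` on made-up two-row data rather than Scrapy's selectors, shows the effect:)

```python
import xml.etree.ElementTree as ET

# Hypothetical two-row table standing in for the datatables.net example table
html = """
<table id="example">
  <tr><td>Tiger Nixon</td><td>System Architect</td></tr>
  <tr><td>Garrett Winters</td><td>Accountant</td></tr>
</table>
"""
root = ET.fromstring(html)
rows = root.findall(".//tr")

# Relative lookup: scoped to each row -> one name per row
per_row = [tr.find("td").text for tr in rows]
print(per_row)   # ['Tiger Nixon', 'Garrett Winters']

# Document-wide lookup repeated once per row, mimicking '//td[1]//text()'
repeated = [td.text for _ in rows for td in root.findall(".//td[1]")]
print(repeated)  # each name appears once per row: 2 rows -> 2 copies of each name
```

With 59 rows, a document-wide lookup repeated per row yields each value 59 times, which matches the symptom described above.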

Edit: my items.py looks like this:

import scrapy


class WikiItem(scrapy.Item):

    name = scrapy.Field()
    position = scrapy.Field()
    office = scrapy.Field()
    age = scrapy.Field()
    start_data = scrapy.Field()
    salary = scrapy.Field()

1 Answer:

Answer 0 (score: 3)

OK, I can't test your code since I didn't have the definition of WikiItem, so I can't comment in detail. But let's iterate on this answer, shall we? Can you check with this code?

import scrapy
import pandas as pd


class Spider(scrapy.Spider):

    name = "wiki"
    start_urls = ['https://datatables.net/']

    def parse(self, response):

        trs = response.xpath('//table[@id="example"]//tr')

        if trs:
            items = []
            for tr in trs:
                # Relative XPath (no leading '//') keeps each lookup
                # scoped to the current row
                print(tr.xpath('td[2]//text()').extract())
                item = {
                    "Name": tr.xpath('td[1]//text()').extract(),
                    "Position": tr.xpath('td[2]//text()').extract(),
                    "Office": tr.xpath('td[3]//text()').extract(),
                    "Age": tr.xpath('td[4]//text()').extract(),
                    "Start_Date": tr.xpath('td[5]//text()').extract(),
                    "Salary": tr.xpath('td[6]//text()').extract()
                }
                items.append(item)

            x = pd.DataFrame(items, columns=['Name', 'Position', 'Office', 'Age',
                                             'Start_Date', 'Salary'])

            yield x.to_csv("r", sep=",")
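A follow-up note (my addition, not part of the original answer): `x.to_csv("r", sep=",")` writes the frame to a file literally named `r` and returns `None`, so the `yield` emits nothing useful to Scrapy. If a CSV file is the end goal, the idiomatic Scrapy route is to yield each row's dict and run the spider with a feed export (`scrapy crawl wiki -o table.csv`). Outside of Scrapy, the stdlib `csv` module can write the collected row dicts directly; a minimal sketch (the file name and sample row are made up):

```python
import csv

FIELDS = ["Name", "Position", "Office", "Age", "Start_Date", "Salary"]

def write_rows(rows, path):
    """Write a list of per-row dicts to a CSV file with a header line."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)

# Sample row shaped like one scraped table row (values are illustrative)
write_rows(
    [{"Name": "Tiger Nixon", "Position": "System Architect", "Office": "Edinburgh",
      "Age": "61", "Start_Date": "2011/04/25", "Salary": "$320,800"}],
    "table.csv",
)
```

`DictWriter` also takes care of quoting, so a value containing a comma (like the salary above) stays in one column.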