Scrapy: help getting data from a bad table

Time: 2018-01-30 08:48:19

Tags: python scrapy scrapy-spider

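My problem is about getting data out of a badly structured table. I have tried many things without any result >.< Maybe someone has a great idea for parsing this table.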

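What I need to do:

  • Log in to the web page. [DONE]
  • Get the data from each item in the table and create a file for each item containing that data. [UNDONE (except creating the file, which is the easy part)]

Example of the table structure: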

<tr class='divider'> # nonsense </tr>
<tr class='header'> # here's the name of the future file
 <td> # nonsense </td>
 <td class='title'>
  <div class='titlebox'> # really ...
   <span>
    <strong> THE TITLE </strong> # That's a data to be scraped
   </span>
  </div>
 </td>
 <td> # nonsense </td>
</tr>
<tr class='item_data'> # Here's the content of the output file
 <td> # nonsense </td>
 <td class='justanumber'> # Data needed
  1234
 </td>
 <td> # Here's a td without class but i need to get the data too
  blahblahblah
 </td>
 <td> # Here's the same as before
  123 blahblah
 </td>
 <td> # The same with a number
  1234
 </td>
 <td class='nonsense'> # nonsense </td>
</tr>
<tr class='useless'> # Just useless data </tr>
<tr class='divider'> # And after that will be the next item </tr>
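
Because the rows of one item share no wrapper element, one possible approach is to walk the rows in document order and treat each `header` row as the start of a new item. The block below is only a minimal sketch of that grouping idea. It uses Python's standard-library `html.parser` instead of Scrapy selectors so that it runs on its own. The class name `ItemTableParser`, the `SAMPLE` markup, and the assumption that the wanted values are always cells 2 to 5 of the `item_data` row are illustrative, not taken from the real page.

```python
from html.parser import HTMLParser


class ItemTableParser(HTMLParser):
    """Group flat <tr> rows: a 'header' row opens an item and the
    next 'item_data' row supplies its values."""

    def __init__(self):
        super().__init__()
        self.items = []        # one [title, v1, v2, v3, v4] list per item
        self.row_class = None  # class of the <tr> currently being read
        self.in_title = False  # True while inside the header's <strong>
        self.cells = []        # cell texts of the current item_data row

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row_class = dict(attrs).get("class")
            self.cells = []
            if self.row_class == "header":
                self.items.append([""])  # slot 0 will hold the title
        elif tag == "td" and self.row_class == "item_data":
            self.cells.append("")
        elif tag == "strong" and self.row_class == "header":
            self.in_title = True

    def handle_data(self, data):
        if self.in_title:
            self.items[-1][0] += data
        elif self.row_class == "item_data" and self.cells:
            self.cells[-1] += data

    def handle_endtag(self, tag):
        if tag == "strong" and self.in_title:
            self.in_title = False
            self.items[-1][0] = self.items[-1][0].strip()
        elif tag == "tr":
            if self.row_class == "item_data" and self.items:
                # Assumption: cells 2-5 carry the data, as in the sample.
                self.items[-1].extend(c.strip() for c in self.cells[1:5])
            self.row_class = None


# Hypothetical markup shaped like the structure shown above.
SAMPLE = """
<table>
<tr class='divider'><td>x</td></tr>
<tr class='header'>
 <td>x</td>
 <td class='title'><div class='titlebox'><span>
  <strong> THE TITLE </strong>
 </span></div></td>
 <td>x</td>
</tr>
<tr class='item_data'>
 <td>x</td>
 <td class='justanumber'> 1234 </td>
 <td> blahblahblah </td>
 <td> 123 blahblah </td>
 <td> 1234 </td>
 <td class='nonsense'>x</td>
</tr>
<tr class='useless'><td>x</td></tr>
<tr class='divider'><td>x</td></tr>
</table>
"""

parser = ItemTableParser()
parser.feed(SAMPLE)
rows = parser.items
```

Inside a Scrapy callback the same walk could probably be written over `response.xpath('//tr')`, checking each row's `@class` attribute, but that variant is untested here.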

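Getting the item data is incredibly hard because the rows of an item are not wrapped in any container, so I need some ideas. Here is my Scrapy code so far: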

faktable.py

# -*- coding: utf-8 -*-
import scrapy
from Faktable.items import FaktableItem
from scrapy.http import FormRequest


class FaktableSpider(scrapy.Spider):
    name = "faktable"
    allowed_domains = ["faktable.com"]
    start_urls = (
        'http://faktable.com/login',
    )

    def parse(self, response):
        # Fill in and submit the login form found on the login page.
        return [FormRequest.from_response(response,
            formdata={'user': 'name', 'pass': 'thepass'},
            callback=self.after_login)]

    def after_login(self, response):
        # Placeholder selector: the real row extraction is the part I am stuck on.
        faktable_name = response.xpath('//div[@class=""]').extract()
        for item in faktable_name:
            yield {'name': item}

Expected output:

Title_data;td_data(the first 1234);td_data(the blahblah);td_data(the 123 blah);td_data(the second 1234);
Title_data;td_data(the first 1234);td_data(the blahblah);td_data(the 123 blah);td_data(the second 1234);
Title_data;td_data(the first 1234);td_data(the blahblah);td_data(the 123 blah);td_data(the second 1234);
Title_data;td_data(the first 1234);td_data(the blahblah);td_data(the 123 blah);td_data(the second 1234);
Title_data;td_data(the first 1234);td_data(the blahblah);td_data(the 123 blah);td_data(the second 1234);
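
The output format above can be sketched with two small helpers: one that joins the title and the cell values into a semicolon-terminated line, and one that writes that line to a file named after the title. This is only an illustration under assumptions: the function names `format_item` and `write_item`, the `.txt` extension, and the simple replacement of `/` in the file name are all made up for the example and are not part of the original spider.

```python
import tempfile
from pathlib import Path


def format_item(title, values):
    """Join the title and cell values as 'a;b;c;' (with a trailing ';')."""
    return "".join(f"{field};" for field in [title, *values])


def write_item(out_dir, title, values):
    """Write one item to <out_dir>/<title>.txt and return the file path."""
    # Assumption: replacing '/' is enough to turn the title into a file name.
    path = Path(out_dir) / (title.replace("/", "_") + ".txt")
    path.write_text(format_item(title, values) + "\n", encoding="utf-8")
    return path


# Usage example with values shaped like the sample table.
with tempfile.TemporaryDirectory() as tmp:
    saved = write_item(tmp, "THE TITLE",
                       ["1234", "blahblahblah", "123 blahblah", "1234"])
    saved_name = saved.name
    content = saved.read_text(encoding="utf-8")
```

Keeping the formatting separate from the file writing means the same line could also be appended to a single combined file if one file per item turns out not to be what is wanted.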

0 answers:

No answers