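My question is about getting data out of a badly structured table. I have made a lot of attempts without any result >.< Maybe someone has a great idea for parsing this table.
What I need to do:
Example of the table structure: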
<tr class='divider'> # nonsense </tr>
<tr class='header'> # here's the name of the future file
    <td> # nonsense </td>
    <td class='title'>
        <div class='titlebox'> # really ...
            <span>
                <strong> THE TITLE </strong> # this is data to be scraped
            </span>
        </div>
    </td>
    <td> # nonsense </td>
</tr>
<tr class='item_data'> # here's the content of the output file
    <td> # nonsense </td>
    <td class='justanumber'> # data needed
        1234
    </td>
    <td> # a td without a class, but I need its data too
        blahblahblah
    </td>
    <td> # same as before
        123 blahblah
    </td>
    <td> # the same, with a number
        1234
    </td>
    <td class='nonsense'> # nonsense </td>
</tr>
<tr class='useless'> # just useless data </tr>
<tr class='divider'> # and after that comes the next item </tr>
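Getting the item data is incredibly hard because the items are not wrapped in any container element, so I need some ideas. I will keep the Scrapy code example below updated: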
faktable.py
# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import FormRequest

from Faktable.items import FaktableItem


class FaktableSpider(scrapy.Spider):
    name = "faktable"
    allowed_domains = ["faktable.com"]
    start_urls = (
        'http://faktable.com/login',
    )

    def parse(self, response):
        # Fill in and submit the login form found on the start page
        return [FormRequest.from_response(response,
                    formdata={'user': 'name', 'pass': 'thepass'},
                    callback=self.after_login)]

    def after_login(self, response):
        # Placeholder selector: this is the part I am stuck on
        faktable_name = response.xpath('//div[@class=""]').extract()
        for item in faktable_name:
            # A plain dict stands in until the FaktableItem fields are defined
            yield {'raw': item}
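Since the rows are siblings with no wrapper, one possible approach is to walk the `<tr>` elements in document order, remember the most recent `header` title, and attach it to every `item_data` row that follows. The sketch below is only an illustration under some assumptions: a recent Scrapy version (for `.get()`; older releases use `.extract_first()`), single-valued `class` attributes as in the sample, and the wanted cells being every `<td>` except the first and the last. The helper names `rows_from_response` and `group_rows` are hypothetical.

```python
def rows_from_response(response):
    """Yield (css_class, cell_texts) for each <tr>, in document order.

    `response` is assumed to be a Scrapy response (or any selector
    object offering the same .xpath() interface).
    """
    for tr in response.xpath('//tr'):
        css_class = tr.xpath('@class').get(default='')
        if css_class == 'header':
            # The title sits inside td.title > div > span > strong
            title = tr.xpath(
                'normalize-space(.//td[@class="title"]//strong)').get()
            yield css_class, [title]
        elif css_class == 'item_data':
            # Assumption: skip the first and the last <td> ("nonsense" cells)
            tds = tr.xpath('./td')[1:-1]
            yield css_class, [td.xpath('normalize-space(.)').get()
                              for td in tds]
        else:
            # divider / useless rows carry no data
            yield css_class, []


def group_rows(rows):
    """Attach the most recent header title to each following item_data row."""
    title = None
    for css_class, cells in rows:
        if css_class == 'header':
            title = cells[0] if cells else None
        elif css_class == 'item_data' and title is not None:
            yield [title] + list(cells)
```

Inside `after_login` this could be used as `for record in group_rows(rows_from_response(response)): ...`. An alternative that stays entirely in XPath would be to start from each `item_data` row and look up `preceding-sibling::tr[@class="header"][1]`, which selects the nearest header above it.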
Expected output:
Title_data;td_data(the first 1234);td_data(the blahblah);td_data(the 123 blah);td_data(the second 1234);
Title_data;td_data(the first 1234);td_data(the blahblah);td_data(the 123 blah);td_data(the second 1234);
Title_data;td_data(the first 1234);td_data(the blahblah);td_data(the 123 blah);td_data(the second 1234);
Title_data;td_data(the first 1234);td_data(the blahblah);td_data(the 123 blah);td_data(the second 1234);
Title_data;td_data(the first 1234);td_data(the blahblah);td_data(the 123 blah);td_data(the second 1234);
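To make the target format concrete, here is a minimal sketch of how one line of that output could be produced once a title and its four cell values are available. The function names `format_record` and `write_records` are hypothetical, and the sketch assumes the values themselves never contain a semicolon, since nothing is escaped.

```python
import io


def format_record(title, cells):
    """Join the title and the cell values into one semicolon-terminated line."""
    fields = [title.strip()] + [cell.strip() for cell in cells]
    return ';'.join(fields) + ';'


def write_records(records, out):
    """Write one line per (title, cells) pair to a file-like object."""
    for title, cells in records:
        out.write(format_record(title, cells) + '\n')


if __name__ == '__main__':
    buffer = io.StringIO()
    write_records(
        [(' THE TITLE ', ['1234', 'blahblahblah', '123 blahblah', '1234'])],
        buffer,
    )
    print(buffer.getvalue(), end='')
```

The trailing semicolon is kept on purpose so that the lines match the sample output.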