I'm currently using Scrapy with Python 3.6. My goal is to scrape all the data from a table whose HTML looks like this:
<table class="table table-a">
<tbody><tr>
<td colspan="2">
<h2 class="text-center no-margin">Geometry</h2>
</td>
</tr>
<tr>
<td title="Depth of section">h = 267 mm</td>
<td rowspan="8" class="text-center">
<a href="http://www.staticstools.eu/assets/image/profile-ipea.png" target="_blank">
<img src="http://www.staticstools.eu/assets/image/profile-ipea-thumb.png" alt="Section IPEA" class="img-responsive">
</a>
</td>
</tr>
<tr>
<td title="Width of section">b = 135 mm</td>
</tr>
<tr>
<td title="Flange thickness">t<sub>f</sub> = 8.7 mm</td>
</tr>
<tr>
<td title="Web thickness">t<sub>w</sub> = 5.5 mm</td>
</tr>
<tr>
<td title="Radius of root fillet">r<sub>1</sub> = 15 mm</td>
</tr>
<tr>
<td title="Distance of centre of gravity along y-axis">y<sub>s</sub> = 67.5 mm</td>
</tr>
<tr>
<td title="Depth of straight portion of web">d = 219.6 mm</td>
</tr>
<tr>
<td title="Area of section">A = 3915 mm<sup>2</sup></td>
</tr>
<tr>
<td title="Painting surface per unit lenght">A<sub>L</sub> = 1.04 m<sup>2</sup>.m<sup>-1</sup></td>
<td title="Mass per unit lenght">G = 30.7 kg.m<sup>-1</sup></td>
</tr>
</tbody></table>
Some rows contain <sup> and <sub> index formatting, which makes everything harder. What I mean is that using:
response.css('table.table.table-a td::text').extract()
the output is:
['\n ',
'\n ',
'h = 267 mm',
'\n ',
'\n ',
'b = 135 mm',
't',
' = 8.7 mm',
't',
' = 5.5 mm',
'r',
' = 15 mm',
'y',
' = 67.5 mm',
'd = 219.6 mm',
'A = 3915 mm',
'A',
' = 1.04 m',
'.m',
'G = 30.7 kg.m']
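So everything is a bit of a mess: the text inside the nested tags is missing and the cells are split into fragments. I can also include the nested tags with: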
response.css('table.table.table-a td *::text').extract()
which gives this output:
['\n ',
'Geometry',
'\n ',
'h = 267 mm',
'\n ',
'\n ',
'\n ',
'\n ',
'b = 135 mm',
't',
'f',
' = 8.7 mm',
't',
'w',
' = 5.5 mm',
'r',
'1',
' = 15 mm',
'y',
's',
' = 67.5 mm',
'd = 219.6 mm',
'A = 3915 mm',
'2',
'A',
'L',
' = 1.04 m',
'2',
'.m',
'-1',
'G = 30.7 kg.m',
'-1']
I could of course post-process this data, but I'd like to know whether it can be done during scraping. I'd like my output data to look like this:
['h = 267 mm',
'b = 135 mm',
'tf = 8.7 mm',
'tw = 5.5 mm',
'r1 = 15 mm',
'ys = 67.5 mm',
'd = 219.6 mm',
'A = 3915 mm2',
'AL = 1.04 m2.m-1',
'G = 30.7 kg.m-1']
Answer 0 (score: 1)
Yes, you can process the data as much as you need inside the parse method of your spider class. Something like the following works here:
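import io

import pandas as pd
import scrapy


class MySpider(scrapy.Spider):
    name = "myspider"

    def start_requests(self):
        # Scrapy needs a URL scheme; replace this with the real page URL
        urls = [
            'http://www.example.com'
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # take the first table's HTML and let pandas parse it
        html = response.xpath("//table").get()
        df = pd.read_html(io.StringIO(html))[0]
        # perform any further data processing here
        # a DataFrame is not JSON-serializable, so yield plain records
        yield {'data': df.to_dict('records')}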
Run the following command to save the resulting df to JSON (the name must match the spider's name attribute):
scrapy crawl myspider -o table.json
If you want a closer look at some code you could put into the parse method, see the following:
df = pd.read_html(html)[0]
df
                  0                1
0          Geometry              NaN
1        h = 267 mm              NaN
2        b = 135 mm              NaN
3       tf = 8.7 mm              NaN
4       tw = 5.5 mm              NaN
5        r1 = 15 mm              NaN
6      ys = 67.5 mm              NaN
7      d = 219.6 mm              NaN
8      A = 3915 mm2              NaN
9  AL = 1.04 m2.m-1  G = 30.7 kg.m-1
df = pd.DataFrame([i.split(r' ') for i in df[0].map(str)])
df.drop([1,3], axis=1, inplace=True)
df
          0      2
0  Geometry   None
1         h    267
2         b    135
3        tf    8.7
4        tw    5.5
5        r1     15
6        ys   67.5
7         d  219.6
8         A   3915
9        AL   1.04
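Splitting on single spaces and dropping columns discards the unit and depends on the exact spacing. As a hedged alternative, a regular expression can pull the name, numeric value, and unit out of each string using only the standard library. The pattern and the parse_property helper below are illustrative assumptions based on the "name = value unit" shape of the rows shown above, not part of the original answer.

```python
import re

# Symbol, "=", a numeric value, then the unit as the rest of the string.
PATTERN = re.compile(r'^\s*(\S+)\s*=\s*(-?\d+(?:\.\d+)?)\s*(.*?)\s*$')


def parse_property(text):
    """Split 'tf = 8.7 mm' into a dict; return None for non-matching rows."""
    match = PATTERN.match(text)
    if match is None:
        return None  # e.g. the 'Geometry' heading row
    name, value, unit = match.groups()
    return {'name': name, 'value': float(value), 'unit': unit}


rows = ['Geometry', 'h = 267 mm', 'tf = 8.7 mm', 'AL = 1.04 m2.m-1']
# Keep only the rows that look like a property definition.
parsed = [p for p in map(parse_property, rows) if p is not None]
print(parsed)
```

Rows that do not match, such as the heading, are skipped rather than producing a row of None values, and the unit stays available for later use.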