Question

I am currently using Scrapy with Python 3.6. My goal is to scrape all the data from a table whose HTML looks like this:

<table class="table table-a">
                    <tbody><tr>
                        <td colspan="2">
                            <h2 class="text-center no-margin">Geometry</h2>
                        </td>
                    </tr>
                    <tr>
                        <td title="Depth of section">h = 267 mm</td>
                        <td rowspan="8" class="text-center">
                            <a href="http://www.staticstools.eu/assets/image/profile-ipea.png" target="_blank">
                                <img src="http://www.staticstools.eu/assets/image/profile-ipea-thumb.png" alt="Section IPEA" class="img-responsive">
                            </a>
                        </td>
                    </tr>
                    <tr>
                        <td title="Width of section">b = 135 mm</td>
                    </tr>
                    <tr>
                        <td title="Flange thickness">t<sub>f</sub> = 8.7 mm</td>
                    </tr>
                    <tr>
                        <td title="Web thickness">t<sub>w</sub> = 5.5 mm</td>
                    </tr>
                    <tr>
                        <td title="Radius of root fillet">r<sub>1</sub> = 15 mm</td>
                    </tr>
                    <tr>
                        <td title="Distance of centre of gravity along y-axis">y<sub>s</sub> = 67.5 mm</td>
                    </tr>
                    <tr>
                        <td title="Depth of straight portion of web">d = 219.6 mm</td>
                    </tr>
                    <tr>
                        <td title="Area of section">A = 3915 mm<sup>2</sup></td>
                    </tr>
                    <tr>
                        <td title="Painting surface per unit lenght">A<sub>L</sub> = 1.04 m<sup>2</sup>.m<sup>-1</sup></td>
                        <td title="Mass per unit lenght">G = 30.7 kg.m<sup>-1</sup></td>
                    </tr>
                </tbody></table>

Some rows contain <sup> and <sub> markup for superscripts and subscripts, which makes the whole job harder. For example, using:

response.css('table.table.table-a td::text').extract()

the output is:

['\n                            ',
 '\n                        ',
 'h = 267 mm',
 '\n                            ',
 '\n                        ',
 'b = 135 mm',
 't',
 ' = 8.7 mm',
 't',
 ' = 5.5 mm',
 'r',
 ' = 15 mm',
 'y',
 ' = 67.5 mm',
 'd = 219.6 mm',
 'A = 3915 mm',
 'A',
 ' = 1.04 m',
 '.m',
 'G = 30.7 kg.m']

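So each cell is split into fragments and the subscript and superscript text is lost. I can also include the text of nested tags with: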

response.css('table.table.table-a td *::text').extract()

which gives this output:

['\n                            ',
 'Geometry',
 '\n                        ',
 'h = 267 mm',
 '\n                            ',
 '\n                                ',
 '\n                            ',
 '\n                        ',
 'b = 135 mm',
 't',
 'f',
 ' = 8.7 mm',
 't',
 'w',
 ' = 5.5 mm',
 'r',
 '1',
 ' = 15 mm',
 'y',
 's',
 ' = 67.5 mm',
 'd = 219.6 mm',
 'A = 3915 mm',
 '2',
 'A',
 'L',
 ' = 1.04 m',
 '2',
 '.m',
 '-1',
 'G = 30.7 kg.m',
 '-1']

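I could of course post-process this data, but I would like to know whether it is possible to get it right during scraping. I would like my output to look like this: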

['h = 267 mm',
 'b = 135 mm',
 'tf = 8.7 mm',
 'tw = 5.5 mm',
 'r1 = 15 mm',
 'ys = 67.5 mm',
 'd = 219.6 mm',
 'A = 3915 mm2',
 'AL = 1.04 m2.m-1',
 'G = 30.7 kg.m-1']

Answer 1

Yes, you can process the data as much as you need inside the parse method of your spider class. Something like the following works here:

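import io

import pandas as pd
import scrapy


class MySpider(scrapy.Spider):
    name = "myspider"

    def start_requests(self):
        # Scrapy needs a scheme (http:// or https://) in every URL
        urls = [
            'http://www.example.com'
        ]

        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # perform data processing below

        # serialized HTML of every <table> element on the page
        data = response.xpath("//table").extract()

        # read_html returns a list of DataFrames; take the first table
        data = pd.read_html(io.StringIO(data[0]))[0]

        # perform data processing above

        # convert to plain Python lists so the JSON exporter can serialize it
        yield {'data': data.values.tolist()}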

Run the following command to save the resulting df to JSON:

scrapy crawl myspider -o table.json

If you want a closer look at some code you could put in the parse method, see the following:

df = pd.read_html(html)[0]

df

   0                 1
0  Geometry          NaN
1  h = 267 mm        NaN
2  b = 135 mm        NaN
3  tf = 8.7 mm       NaN
4  tw = 5.5 mm       NaN
5  r1 = 15 mm        NaN
6  ys = 67.5 mm      NaN
7  d = 219.6 mm      NaN
8  A = 3915 mm2      NaN
9  AL = 1.04 m2.m-1  G = 30.7 kg.m-1

df = pd.DataFrame([i.split(r' ') for i in df[0].map(str)])
df.drop([1,3], axis=1, inplace=True)

df

   0         2
0  Geometry  None
1  h         267
2  b         135
3  tf        8.7
4  tw        5.5
5  r1        15
6  ys        67.5
7  d         219.6
8  A         3915
9  AL        1.04
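
Note that splitting on spaces and dropping columns 1 and 3 discards the units. If you want to keep them, a regular expression is one option. The following is a rough sketch using only the standard library; the pattern and the parse_property name are my own assumptions about the "name = value unit" layout seen in the table, not something the page guarantees.

```python
import re

# Assumed layout: "<name> = <number> <unit>", e.g. "tf = 8.7 mm"
PATTERN = re.compile(
    r'^\s*(?P<name>\S+)\s*=\s*(?P<value>-?\d+(?:\.\d+)?)\s*(?P<unit>\S.*?)\s*$'
)


def parse_property(text):
    """Split one cell string into name, numeric value and unit.

    Returns None for strings that do not fit the assumed layout,
    such as the "Geometry" heading row.
    """
    match = PATTERN.match(text)
    if match is None:
        return None
    return {
        'name': match['name'],
        'value': float(match['value']),
        'unit': match['unit'],
    }


rows = ['Geometry', 'h = 267 mm', 'tf = 8.7 mm', 'AL = 1.04 m2.m-1']
# Keep only the rows that look like "name = value unit"
parsed = [p for p in map(parse_property, rows) if p is not None]
print(parsed)
```

This keeps the heading row out of the result and returns the value as a float, so the data is ready for calculations without a further cleanup pass.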
