How can I use XPath to scrape data from HTML (XML) table rows, and the values of their children, based on a text string?

Asked: 2018-07-01 00:38:55

Tags: xpath web-scraping

This is the HTML I want to scrape with XPath:

<table class="ClassGrid" cellspacing="0" cellpadding="0" border="0" id="_ctl0_phMainContent_dgrdClasses" style="border-collapse:collapse;">
<tbody>
    <tr>
        <td class="ClassGridRow1" colspan="3">
            <hr>
        </td>
    </tr>
    <tr>
        <td class="ClassGridRow1">
            <div class="ClassGridBox1">Address 123
                <br>
                <br><a target="_blank" class="gridDirections" href="/Classes/Directions.aspx#104">Directions</a></div>
        </td>
        <td class="ClassGridRow2">
          <div class="ClassGridBox2">12/12/2018</div>
        </td>
        <td class="ClassGridRow3">
            <div class="ClassGridBox3"><a id="_ctl0_phMainContent_dgrdClasses__ctl3_hplAddToCart" class="whitelight" href="/validate.aspx?ClassID1=4233&amp;ClassID2=0">Book Now</a></div>
        </td>
    </tr>
    <tr>
        <td class="ClassGridRow1">
            <div class="ClassGridBox1"></div>
        </td>
        <td class="ClassGridRow2">
          <div class="ClassGridBox2">1/24/2019</div>
        </td>
        <td class="ClassGridRow3">
            <div class="ClassGridBox3"><a id="_ctl0_phMainContent_dgrdClasses__ctl4_hplAddToCart" class="whitelight" href="/validate.aspx?ClassID1=4306&amp;ClassID2=0">Book Now</a></div>
        </td>
    </tr>
    <tr>
        <td class="ClassGridRow1">
            <div class="ClassGridBox1"></div>
        </td>
        <td class="ClassGridRow2">
            <div class="ClassGridBox2">Saturday, August 4</div>
        </td>
        <td class="ClassGridRow3">
            <div class="ClassGridBoxNone"></div>
        </td>
    </tr>
    <tr>
        <td class="ClassGridRow1">
            <div class="ClassGridBox1"></div>
        </td>
        <td class="ClassGridRow2">
            <div class="ClassGridBox2">Saturday, August 18</div>
        </td>
        <td class="ClassGridRow3">
            <div class="ClassGridBox3"><a id="_ctl0_phMainContent_dgrdClasses__ctl6_hplAddToCart" class="whitelight" href="/validate.aspx?ClassID1=4346&amp;ClassID2=0">Book Now</a></div>
        </td>
    </tr>
    <tr>
        <td class="ClassGridRow1">
            <div class="ClassGridBox1"></div>
        </td>
        <td class="ClassGridRow2">
            <div class="ClassGridBox2">Thursday, August 30</div>
        </td>
        <td class="ClassGridRow3">
            <div class="ClassGridBox3"><a id="_ctl0_phMainContent_dgrdClasses__ctl7_hplAddToCart" class="whitelight" href="/validate.aspx?ClassID1=4313&amp;ClassID2=0">Book Now</a></div>
        </td>
    </tr>
    <tr>
        <td class="ClassGridRow1">
            <div class="ClassGridBox1"></div>
        </td>
        <td class="ClassGridRow2">
            <div class="ClassGridBox2">Saturday, September 8</div>
        </td>
        <td class="ClassGridRow3">
            <div class="ClassGridBox3"><a id="_ctl0_phMainContent_dgrdClasses__ctl8_hplAddToCart" class="whitelight" href="/validate.aspx?ClassID1=4330&amp;ClassID2=0">Book Now</a></div>
        </td>
    </tr>
    <tr>
        <td class="ClassGridRow1">
            <div class="ClassGridBox1"></div>
        </td>
        <td class="ClassGridRow2">
            <div class="ClassGridBox2">Tuesday, September 18</div>
        </td>
        <td class="ClassGridRow3">
            <div class="ClassGridBox3"><a id="_ctl0_phMainContent_dgrdClasses__ctl9_hplAddToCart" class="whitelight" href="/validate.aspx?ClassID1=4331&amp;ClassID2=0">Book Now</a></div>
        </td>
    </tr>
    <tr>
        <td class="ClassGridRow1" colspan="3">
            <hr>
        </td>
    </tr>
    <tr>
        <td class="ClassGridRow1">
            <div class="ClassGridBox1">Address 0000
                <br><a target="_blank" class="gridDirections" href="/Classes/Directions.aspx#190">Directions</a></div>
        </td>
        <td class="ClassGridRow2">
            <div class="ClassGridBox2">Saturday, July 21</div>
        </td>
        <td class="ClassGridRow3">
            <div class="ClassGridBox3"><a id="_ctl0_phMainContent_dgrdClasses__ctl11_hplAddToCart" class="whitelight" href="/validate.aspx?ClassID1=4242&amp;ClassID2=0">Book Now</a></div>
        </td>
    </tr>
    <tr>
        <td class="ClassGridRow1">
            <div class="ClassGridBox1"></div>
        </td>
        <td class="ClassGridRow2">
            <div class="ClassGridBox2">Tuesday, August 28</div>
        </td>
        <td class="ClassGridRow3">
            <div class="ClassGridBox3"><a id="_ctl0_phMainContent_dgrdClasses__ctl12_hplAddToCart" class="whitelight" href="/validate.aspx?ClassID1=4243&amp;ClassID2=0">Book Now</a></div>
        </td>
    </tr>
    <tr>
        <td class="ClassGridRow1">
            <div class="ClassGridBox1"></div>
        </td>
        <td class="ClassGridRow2">
            <div class="ClassGridBox2">Tuesday, September 25</div>
        </td>
        <td class="ClassGridRow3">
            <div class="ClassGridBox3"><a id="_ctl0_phMainContent_dgrdClasses__ctl13_hplAddToCart" class="whitelight" href="/validate.aspx?ClassID1=4271&amp;ClassID2=0">Book Now</a></div>
        </td>
    </tr>
    <tr>
        <td class="ClassGridRow1" colspan="3">
            <hr>
        </td>
    </tr>
    <tr>
        <td class="ClassGridRow1">
            <div class="ClassGridBox1">Address 456
                <br><a target="_blank" class="gridDirections" href="/Classes/Directions.aspx#69">Directions</a></div>
        </td>
        <td class="ClassGridRow2">
            <div class="ClassGridBox2">Wednesday, August 1</div>
        </td>
        <td class="ClassGridRow3">
            <div class="ClassGridBox3"><a id="_ctl0_phMainContent_dgrdClasses__ctl15_hplAddToCart" class="whitelight" href="/validate.aspx?ClassID1=4276&amp;ClassID2=0">Book Now</a></div>
        </td>
    </tr>
    <tr>
        <td class="ClassGridRow1">
            <div class="ClassGridBox1"></div>
        </td>
        <td class="ClassGridRow2">
            <div class="ClassGridBox2">Saturday, August 25</div>
        </td>
        <td class="ClassGridRow3">
            <div class="ClassGridBox3"><a id="_ctl0_phMainContent_dgrdClasses__ctl16_hplAddToCart" class="whitelight" href="/validate.aspx?ClassID1=4277&amp;ClassID2=0">Book Now</a></div>
        </td>
    </tr>
    <tr>
        <td class="ClassGridRow1">
            <div class="ClassGridBox1"></div>
        </td>
        <td class="ClassGridRow2">
            <div class="ClassGridBox2">Thursday, September 13</div>
        </td>
        <td class="ClassGridRow3">
            <div class="ClassGridBox3"><a id="_ctl0_phMainContent_dgrdClasses__ctl17_hplAddToCart" class="whitelight" href="/validate.aspx?ClassID1=4348&amp;ClassID2=0">Book Now</a></div>
        </td>
    </tr>
    <tr>
        <td class="ClassGridRow1">
            <div class="ClassGridBox1"></div>
        </td>
        <td class="ClassGridRow2">
            <div class="ClassGridBox2">Saturday, October 6</div>
        </td>
        <td class="ClassGridRow3">
            <div class="ClassGridBox3"><a id="_ctl0_phMainContent_dgrdClasses__ctl18_hplAddToCart" class="whitelight" href="/validate.aspx?ClassID1=4278&amp;ClassID2=0">Book Now</a></div>
        </td>
    </tr>
    <tr>
        <td class="ClassGridRow1">
            <div class="ClassGridBox1"></div>
        </td>
        <td class="ClassGridRow2">
            <div class="ClassGridBox2">Wednesday, October 31</div>
        </td>
        <td class="ClassGridRow3">
            <div class="ClassGridBox3"><a id="_ctl0_phMainContent_dgrdClasses__ctl19_hplAddToCart" class="whitelight" href="/validate.aspx?ClassID1=4279&amp;ClassID2=0">Book Now</a></div>
        </td>
    </tr>
    <tr>
        <td class="ClassGridRow1">
            <div class="ClassGridBox1"></div>
        </td>
        <td class="ClassGridRow2">
            <div class="ClassGridBox2">Saturday, November 17</div>
        </td>
        <td class="ClassGridRow3">
            <div class="ClassGridBox3"><a id="_ctl0_phMainContent_dgrdClasses__ctl20_hplAddToCart" class="whitelight" href="/validate.aspx?ClassID1=4280&amp;ClassID2=0">Book Now</a></div>
        </td>
    </tr>
    <tr>
        <td class="ClassGridRow1" colspan="3">
            <hr>
        </td>
    </tr>
    <tr>
        <td class="ClassGridRow1">
            <div class="ClassGridBox1">Address 789
                <br><a target="_blank" class="gridDirections" href="/Classes/Directions.aspx#223">Directions</a></div>
        </td>
        <td class="ClassGridRow2">
            <div class="ClassGridBox2">Saturday, August 4</div>
        </td>
        <td class="ClassGridRow3">
            <div class="ClassGridBox3"><a id="_ctl0_phMainContent_dgrdClasses__ctl22_hplAddToCart" class="whitelight" href="/validate.aspx?ClassID1=4347&amp;ClassID2=0">Book Now</a></div>
        </td>
    </tr>
    <tr>
        <td class="ClassGridRow1">
            <div class="ClassGridBox1"></div>
        </td>
        <td class="ClassGridRow2">
            <div class="ClassGridBox2">Saturday, August 18</div>
        </td>
        <td class="ClassGridRow3">
            <div class="ClassGridBox3"><a id="_ctl0_phMainContent_dgrdClasses__ctl23_hplAddToCart" class="whitelight" href="/validate.aspx?ClassID1=4305&amp;ClassID2=0">Book Now</a></div>
        </td>
    </tr>
    <tr>
        <td class="ClassGridRow1">
            <div class="ClassGridBox1"></div>
        </td>
        <td class="ClassGridRow2">
            <div class="ClassGridBox2">Thursday, September 20</div>
        </td>
        <td class="ClassGridRow3">
            <div class="ClassGridBox3"><a id="_ctl0_phMainContent_dgrdClasses__ctl24_hplAddToCart" class="whitelight" href="/validate.aspx?ClassID1=4332&amp;ClassID2=0">Book Now</a></div>
        </td>
    </tr>
    <tr>
        <td class="ClassGridRow1" colspan="3">
            <hr>
        </td>
    </tr>
</tbody>
</table>

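I am trying to return the values of ClassGridRow1, ClassGridRow2 and ClassGridBox3 when ClassGridRow1 contains a given text string, for example "Address 123". So far I have not managed to get anything useful beyond the content of the context node. Can anyone help? Thanks a lot!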

1 Answer:

Answer 0 (score: 0)

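If you have the full XPath feature set available (XPath 2.0 or later), you can select the <div class="ClassGridBox1"> nodes and process their text() with the regex function fn:replace: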

//tbody/tr/td/div[@class="ClassGridBox1"]/replace(text()[1],'(^[a-zA-Z.-]+ [0-9]+).*','$1','s')

Demo
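
To see what the regular expression in that expression does, here is a minimal Python sketch that applies the same pattern with re.sub. The function name extract_address and the sample string are assumptions for illustration only (the string is modelled on the first text node of the first ClassGridBox1 div). Python's re.DOTALL stands in for the XPath 's' flag, and \1 for $1.

```python
import re

# Same pattern as in the XPath expression: a word, a space, a number,
# followed by anything else (including newlines, because of DOTALL).
PATTERN = re.compile(r"(^[a-zA-Z.-]+ [0-9]+).*", re.DOTALL)


def extract_address(text: str) -> str:
    """Keep the leading 'word number' part and drop everything after it."""
    return PATTERN.sub(r"\1", text)


# Hypothetical sample: the address followed by the whitespace before <br>.
sample = "Address 123\n                "
print(repr(extract_address(sample)))
```

An empty string (as in the rows whose first cell is empty) does not match the pattern and is returned unchanged.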

Or keep the XPath simple and do some text post-processing afterwards:

//tbody/tr/td/div[@class="ClassGridBox1"]/text()
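
As a rough sketch of that post-processing route, the Python below uses only the standard library's xml.etree.ElementTree, whose limited XPath subset is enough for the class lookups. Several things here are assumptions and not part of the original answer: the helper name rows_for_address, the trimmed sample (rewritten as well-formed XML with <br/> and <hr/>, since ElementTree cannot parse raw HTML), and the reading that a row with an empty first cell belongs to the most recent address above it, with each <hr> row ending a group.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of the table from the question (assumption:
# real pages would first need an HTML parser to reach this shape).
SAMPLE = """
<table class="ClassGrid">
<tbody>
  <tr><td class="ClassGridRow1" colspan="3"><hr/></td></tr>
  <tr>
    <td class="ClassGridRow1"><div class="ClassGridBox1">Address 123
        <br/><a class="gridDirections" href="/Classes/Directions.aspx#104">Directions</a></div></td>
    <td class="ClassGridRow2"><div class="ClassGridBox2">12/12/2018</div></td>
    <td class="ClassGridRow3"><div class="ClassGridBox3"><a class="whitelight" href="/validate.aspx?ClassID1=4233&amp;ClassID2=0">Book Now</a></div></td>
  </tr>
  <tr>
    <td class="ClassGridRow1"><div class="ClassGridBox1"></div></td>
    <td class="ClassGridRow2"><div class="ClassGridBox2">1/24/2019</div></td>
    <td class="ClassGridRow3"><div class="ClassGridBox3"><a class="whitelight" href="/validate.aspx?ClassID1=4306&amp;ClassID2=0">Book Now</a></div></td>
  </tr>
  <tr><td class="ClassGridRow1" colspan="3"><hr/></td></tr>
  <tr>
    <td class="ClassGridRow1"><div class="ClassGridBox1">Address 0000
        <br/><a class="gridDirections" href="/Classes/Directions.aspx#190">Directions</a></div></td>
    <td class="ClassGridRow2"><div class="ClassGridBox2">Saturday, July 21</div></td>
    <td class="ClassGridRow3"><div class="ClassGridBox3"><a class="whitelight" href="/validate.aspx?ClassID1=4242&amp;ClassID2=0">Book Now</a></div></td>
  </tr>
</tbody>
</table>
"""


def rows_for_address(xml_text, wanted):
    """Return (address, date, booking href) for rows grouped under `wanted`."""
    root = ET.fromstring(xml_text)
    results = []
    current = ""  # address of the group we are currently inside
    for tr in root.findall(".//tbody/tr"):
        box1 = tr.find("td/div[@class='ClassGridBox1']")
        if box1 is None:
            # Separator row (<hr/>): the next row starts a new group.
            current = ""
            continue
        # .text is the first text node of the div, i.e. the address line.
        text = (box1.text or "").strip()
        if text:
            current = text
        if wanted not in current:
            continue
        date = tr.findtext("td/div[@class='ClassGridBox2']", default="").strip()
        link = tr.find("td/div[@class='ClassGridBox3']/a")
        href = link.get("href") if link is not None else None
        results.append((current, date, href))
    return results


for row in rows_for_address(SAMPLE, "Address 123"):
    print(row)
```

Rows without a booking link (such as the one using ClassGridBoxNone in the question) would come back with None as the third element.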