Scrapy and XPath - selecting links and link text between sections

Time: 2017-04-17 19:40:13

Tags: xpath scrapy

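Scrapy is a very powerful tool, but sometimes it is frustrating when it comes to XPath. From the following HTML, I want to extract the links and link text (Title 1, Title 2, etc.) that appear between <b>January 2017</b>, <b>February 2017</b> and so on, and group them by "part". The actual HTML: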

<!DOCTYPE html>
<html>
   <head>
      <meta charset="UTF-8">
      <title>Scrapy</title>
   </head>
   <body>
      <hr size=1>
      <h2 style="margin-top: 36px; margin-bottom: 24px">
         Abcd efgh for 2017
      </h2>
      Part 1 | 
      Part 2 | 
      Part 3 | 
      Part 4 | 
      <a href="#">A very bold title</a>
      <hr size="1" style="margin-top: 36px; margin-bottom: 24px">
      <a name="part1"></a>
      <h3>Part 1</h3>
      <ul>
      </ul>
      <a name="part2"></a>
      <h3>Part 2</h3>
      <ul>
      </ul>
      <a name="part3"></a>
      <h3>Part 3</h3>
      <ul>
      </ul>
      <a name="part4"></a>
      <h3>Part 4</h3>
      <ul>
      </ul>
      <div style="margin-top: 36px; margin-bottom: 24px">
         <a name="non_rep"></a>
         <h3>Abcd efgh</h3>
      </div>
      <b>January 2017</b>
      <ul>
         <li>
            <b>Part1 1</b>
         </li>
         <ul>
            <li>
               <a href="/cgi-bin/o.pl?file=/a/1.htm">Title 1</a>
            </li>
            <br>
            <li>
               <a href="/cgi-bin/o.pl?file=/a/11.htm">Title 2</a>
            </li>
            <br>
         </ul>
         <li>
            <b>Part1 2</b>
         </li>
         <ul>
            <li>
               <a href="/cgi-bin/o.pl?file=/a/2.htm">Title A</a>
            </li>
            <br>
            <li>
               <a href="/cgi-bin/o.pl?file=/a/22.htm">Title B</a>
            </li>
            <br>
         </ul>
         <li>
            <b>Part1 3</b>
         </li>
         <ul>
            <li>
               <a href="/cgi-bin/o.pl?file=/a/3.htm">Some text 1</a>
            </li>
            <br>
            <li>
               <a href="/cgi-bin/o.pl?file=/a/33.htm">Some Text 2</a>
            </li>
         </ul>
      </ul>
      <b>February 2017</b>
      <ul>
         <li>
            <b>Part1 1</b>
         </li>
         <ul>
            <li>
               <a href="/cgi-bin/o.pl?file=/b/1.htm">Title 1</a>
            </li>
            <br>
            <li>
               <a href="/cgi-bin/o.pl?file=/b/11.htm">Title 2</a>
            </li>
            <br>
         </ul>
         <li>
            <b>Part1 2</b>
         </li>
         <ul>
            <li>
               <a href="/cgi-bin/o.pl?file=/b/2.htm">Title A</a>
            </li>
            <br>
            <li>
               <a href="/cgi-bin/o.pl?file=/b/22.htm">Title B</a>
            </li>
            <br>
         </ul>
         <li>
            <b>Part1 3</b>
         </li>
         <ul>
            <li>
               <a href="/cgi-bin/o.pl?file=/b/3.htm">Some text 1</a>
            </li>
            <br>
            <li>
               <a href="/cgi-bin/o.pl?file=/b/33.htm">Some Text 2</a>
            </li>
         </ul>
      </ul>
      <b>March 2017</b>
      <ul>
         <li>
            <b>Part1 1</b>
         </li>
         <ul>
            <li>
               <a href="/cgi-bin/o.pl?file=/c/1.htm">Title 1</a>
            </li>
            <br>
            <li>
               <a href="/cgi-bin/o.pl?file=/c/11.htm">Title 2</a>
            </li>
            <br>
         </ul>
         <li>
            <b>Part1 2</b>
         </li>
         <ul>
            <li>
               <a href="/cgi-bin/o.pl?file=/c/2.htm">Title A</a>
            </li>
            <br>
            <li>
               <a href="/cgi-bin/o.pl?file=/c/22.htm">Title B</a>
            </li>
            <br>
         </ul>
         <li>
            <b>Part1 3</b>
         </li>
         <ul>
            <li>
               <a href="/cgi-bin/o.pl?file=/c/3.htm">Some text 1</a>
            </li>
            <br>
            <li>
               <a href="/cgi-bin/o.pl?file=/c/33.htm">Some Text 2</a>
            </li>
         </ul>
      </ul>
      <b>April 2017</b>
      ...
      ...
      So on so forth
   </body>
</html>

The result should be:

January 2017
Part1 1
Title: Title 1, link: /cgi-bin/o.pl?file=/a/1.htm
Title: Title 2, link: /cgi-bin/o.pl?file=/a/11.htm
Part1 2
Title: Title A, link: /cgi-bin/o.pl?file=/a/2.htm
Title: Title B, link: /cgi-bin/o.pl?file=/a/22.htm
Part1 3
Title: Some text 1, link: /cgi-bin/o.pl?file=/a/3.htm
Title: Some Text 2, link: /cgi-bin/o.pl?file=/a/33.htm
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 
February 2017
Part1 1
Title: Title 1, link: /cgi-bin/o.pl?file=/b/1.htm
Title: Title 2, link: /cgi-bin/o.pl?file=/b/11.htm
Part1 2
Title: Title A, link: /cgi-bin/o.pl?file=/b/2.htm
Title: Title B, link: /cgi-bin/o.pl?file=/b/22.htm
Part1 3
Title: Some text 1, link: /cgi-bin/o.pl?file=/b/3.htm
Title: Some Text 2, link: /cgi-bin/o.pl?file=/b/33.htm

I tried //text()[following-sibling::b/text()='January 2017']/following::a[contains(@href, 'cgi-bin')]/text() and similar incantations, to no avail.

How should I approach this?

1 answer:

Answer 0 (score: 1)

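The whole setup is a bit nasty because the tree structure is very flat. However, we can see that it follows a pattern: a <b> node holding the heading text, immediately followed by a <ul> holding the data.
So we can find everything we want with a few loops and the following-sibling::ul[1] XPath.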

The solution is a bit ugly because of the triple loop, but if you ignore that, it is quite simple:

def parse(self, response):
    # Spider callback (assumed context; the original snippet showed only the body).
    # Select any <b> node whose text contains 201x (a year).
    nodes = response.xpath(r"//b[re:test(text(), '201\d')]")
    for node in nodes:
        # Date heading, e.g. "January 2017".
        name = node.xpath('text()').extract_first()
        # Part headings live in the first <ul> after the date heading.
        parts = node.xpath('following-sibling::ul[1]//li/b')
        for part in parts:
            part_name = part.xpath('text()').extract_first()
            # The links sit in the first <ul> after the part's parent <li>.
            links = part.xpath('../following-sibling::ul[1]//a')
            for link in links:
                # Finally we have date, part and link data; put it together.
                yield {
                    'date_name': name,
                    'part_name': part_name,
                    'link_name': link.xpath('text()').extract_first(),
                    'link_url': link.xpath('@href').extract_first(),
                }