Question

I am using XPath with Scrapy to scrape data from the movie website BoxOfficeMojo.com.

As a general question: I would like to know how to select certain child nodes of one parent node, all in a single XPath string.

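Depending on the movie page I am scraping, the data I need is sometimes located in different child nodes, for example depending on whether or not there is a link. I will be going through about 14,000 movies, so this process needs to be automated.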

Take the Action Jackson page as an example. I need the actors, directors and producers.

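This is the XPath for the director. Note: the %s corresponds to a previously determined index where that information is found; in the Action Jackson example, the director is found at [1] and the actors at [2].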

 //div[@class="mp_box_content"]/table/tr[%s]/td[2]/font/text()

However, if the director's name links to a page about the director, this would be the XPath:

 //div[@class="mp_box_content"]/table/tr[%s]/td[2]/font/a/text()

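Actors are a bit trickier, because a <br> is included between the subsequent actors listed, and each name may be a child of an /a or a direct child of the parent /font, so: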

 //div[@class="mp_box_content"]/table/tr[%s]/td[2]/font//a/text()

This gets almost all of the actors (all except those under font/br).

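Now, I believe the main problem here is that there are multiple //div[@class="mp_box_content"] elements. Everything I have works, except that I also end up getting some numbers from the other mp_box_content blocks. I have also added numerous try: and except: statements in order to get everything (actors, directors and producers, both with and without links associated with them). For example, the following is my Scrapy code for actors: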

 actors = hxs.select('//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font//a/text()' % (locActor,)).extract()
 try:
     second = hxs.select('//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font/text()' % (locActor,)).extract()
     for n in second:
         actors.append(n)
 except:
     actors = hxs.select('//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font/text()' % (locActor,)).extract()

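This is an attempt to cover two cases: the first actor may not have a link associated with him/her while the subsequent actors do, or the first actor may have a link while the rest do not.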

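I appreciate you taking the time to read this, and any attempt to help me find or address this problem. Please let me know if more information is needed.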

Answer 1

I am assuming you are only interested in the textual content, not in the links to the actors' pages and so on.

Here is a proposal that uses lxml.html (and lxml.etree) directly.

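First, I recommend selecting the td[2] cells by the text content of td[1], with expressions like .//tr[starts-with(td[1], "Director")]/td[2], which account for both "Director" and "Directors".
Second, testing various expressions for each case (with or without <a>, and so on) makes the code hard to read and maintain. Since you are only interested in the text content, you might as well use string(.//tr[starts-with(td[1], "Actor")]/td[2]) to get the text, or call lxml.html.tostring(e, method="text", encoding=unicode) on the selected elements.
Third, for the <br> issue with multiple names, my usual approach is to modify the lxml tree that contains the target content, adding a special formatting character (for example \n) to the .text or .tail of the <br> elements, using one of lxml's iter() functions. This can also be useful for other HTML block elements, such as <hr>.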
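The label-based lookup and the <br> marker trick can be sketched on a small scale as follows. This is only an illustration under stated assumptions: the markup is a hypothetical, simplified stand-in for the "The Players" table, the helper names find_value_cell and names_in are made up for this example, and it uses the standard library's xml.etree.ElementTree so that it runs without extra dependencies. ElementTree's XPath subset has no starts-with(), so the row match is done in plain Python; lxml elements expose the same .iter(), .text, .tail and .itertext() API.

```python
import xml.etree.ElementTree as ET

MARKER = "|"

# Hypothetical, simplified stand-in for the "The Players" table.
SAMPLE = """
<table>
  <tr><td><font>Director:</font></td>
      <td><font><a href="#">Craig R. Baxley</a></font></td></tr>
  <tr><td><font>Actors:</font></td>
      <td><font><a href="#">Carl Weathers</a><br/>Craig T. Nelson<br/><a href="#">Sharon Stone</a></font></td></tr>
</table>
"""


def find_value_cell(table, label):
    """Return the second cell of the first row whose first cell starts with label."""
    for row in table.iter("tr"):
        cells = row.findall("td")
        if len(cells) >= 2:
            first_cell_text = "".join(cells[0].itertext()).strip()
            if first_cell_text.startswith(label):
                return cells[1]
    return None


def names_in(cell):
    """Split the text of a cell into names, treating <br> elements as separators."""
    if cell is None:
        return []
    for br in cell.iter("br"):
        # Put the marker in front of whatever text follows the <br>.
        br.tail = MARKER + (br.tail or "")
    text = "".join(cell.itertext())
    return [name.strip() for name in text.split(MARKER) if name.strip()]


table = ET.fromstring(SAMPLE.strip())
directors = names_in(find_value_cell(table, "Director"))
actors = names_in(find_value_cell(table, "Actor"))
print(directors)
print(actors)
```

Matching on "Director" rather than on a row index means the lookup does not depend on which row a category happens to occupy, and a missing category simply yields an empty list.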

You may see what I mean more clearly with some spider code:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
import lxml.etree
import lxml.html

MARKER = "|"
def br2nl(tree):
    for element in tree:
        for elem in element.iter("br"):
            elem.text = MARKER

def extract_category_lines(tree):
    if tree is not None and len(tree):
        # modify the tree by setting MARKER as the text of each <br> element
        br2nl(tree)

        # use lxml's .tostring() to get a unicode string
        # and split lines on the marker we added above
        # so we get lists of actors, producers, directors...
        return lxml.html.tostring(
            tree[0], method="text", encoding=unicode).split(MARKER)

class BoxOfficeMojoSpider(BaseSpider):
    name = "boxofficemojo"
    start_urls = [
        "http://www.boxofficemojo.com/movies/?id=actionjackson.htm",
        "http://www.boxofficemojo.com/movies/?id=cloudatlas.htm",
    ]

    # locate 2nd cell by text content of first cell
    XPATH_CATEGORY_CELL = lxml.etree.XPath('.//tr[starts-with(td[1], $category)]/td[2]')
    def parse(self, response):
        root = lxml.html.fromstring(response.body)

        # locate the "The Players" table
        players = root.xpath('//div[@class="mp_box"][div[@class="mp_box_tab"]="The Players"]/div[@class="mp_box_content"]/table')

        # we have only one table in "players" so the for loop is not really necessary
        for players_table in players:

            directors_cells = self.XPATH_CATEGORY_CELL(players_table,
                category="Director")
            actors_cells = self.XPATH_CATEGORY_CELL(players_table,
                category="Actor")
            producers_cells = self.XPATH_CATEGORY_CELL(players_table,
                category="Producer")
            writers_cells = self.XPATH_CATEGORY_CELL(players_table,
                category="Writer")
            composers_cells = self.XPATH_CATEGORY_CELL(players_table,
                category="Composer")

            directors = extract_category_lines(directors_cells)
            actors = extract_category_lines(actors_cells)
            producers = extract_category_lines(producers_cells)
            writers = extract_category_lines(writers_cells)
            composers = extract_category_lines(composers_cells)

            print "Directors:", directors
            print "Actors:", actors
            print "Producers:", producers
            print "Writers:", writers
            print "Composers:", composers
            # here you should of course populate scrapy items

The code can be simplified, but I hope you get the idea.
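One possible simplification, shown here only as a sketch, is to loop over the category labels instead of repeating the lookup once per category, and to clean up the marker-separated text in one place. The names CATEGORIES, clean_names and collect_players are hypothetical, and the lookup is replaced by canned text so the example runs on its own; in the spider it would be whatever function returns the marker-separated text for a category.

```python
CATEGORIES = ("Director", "Actor", "Producer", "Writer", "Composer")
MARKER = "|"


def clean_names(raw_text):
    """Split marker-separated text and drop empty or whitespace-only pieces."""
    return [name.strip() for name in raw_text.split(MARKER) if name.strip()]


def collect_players(lookup):
    """Call lookup(category) once per category and clean each result.

    lookup may return None for a category that is absent from the page.
    """
    return {
        category: clean_names(lookup(category) or "")
        for category in CATEGORIES
    }


# Stand-in for the real cell lookup: canned marker-separated text.
sample_text = {
    "Director": "Craig R. Baxley",
    "Actor": "Carl Weathers|Craig T. Nelson|Sharon Stone|",
}
players = collect_players(sample_text.get)
print(players)
```

Keeping the labels in one tuple also makes a copy-paste slip between categories less likely, since each label appears exactly once.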

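You can of course do something similar with HtmlXPathSelector (using the string() XPath function, for example), but without modifying the tree for the <br> elements (how would you do that with hxs?) it only works for the single-name cells in your case: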

>>> hxs.select('string(//div[@class="mp_box"][div[@class="mp_box_tab"]="The Players"]/div[@class="mp_box_content"]/table//tr[contains(td, "Director")]/td[2])').extract()
[u'Craig R. Baxley']
>>> hxs.select('string(//div[@class="mp_box"][div[@class="mp_box_tab"]="The Players"]/div[@class="mp_box_content"]/table//tr[contains(td, "Actor")]/td[2])').extract()
[u'Carl WeathersCraig T. NelsonSharon Stone']
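One way to keep the names separate without touching the tree is to ask for the individual text nodes instead of one concatenated string. In XPath terms that would mean ending the expression with td[2]//text() rather than wrapping it in string(), much as the expressions in the question already do with text(). The sketch below only illustrates the difference between the two results; it uses the standard library's itertext() on hypothetical, simplified markup and has not been run against the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup for one "Actors" cell.
CELL = ET.fromstring(
    '<td><font><a href="#">Carl Weathers</a><br/>'
    'Craig T. Nelson<br/><a href="#">Sharon Stone</a></font></td>'
)

# string()-style result: all text concatenated, name boundaries lost.
joined = "".join(CELL.itertext())

# Text-node-style result: one entry per text node, boundaries preserved.
names = [text.strip() for text in CELL.itertext() if text.strip()]

print(joined)
print(names)
```

The caveat is that a name split across nested inline elements would come back in several pieces, and whitespace-only text nodes need filtering, which is why the marker approach can still be the more robust choice.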
