Question

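I recently started using Python and Scrapy. I have been trying to use Scrapy to start from a movie or actor Wikipedia page, save the name and the cast or filmography, and follow the links in the cast or filmography section to other actor/movie Wikipedia pages.

However, I don't know how rules work (edit: OK, that was a bit of an exaggeration), and the wiki links are deeply nested. I saw that you can restrict by XPath and give an id or class, but most of the links I want don't seem to have a class or id. I'm also not sure whether an XPath also includes the other siblings and children.

So I would like to know which rules to use to filter out irrelevant links and keep only the cast and filmography links.

Edit: Clearly, I should have explained my question better. It's not that I don't understand XPaths and rules at all (that was an exaggeration born of frustration), but I'm clearly not clear on how they work. First, let me show what I have so far, then clarify where I'm running into trouble.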

import logging
import re

from bs4 import BeautifulSoup
# scrapy.contrib.* is the old, deprecated location of these modules
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.exceptions import CloseSpider
from Assignment2_0.items import Assignment20Item

logging.basicConfig(filename='spider.log', level=logging.DEBUG)


class WikisoupSpiderSpider(CrawlSpider):
    name = 'wikisoup_spider'
    allowed_domains = ['en.wikipedia.org']
    start_urls = ['https://en.wikipedia.org/wiki/Keira_Knightley']

    rules = (
        Rule(LinkExtractor(restrict_css='table.wikitable')),
        Rule(LinkExtractor(allow=('(/wiki/)',)),
             callback='parse_crawl', follow=True),
    )

    actor_counter = 0
    actor_max = 250
    movie_counter = 0
    movie_max = 125

    def parse_crawl(self, response):
        items = []
        soup = BeautifulSoup(response.text, 'lxml')
        item = Assignment20Item()
        occupations = ['Actress', 'Actor']
        logging.debug(soup.title)

        # Infobox "role" cell: only present on pages about people
        tempoccu = soup.find('td', class_='role')
        logging.warning('tempoccu only works for pages of people')

        # Infobox "Directed by" header: only present on pages about movies
        tempdir = soup.find('th', text='Directed by')
        logging.warning('tempdir only works for pages of movies')

        if (tempdir is not None) and self.movie_counter < self.movie_max:
            logging.info('Found movie and do not have enough yet')

            item['moviename'] = soup.h1.text
            logging.debug('name is ' + item['moviename'])

            finder = soup.find('th', text='Box office')
            gross = finder.next_sibling.next_sibling.text
            gross_float = re.findall(r"[-+]?\d*\.\d+|\d+", gross)
            item['netgross'] = float(gross_float[0])
            logging.debug('Net gross is ' + gross_float[0])

            finder = soup.find('div', text='Release date')
            date = finder.parent.next_sibling.next_sibling.contents[1].contents[1].contents[1].get_text(" ")
            date = date.replace(u'\xa0', u' ')
            item['releasedate'] = date
            logging.debug('released on ' + item['releasedate'])

            item['type'] = 'movie'
            items.append(item)

        # Compare against the cell's text rather than the Tag object itself
        elif (tempoccu is not None) and any(occu in tempoccu.get_text() for occu in occupations) and self.actor_counter < self.actor_max:
            logging.info('Found actor and do not have enough yet')

            item['name'] = soup.h1.text
            logging.debug('name is ' + item['name'])

            temp = soup.find('span', class_='noprint ForceAgeToShow').text
            age = re.findall(r'\d+', temp)
            item['age'] = int(age[0])
            logging.debug('age is ' + age[0])

            # Film titles are the italicised entries after the Filmography heading
            filmo = []
            finder = soup.find('span', id='Filmography')
            for x in finder.parent.next_sibling.next_sibling.find_all('i'):
                filmo.append(x.text)
            item['filmography'] = filmo
            logging.debug('has done ' + filmo[0])

            item['type'] = 'actor'
            items.append(item)

        elif (self.movie_counter == self.movie_max and self.actor_counter == self.actor_max):
            logging.info('Found enough data')

            raise CloseSpider(reason='finished')

        else:
            logging.info('irrelevant data')

        return items

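Now, my understanding of the rules in my code is that they should allow all wiki links and should take links only from table tags and their descendants. That is clearly not what happens, since the crawl quickly wanders away from movies.

It's clear to me what to do when every element has an identifier such as an id or class, but when inspecting the page, the links are buried in multiple layers of unlabeled nested tags that don't all seem to follow a single pattern (I would use a regular XPath, but different pages have different paths to the filmography, and it doesn't seem that finding the path to the table under h2=Filmography would include all the links in the tables below it). So I would like to know more about how I can get Scrapy to use only the Filmography links (on the actor pages, at any rate).

I apologize if this is something obvious; I started using Python and Scrapy/XPath/CSS only 48 hours ago.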

Answer 1

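First, you need to know where to look, that is, which tags you have to filter on, so you have to inspect the corresponding HTML code of the page. As for libraries, I would use:

import requests

to make the connection, and

from bs4 import BeautifulSoup as bs

as the parser.

Example:

soup = bs('file with html code', "html.parser")

You instantiate that object.

select_tags = soup('select')

This looks up the tags you want to filter.

Then you should loop over your list and add some conditions like this: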

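    for i in select_tags:
        classes = i.get('class')  # list of class names, or None if the tag has no class
        print(classes, type(classes))
        if isinstance(classes, list) and '... name you look for ...' in classes:
            pass  # handle the matching tag here

In this case, you filter by the "class" attribute within the select tags you want.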
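As a self-contained sketch of that filtering idea, the snippet below runs the same class check against a small inline HTML string. The sample markup, the helper name tags_with_class, and the class name wikitable are assumptions chosen for illustration; they are not part of the answer above, and real pages will need their own tag and class names.

```python
from bs4 import BeautifulSoup

# Toy markup (an assumption): three tables, only one carries "wikitable".
SAMPLE_HTML = """
<div>
  <table class="wikitable sortable">
    <tr><td><a href="/wiki/Film_A">Film A</a></td></tr>
  </table>
  <table class="infobox">
    <tr><td><a href="/wiki/Someone">Someone</a></td></tr>
  </table>
  <table>
    <tr><td><a href="/wiki/Other">Other</a></td></tr>
  </table>
</div>
"""


def tags_with_class(soup, tag_name, wanted_class):
    """Return every <tag_name> element whose class list contains wanted_class."""
    matches = []
    for tag in soup(tag_name):  # calling the soup is shorthand for find_all()
        classes = tag.get("class")  # list of class names, or None if absent
        if isinstance(classes, list) and wanted_class in classes:
            matches.append(tag)
    return matches


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
wikitables = tags_with_class(soup, "table", "wikitable")

# Collect the link targets found inside the matching tables only.
hrefs = [a["href"] for table in wikitables for a in table.find_all("a")]
print(hrefs)
```

The isinstance check matters because a tag without a class attribute returns None from get, while a tag with one returns a list, even when it holds a single name.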

Answer 2

If I understand what you want, you probably need to merge your two rules into one, using both allow and restrict_xpaths/restrict_css.

So, something like:

rules = [
    Rule(LinkExtractor(allow=['/wiki/'], restrict_xpaths=['xpath']),
         callback='parse_crawl',
         follow=True)
]

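Scraping Wikipedia is usually quite complicated, especially when trying to access very specific data. I see a few problems with this particular example:

The data lacks structure: it's just a bunch of text in sequence, which means your XPaths are going to be quite complicated. For example, to select the 3 tables you want, you might need to use: //table[preceding-sibling::h2[1][contains(., "Filmography")]]
You only want to follow links in the Title column (the second one); however, because of the way HTML tables are defined, that might not always be represented by the second td of a row. This means you will probably need some extra logic, either in your XPath or in your code.
IMO the biggest problem: lack of consistency. For example, take a look at https://en.wikipedia.org/wiki/Gerard_Butler#Filmography where there is no table, just a list and a link to another article. Basically, you have no guarantee that the information is named, positioned, laid out, or displayed consistently.

These notes may get you started, but getting this information will be a big task.

My recommendation and personal choice would be to get the data you want from a more specialized source, instead of trying to scrape a website as generic as Wikipedia.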

How to use Scrapy rules to crawl from Wikipedia actor and movie pages to only the cast and filmography links
