Question

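I'm writing a scraper to extract content from different websites. The user enters a URL; my scraper parses the URL, works out which source it comes from (only a limited set of sites is supported), and extracts the content according to that site's DOM structure.

The simplest approach looks like this: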

def extract(soup, url):

  if url in siteA:
    content = soup.find_all('p')[0]
  elif url in siteB:
    content = soup.find_all('p')[3]
  elif url in siteC:
    content = soup.find_all('div', {'id':'ChapterBody'})[0]
  elif url in siteD:
    content = soup.find_all("td", {"class": "content"})[0]

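However, this code becomes repetitive as more sites with different rules are added, so I'd like to condense it and make it easier to maintain. This is what I tried: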

def extract(soup, url):

  support = {
            'siteA': soup.find_all('p')[0],
            'siteB': soup.find_all('p')[3],
            'siteC': soup.find_all('div', {'id':'ChapterBody'})[0],
            'siteD': soup.find_all("td", {"class": "content"})[0]
            }

  if url in support:
    content = support[url]

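This way I only need to maintain the dictionary instead of appending more and more code. However, when I run the code, every key-value pair is executed and an IndexError is raised, because some sites have no 'td' element or no 'div' with id 'ChapterBody', so the siteC / siteD entries raise an error as soon as the dictionary is built.

What would be a possible way to solve this while keeping the code compact?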

Answer 1

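What's happening here is that the code you wrote to extract the content (e.g. soup.find_all('p')[0]) is executed immediately when support is created, which makes sense: you asked Python to assign the return value of soup.find_all('p')[0] to the dictionary value, and that is exactly what it does, and likewise for every other entry.

What you meant to do is store a function that you can execute when you're ready. For that, you can use lambda functions: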

support = {
    'siteA': lambda s: s.find_all('p')[0],
    'siteB': lambda s: s.find_all('p')[3],
}

if url in support:
    content = support[url](soup)
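
To see the difference between the two dictionaries without needing a real page, here is a minimal sketch. The helper extract_first and the lists paragraphs and cells are made-up stand-ins for the soup.find_all(...)[0] calls, not part of the original code:

```python
calls = []


def extract_first(items, label):
    # Record that this extraction ran, then index into the list.
    calls.append(label)
    return items[0]


paragraphs = ["first paragraph"]
cells = []  # this "page" has no matching <td> elements

# Eager: both values are computed while the dict literal is evaluated,
# so the empty list raises IndexError before any lookup can happen.
try:
    eager = {
        "siteA": extract_first(paragraphs, "siteA"),
        "siteD": extract_first(cells, "siteD"),
    }
except IndexError:
    eager = None

eager_calls = list(calls)

# Deferred: the lambdas are stored, not called, so nothing runs yet.
calls.clear()
lazy = {
    "siteA": lambda: extract_first(paragraphs, "siteA"),
    "siteD": lambda: extract_first(cells, "siteD"),
}
lazy_calls_before = list(calls)

# Only the extractor that is looked up and called actually runs.
content = lazy["siteA"]()
```

In the eager version both extractors run even though only one site is wanted; in the deferred version the siteD lambda is never called, so its missing element never causes an error.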

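But maybe one day you'll have a site where the extraction code is more complicated and can't be expressed as a lambda (which only supports a single expression). In that case you can use an ordinary named function instead: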

def site_complicated(s):
    # this is not complicated.. but it could be...
    return s.find_all('p')[0]

support = {
    'siteA': lambda s: s.find_all('p')[0],
    'siteB': lambda s: s.find_all('p')[3],
    'siteComplicated': site_complicated,
}
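
Putting it together, here is one possible sketch of a full extract function. It assumes the dictionary is keyed by hostname and that the hostname is obtained with urllib.parse.urlparse; the hostnames and the FakeSoup class are hypothetical, and FakeSoup exists only so the example runs without a real parsed page:

```python
from urllib.parse import urlparse


class FakeSoup:
    """Hypothetical stand-in for a parsed page; not part of BeautifulSoup."""

    def __init__(self, elements):
        # Maps a tag name to the list of matches find_all should return.
        self._elements = elements

    def find_all(self, name, attrs=None):
        return self._elements.get(name, [])


def site_complicated(s):
    # A multi-statement extractor that would not fit in a lambda.
    paragraphs = s.find_all("p")
    return paragraphs[0] if paragraphs else None


# Hypothetical hostnames; replace them with the sites actually supported.
SUPPORT = {
    "site-a.example": lambda s: s.find_all("p")[0],
    "site-b.example": lambda s: s.find_all("p")[3],
    "complicated.example": site_complicated,
}


def extract(soup, url):
    # Pick the extractor by hostname and run only that one.
    host = urlparse(url).hostname
    extractor = SUPPORT.get(host)
    if extractor is None:
        raise ValueError(f"unsupported site: {host}")
    return extractor(soup)
```

Because each extractor takes the soup as an argument, the dictionary can live at module level and be built once, rather than being rebuilt on every call.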

Answer 2

Turn the dictionary into a dictionary of functions:

support = {
          'siteA': lambda: soup.find_all('p')[0],
          'siteB': lambda: soup.find_all('p')[3],
          'siteC': lambda: soup.find_all('div', {'id':'ChapterBody'})[0],
          'siteD': lambda: soup.find_all("td", {"class": "content"})[0]
          }

Now they won't be executed until you call the function:

if url in support:
    content = support[url]()
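
One thing to keep in mind with zero-argument lambdas: they read soup from the enclosing scope when they are called, so the dictionary has to be built somewhere soup is visible, for example inside extract. A minimal sketch, using a plain dict as a hypothetical stand-in for the parsed page:

```python
def extract(soup, key):
    # Built inside the function so the lambdas can see `soup`.
    # Here `soup` is assumed to be a dict mapping a tag name to a list.
    support = {
        "siteA": lambda: soup["p"][0],
        "siteB": lambda: soup["p"][3],
    }
    if key in support:
        # Only the selected lambda is called.
        return support[key]()
    return None
```

The trade-off is that the dictionary is rebuilt on every call, which is usually negligible for a handful of entries.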

Alternatively, pulling the soup.find_all() call out and keeping a dictionary of (params, index) tuples is also an option:

support = {
          'siteA': (('p',), 0),
          'siteB': (('p',), 3),
          'siteC': (('div', {'id':'ChapterBody'}), 0),
          'siteD': (("td", {"class": "content"}), 0)
          }

if url in support:
    param, index = support[url]
    content = soup.find_all(*param)[index]
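
A sketch of the tuple approach with a bounds check added, so a missing element returns None instead of raising IndexError. Note that a one-element tuple needs a trailing comma: ('p') is just the string 'p', while ('p',) is a tuple. The FakeSoup class and the None fallback are assumptions for illustration, not part of the original code:

```python
class FakeSoup:
    """Hypothetical stand-in for a parsed page, used only to make this runnable."""

    def __init__(self, matches):
        self._matches = matches  # tag name -> list of elements

    def find_all(self, name, attrs=None):
        return self._matches.get(name, [])


# Each rule is (positional arguments for find_all, index of the wanted match).
RULES = {
    "siteA": (("p",), 0),
    "siteB": (("p",), 3),
    "siteC": (("div", {"id": "ChapterBody"}), 0),
    "siteD": (("td", {"class": "content"}), 0),
}


def extract(soup, site):
    """Return the configured element, or None if the site or element is missing."""
    rule = RULES.get(site)
    if rule is None:
        return None
    args, index = rule
    matches = soup.find_all(*args)
    return matches[index] if index < len(matches) else None
```

Since the rules are plain data, they could also be loaded from a configuration file, at the cost of only supporting sites whose content can be reached with a single find_all call.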

How can I put mutually exclusive methods in a dictionary in Python?
