Question

我正在构建一个实用程序，从一堆不同格式的网站中提取类似的数据（标题和日期），而BeautifulSoup非常有帮助。我还没有找到一个很好的方法来存储我正在使用的BeautifulSoup函数，这样我就不必为每个站点构建一个新函数。这是一个例子：

soup = BeautifulSoup(html)
title = soup.find("h4", "title").text    # extract title
date = soup.find('li', 'when').em.text       # extract date

每个站点都将有一组不同的要解析的节点。有数百个站点，为每个站点构建一个独特的功能是愚蠢的。有没有办法将汤.find（'x'）。etc等调用存储在URL旁边的表中，只需在一个函数中应用正确的BeautifulSoup调用？希望这是有道理的。

谢谢！

Answer 1

嗯，假设我理解你的帖子，这会有用吗？

linkInstructions = {
  "url1": {
    "title": lambda n: n.find('h4', 'title').text,
    "date": lambda n: n.find('li', 'when').em.text
  },
  "url2": {
    "title": lambda n: n.find('h3', 'title').text,
    "date": lambda n: n.find('li', 'when').strong.text
  }
  # and so forth
} 

def parseNode(node, url):
  # let 'node' be the result of BeautifulSoup(html)
  # and 'url' be the url of the site    

  result = {}

  for key,func in linkInstructions[url].iteritems():
    result[key] = func(node)

  # would return a dict with the structure {'title': <title>, 'date': <date>}
  return result

编辑：糟糕，枚举不是正确的功能。

存储和重用BeautifulSoup搜索和节点遍历

1 个答案: