Question

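For practice, I have been learning Python and web scraping with BeautifulSoup. I am trying to write a program that can find the Team page on a website and scrape the team members' names. Here is an example of a "Team" page: http://plasticbank.org/team-speakers/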

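I have noticed that all team pages show "Team" in visibly larger text, but not all sites use header tags, which makes them hard to parse. I have already loaded the URL with urllib2. How can I get from a site's home page to its "Team" page, or really to any page on a particular topic? It is the same problem as finding a contact page: how do you tell a scraper to find it?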

Here is the full section of my code so far (it only loads the website):

import urllib2

#Pre: url is a string containing the address of a website
#return: A string with the URL formatted to include http://
def ensureurl(url):
    if '//' not in url:
        return "http://" + url
    else:
        return url

#Pre: url is a string containing the address of a website
#return: The HTML code at that URL or an empty string if the URL could not be processed
def read_url(url):
    url = ensureurl(url)
    print url

    try:
        #User agent spoofing to trick sites into thinking the bot is a human.
        #This does not work on all sites.
        hdr = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0',
               'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
               'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
               'Accept-Encoding': 'none',
               'Accept-Language': 'en-US,en;q=0.8',
               'Connection': 'keep-alive'}
        request = urllib2.Request(url, headers=hdr)
        return urllib2.urlopen(request).read()
    except urllib2.HTTPError as e:
        print e.fp.read()
        return ""

Answer 1

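A scraper can't find things on its own. You need to describe what you are looking for in technical terms, which means you have to set up some rules that define what a 'team' page is.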

As a rule of thumb, to be able to identify something with BeautifulSoup, you should be able to identify it by looking at the HTML.
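One possible rule, as a rough sketch: scan the home page for links whose visible text or href mentions a keyword such as "team". The function name `find_candidate_links` and the keyword list are made up for illustration, the match is a plain substring test (so "steam" would also match "team"), and the code is written for Python 3 with the `bs4` package rather than the Python 2 `urllib2` setup in the question. It works on an HTML string, so you can feed it whatever `read_url` returns.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical keyword list; tune it for the sites you care about.
TEAM_KEYWORDS = ("team", "our people", "staff", "about us")


def find_candidate_links(html, base_url, keywords=TEAM_KEYWORDS):
    """Return absolute URLs of links whose text or href mentions a keyword."""
    soup = BeautifulSoup(html, "html.parser")
    found = []
    for a in soup.find_all("a", href=True):
        text = a.get_text(" ", strip=True).lower()
        href = a["href"].lower()
        # Multi-word keywords usually appear hyphenated in URLs.
        if any(k in text or k.replace(" ", "-") in href for k in keywords):
            url = urljoin(base_url, a["href"])  # resolve relative links
            if url not in found:                # keep order, drop duplicates
                found.append(url)
    return found


if __name__ == "__main__":
    sample = '<a href="/team-speakers/">Our Team</a> <a href="/contact">Contact</a>'
    print(find_candidate_links(sample, "http://plasticbank.org/"))
```

This only narrows the search down to candidates; you would still need to load each candidate and check whether it really is a team page.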

In your specific case, this is a very non-trivial task. Maybe you could start by looking at the 'title' tag? If I were you, that is where I would begin.
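To make the 'title' idea concrete, here is a minimal sketch that checks the page title, and falls back to the first few heading levels, for a keyword. The name `looks_like_team_page` and the choice of `h1` to `h3` are my own assumptions, not an established recipe, and as you noted some sites don't use heading tags at all, so treat a negative result as "unknown". It is written for Python 3 with `bs4`.

```python
from bs4 import BeautifulSoup


def looks_like_team_page(html, keyword="team"):
    """Guess whether a page is a 'team' page from its title and headings."""
    soup = BeautifulSoup(html, "html.parser")
    candidates = []
    # soup.title is None when the page has no <title>; .string is None
    # when the title contains nested markup.
    if soup.title and soup.title.string:
        candidates.append(soup.title.string)
    # Assumption: large "Team" text is often an h1-h3 heading.
    for tag in soup.find_all(["h1", "h2", "h3"]):
        candidates.append(tag.get_text(" ", strip=True))
    return any(keyword in text.lower() for text in candidates)


if __name__ == "__main__":
    page = "<html><head><title>Team &amp; Speakers</title></head><body></body></html>"
    print(looks_like_team_page(page))
```

Matching ordinary paragraph text is deliberately left out, since almost any page might mention the word somewhere in its body.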

Find a "Team" page using BeautifulSoup
