Python: Scraping speech data from a website

Time: 2014-07-01 09:39:59

Tags: python beautifulsoup

Basically, I want to get all of Mitt Romney's speeches from this link:

http://mittromneycentral.com/speeches/

I know how to use BeautifulSoup to get all of the URLs from the link above.

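import urllib2
import urlparse
from pprint import pprint
from bs4 import BeautifulSoup  # assumed: BeautifulSoup 4, since find_all is used below

def mywebcrawl(url):
    urls = []
    htmltext = urllib2.urlopen(url).read()
    soup = BeautifulSoup(htmltext)
    for tag in soup.findAll('a', href=True):
        # resolve relative links against the page URL
        tag['href'] = urlparse.urljoin(url, tag['href'])
        urls.append(tag['href'])
    pprint(urls)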

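However, for each URL I am unable to extract the speech itself (note that I want only the speech, without any unrelated material). I would like to build a function that iterates through the list of URLs and extracts the speech from each one. I tried soup.find_all('table') and soup.find_all('font'), but I could not get the desired result; most of the time they failed to extract the entire speech.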

1 Answer:

Answer 0: (score: 0)

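Here is the strategy I used:

  • The speech content is contained in <div class="entry-content">.
  • The speech itself is in <p> tags that have no class attribute. The other <p> tags under that <div> do have a class attribute.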
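The two observations above can be checked on a small inline fragment before touching the real site. The following is only a sketch under stated assumptions: it targets Python 3 and uses the standard library's html.parser in place of BeautifulSoup so that it runs without extra packages, and the SAMPLE markup (including the "sidebar", "meta", "share" and "label" class names) is invented for illustration rather than copied from the site.

```python
from html.parser import HTMLParser


class EntryContentParagraphs(HTMLParser):
    """Collect the text of class-less <p> tags inside <div class="entry-content">."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0         # > 0 while inside the entry-content div
        self.in_paragraph = False  # True while inside a <p> we want to keep
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.div_depth > 0:
                self.div_depth += 1  # nested div inside entry-content
            elif "entry-content" in (attrs.get("class") or "").split():
                self.div_depth = 1
        elif tag == "p":
            # Keep only paragraphs inside the div that carry no class attribute.
            self.in_paragraph = self.div_depth > 0 and "class" not in attrs
            if self.in_paragraph:
                self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth > 0:
            self.div_depth -= 1
            if self.div_depth == 0:
                self.in_paragraph = False
        elif tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        if self.in_paragraph:
            self.paragraphs[-1] += data


def extract_paragraphs(html):
    parser = EntryContentParagraphs()
    parser.feed(html)
    parser.close()
    return [p.strip() for p in parser.paragraphs]


# Invented markup that mimics the structure described above.
SAMPLE = """
<div class="sidebar"><p>Ignore me</p></div>
<div class="entry-content">
  <p>Thank you all.</p>
  <p class="meta">Posted in 2006</p>
  <div class="share"><p class="label">Share this</p></div>
  <p>It is <em>good</em> to be here.</p>
</div>
<p>Footer</p>
"""

if __name__ == "__main__":
    for paragraph in extract_paragraphs(SAMPLE):
        print(paragraph)
```

Only the two class-less paragraphs inside the entry-content div are kept; the sidebar, the "meta" line, the nested "share" block and the footer are skipped.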

Below is the code that gets the list of speeches and parses the speech text out of a speech page:

from BeautifulSoup import BeautifulSoup as BS

def get_list_of_speeches(html):
    soup = BS(html)
    content_div = soup.findAll('div', {"class":"entry-content"})[0]
    speech_links = content_div.findAll('a')
    speeches = []
    for speech in speech_links:
        title = speech.text.encode('utf-8')
        link = speech['href']
        speeches.append( (title, link) )
    return speeches

# speeches.htm is a saved copy of http://mittromneycentral.com/speeches/
speech_html = open('speeches.htm').read()
speeches = get_list_of_speeches(speech_html)

def get_speech_text(html):
    soup = BS(html)
    content_div = soup.findAll('div', {"class":"entry-content"})[0]
    content = content_div.findAll('p', {"class":None})
    speech = ''
    for paragraph in content:
        speech += paragraph.text.encode('utf-8') + '\n'
    return speech


# file1.htm is a saved copy of http://mittromneycentral.com/speeches/2006-speeches/092206-values-voters-summit-2006
html = open('file1.htm').read()
print get_speech_text(html)
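The two functions above work on saved files; the question also asks for a loop over every speech URL. One possible way to tie the pieces together is sketched below. It is written for Python 3 (in Python 2, urljoin lives in the urlparse module), and collect_speeches together with its fetch, list_speeches and extract_text parameters are hypothetical names introduced here, not part of the original answer. Passing the functions in as arguments keeps the loop independent of how pages are downloaded or parsed.

```python
from urllib.parse import urljoin


def collect_speeches(index_url, fetch, list_speeches, extract_text):
    """Return {title: speech_text} for every link found on the index page.

    fetch(url)          -> HTML string (hypothetical; e.g. a urlopen wrapper)
    list_speeches(html) -> list of (title, link) pairs
    extract_text(html)  -> the speech text as a string
    """
    speeches = {}
    for title, link in list_speeches(fetch(index_url)):
        # Links on the index page may be relative, so resolve them first.
        speech_url = urljoin(index_url, link)
        try:
            speeches[title] = extract_text(fetch(speech_url))
        except (IOError, IndexError) as exc:
            # IOError: the download failed.
            # IndexError: no entry-content div was found (findAll(...)[0]).
            print("skipping %s: %s" % (speech_url, exc))
    return speeches
```

With the answer's functions, a call might look like collect_speeches('http://mittromneycentral.com/speeches/', fetch, get_list_of_speeches, get_speech_text), where fetch is a small download wrapper you supply. Pages that fail to download or that lack the expected div are reported and skipped rather than aborting the whole run.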