
Time: 2015-10-12 21:05:06

Tags: python list python-2.7 web-scraping beautifulsoup

I have data stored in a list like the following:

date_name = [<a href="/president/washington/speeches/speech-3455">Proclamation of Neutrality (April 22, 1793)</a>,
<a class="transcript" href="/president/washington/speeches/speech-3455">Transcript</a>, 
<a href="/president/washington/speeches/speech-3456">Fifth Annual Message to Congress (December 3, 1793)</a>, 
<a class="transcript" href="/president/washington/speeches/speech-3456">Transcript</a>, 
<a href="/president/washington/speeches/speech-3721">Proclamation against Opposition to Execution of Laws and Excise Duties in Western Pennsylvania (August 7, 1794)</a>]

These are not str elements inside date_name. I'm trying to get Proclamation of Neutrality (April 22, 1793), Fifth Annual Message to Congress (December 3, 1793), and Proclamation against Opposition to Execution of Laws and Excise Duties in Western Pennsylvania (August 7, 1794), so that I can get the dates from each of those speeches. I want to do this for 900+ speeches. Here's the code I've been trying, as it worked for a similar problem I had in another list comprehension scenario:

url = 'http://www.millercenter.org/president/speeches'

connection = urllib2.urlopen(url)
html = connection.read()
date_soup = BeautifulSoup(html)
date_name = date_soup.find_all('a')
del date_name[:203]         # delete extraneous html before first link (for obama 4453)

# do something with the following list comprehensions
dater = [tag.get('<a href=') for tag in date_name if tag.get('<a href=') is not None]

# remove all items in list that don't contain '<a href=', as this string is unique
# to the elements in date_name that I want
speeches_dates = [_ for _ in dater if re.search('<a href=',_)]

However, dater comes out as an empty list, so I'm unable to move forward to construct speeches_dates.

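1 Answer:

Answer 0 (score: 2)

What you are seeing is a ResultSet, which is a list of Tag instances. When you print a Tag, you get its HTML string representation. What you need is to get the text: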

# skip the first 203 links, matching `del date_name[:203]` in the question
date_name = date_soup.find_all('a')[203:]
print([item.get_text(strip=True) for item in date_name])
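
To make the difference between a Tag and its printed HTML concrete, here is a minimal sketch that runs without fetching the page. It uses two links copied from the question's sample data; the variable names and the choice of the built-in "html.parser" are assumptions made for illustration. It shows why `tag.get('<a href=')` returns None (`get` looks up an attribute by name) and one possible way to drop the "Transcript" links by their class:

```python
from bs4 import BeautifulSoup

# Two links copied from the question's sample data.
html = (
    '<a href="/president/washington/speeches/speech-3455">'
    'Proclamation of Neutrality (April 22, 1793)</a>'
    '<a class="transcript" href="/president/washington/speeches/speech-3455">'
    'Transcript</a>'
)

soup = BeautifulSoup(html, "html.parser")
links = soup.find_all("a")  # a ResultSet of Tag objects, not strings

# Tag.get() looks up an attribute by name. '<a href=' is not an
# attribute name, so every lookup returns None.
missing = [tag.get("<a href=") for tag in links]

# Use the attribute name for the link target and get_text() for the text.
hrefs = [tag.get("href") for tag in links]
texts = [tag.get_text(strip=True) for tag in links]

# Keep only the title links by skipping tags whose class list
# contains "transcript" (tags without a class yield None, hence `or []`).
titles = [
    tag.get_text(strip=True)
    for tag in links
    if "transcript" not in (tag.get("class") or [])
]
print(titles)
```

Filtering on the parsed attributes avoids running regular expressions over the HTML string form of each tag.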

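Also, as far as I understand, you need the links to the speeches (the ones in the main content, whose text includes the dates). In that case, make your locator more specific instead of finding all a tags: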

import urllib2
from bs4 import BeautifulSoup

url = 'http://www.millercenter.org/president/speeches'

date_soup = BeautifulSoup(urllib2.urlopen(url), "lxml")
speeches = date_soup.select('div#listing div.title a[href*=speeches]')

for speech in speeches:
    text = speech.get_text(strip=True)
    print(text)

Prints:

Acceptance Speech at the Democratic National Convention (August 28, 2008)
Remarks on Election Night (November 4, 2008)
Inaugural Address (January 20, 2009)
...
Talk to the Cherokee Nation (August 29, 1796)
Farewell Address (September 19, 1796)
Eighth Annual Message to Congress (December 7, 1796)
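
Since the goal in the question is to get the date of each speech, the printed titles could then be split into a title and a date. The following is only a sketch: it assumes every title ends with a date in the form "(Month day, year)", as in the output above, and the names `DATE_AT_END` and `split_title_and_date` are hypothetical helpers, not part of any library.

```python
import re
from datetime import datetime

# Titles copied from the output above.
titles = [
    "Acceptance Speech at the Democratic National Convention (August 28, 2008)",
    "Farewell Address (September 19, 1796)",
    "Eighth Annual Message to Congress (December 7, 1796)",
]

# Assumed format: the date sits in parentheses at the very end of the text.
DATE_AT_END = re.compile(r"\(([A-Z][a-z]+ \d{1,2}, \d{4})\)\s*$")


def split_title_and_date(text):
    """Return (title, date), or (text, None) if no trailing date is found."""
    match = DATE_AT_END.search(text)
    if match is None:
        return text, None
    try:
        date = datetime.strptime(match.group(1), "%B %d, %Y").date()
    except ValueError:
        # The parenthesised text looked like a date but is not a valid one.
        return text, None
    return text[:match.start()].rstrip(), date


for title, date in map(split_title_and_date, titles):
    print(title, date)
```

Returning None for titles without a recognizable date keeps the loop from failing partway through a list of 900+ speeches; those entries can be inspected separately.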