Selecting a specific toggle_link with Python

Time: 2014-06-27 14:44:30

Tags: python python-2.7 beautifulsoup

The larger goal is to find specific House bills. With this code, I am trying to select the link /legislation?q=%7B%22congress%22%3A%22113%22%2C%22chamber%22%3A%22House%22%7D to narrow my search.

from bs4 import BeautifulSoup
import urllib2

soup = BeautifulSoup(urllib2.urlopen("https://beta.congress.gov/legislation")) 

for link in soup.find_all('a'):
    soup_links = link.get('href') 

import re   

r1 = re.compile(r'/legislation(\?\S+congress\S+chamber\S+House\S+)')
print r1.findall(soup_links)

When I run this, I get an empty list instead of the link.

The problem is not my regular expression, because the following works:

r2 = re.compile(r'\S+congress\S+chamber\S+House\S+')
newstring = '/legislation?q=%7B%22congress%22%3A%22113%22%2C%22chamber%22%3A%22House%22%7D'
print r2.findall(newstring)

1 answer:

Answer 0 (score: 1)

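You rebind soup_links to a new value on every iteration; when the loop ends, it holds only the last href attribute, and that is the only string your regular expression is ever tested against.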
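To make the overwrite visible without hitting the network, here is a minimal sketch. The hrefs list is a hypothetical stand-in for what link.get('href') would return on the page (including a None for an anchor with no href), and last_href and matching_links are illustrative names, not part of the original code. It uses print() with a single argument so it runs on both Python 2.7 and Python 3.

```python
import re

# Hypothetical stand-in for the values link.get('href') would return;
# anchors without an href attribute yield None.
hrefs = [
    '/legislation?q=%7B%22congress%22%3A%22113%22%2C%22chamber%22%3A%22House%22%7D',
    None,
    '/about',
]

r1 = re.compile(r'/legislation(\?\S+congress\S+chamber\S+House\S+)')

# Pattern from the question: the name is rebound on every pass,
# so only the last value survives the loop.
for href in hrefs:
    last_href = href
overwritten_result = r1.findall(last_href)

# Collecting variant: test each href as it is seen and keep the matches.
matching_links = [h for h in hrefs if h and r1.search(h)]

print(overwritten_result)
print(matching_links)
```

The first result is empty because only the final href is ever tested; the second keeps every href that matches while the loop is still looking at it.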

BeautifulSoup can do the searching for you; pass the compiled regular expression as the href filter:

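import re
import urllib2
from bs4 import BeautifulSoup

soup = BeautifulSoup(urllib2.urlopen("https://beta.congress.gov/legislation"))

r1 = re.compile(r'/legislation(\?\S+congress\S+chamber\S+House\S+)')
# Keep the href of every <a> tag whose href matches the pattern
soup_links = [l['href'] for l in soup.find_all('a', href=r1)]
print soup_links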

This produces the one matching link:

>>> from bs4 import BeautifulSoup
>>> import urllib2
>>> soup = BeautifulSoup(urllib2.urlopen("https://beta.congress.gov/legislation")) 
>>> r1 = re.compile(r'/legislation(\?\S+congress\S+chamber\S+House\S+)')
>>> [l['href'] for l in soup.find_all('a', href=r1)]
['/legislation?q=%7B%22congress%22%3A%22113%22%2C%22chamber%22%3A%22House%22%7D']

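If you expect only one link to match, use soup.find() instead of soup.find_all(); it returns the first matching tag, or None if nothing matches: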

soup = BeautifulSoup(urllib2.urlopen("https://beta.congress.gov/legislation")) 

r1 = re.compile(r'/legislation(\?\S+congress\S+chamber\S+House\S+)')
soup_link = soup.find('a', href=r1)
print soup_link['href']