Python / BeautifulSoup: retrieving the 'href' attribute

Time: 2016-11-22 15:51:45

Tags: python-2.7 web-scraping beautifulsoup

I am trying to get the href attribute from a website I am scraping. My script:

from bs4 import BeautifulSoup
import requests
import csv


i = 1
for i in range(1, 2, 1):
   i = str(i)
   baseurl = "https://www.quandoo.nl/amsterdam?page=" + i
   r1 = requests.get(baseurl)
   data = r1.text
   soup = BeautifulSoup(data, "html.parser")
   for link in soup.findAll('span', {'class', "merchant-title", 'itemprop', "name", 'a'}):
       print link

This returns the following:

<span class="merchant-title" itemprop="name"><a href="https://www.quandoo.nl/place/ristorante-due-napoletani-5644" itemprop="url">Ristorante Due Napoletani</a></span>
<span class="merchant-title" itemprop="name"><a href="https://www.quandoo.nl/place/yamyam-4850" itemprop="url">YamYam</a></span>
<span class="merchant-title" itemprop="name"><a href="https://www.quandoo.nl/place/the-golden-temple-5278" itemprop="url">The Golden Temple</a></span>
<span class="merchant-title" itemprop="name"><a href="https://www.quandoo.nl/place/sampurna-4609" itemprop="url">Sampurna</a></span>
<span class="merchant-title" itemprop="name"><a href="https://www.quandoo.nl/place/motto-sushi-25471" itemprop="url">Motto Sushi</a></span>
<span class="merchant-title" itemprop="name"><a href="https://www.quandoo.nl/place/takumi-ya-8171" itemprop="url">Takumi-Ya</a></span>
<span class="merchant-title" itemprop="name"><a href="https://www.quandoo.nl/place/casa-di-david-19167" itemprop="url">Casa di David</a></span>

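(This is only part of it. I didn't want to bombard you with the entire output.) I have no problem pulling out the restaurant name as a string, but I cannot find a configuration that gives me the href attribute. With my current configuration, the .strip() method does not seem viable. Any help would be great.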

1 answer:

Answer 0: (score: 1)

Try this code; it works for me:

from bs4 import BeautifulSoup
import requests
import csv

import re


i = 1
for i in range(1, 2, 1):
   i = str(i)
   baseurl = "https://www.quandoo.nl/amsterdam?page=" + i
   r1 = requests.get(baseurl)
   data = r1.text
   soup = BeautifulSoup(data, "html.parser")
   for link in soup.findAll('span', {'class', "merchant-title", 'itemprop', "name", 'a'}):
       # group(1) is the captured URL; group(0) would also include the leading href="
       match = re.search(r'href=[\'"]?([^\'" >]+)', str(link)).group(1)
       print match
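
As an alternative to running a regular expression over the tag's string form, the href can probably be read straight from the parsed tag, since BeautifulSoup exposes a tag's attributes through dictionary-style access. The following is only a minimal sketch, not part of the original answer: it parses two of the spans from the output shown in the question instead of fetching the live page, and the helper name extract_links is made up for illustration. It also passes the attribute filter as a dict, whereas the question's script passes a set literal.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the output shown in the question
# (used here instead of a live request, so the sketch runs offline).
html = """
<span class="merchant-title" itemprop="name"><a href="https://www.quandoo.nl/place/ristorante-due-napoletani-5644" itemprop="url">Ristorante Due Napoletani</a></span>
<span class="merchant-title" itemprop="name"><a href="https://www.quandoo.nl/place/yamyam-4850" itemprop="url">YamYam</a></span>
"""


def extract_links(markup):
    """Return (name, href) pairs for each merchant-title span.

    extract_links is a hypothetical helper introduced for this sketch.
    """
    soup = BeautifulSoup(markup, "html.parser")
    results = []
    # attrs is a dict here: attribute name -> required value.
    for span in soup.find_all("span", {"class": "merchant-title", "itemprop": "name"}):
        anchor = span.find("a")  # the <a> nested inside the span
        if anchor is None:
            continue
        href = anchor.get("href")  # returns None if the attribute is missing
        if href:
            results.append((anchor.get_text(strip=True), href))
    return results


links = extract_links(html)
for name, href in links:
    print(name + " -> " + href)
```

In the original loop, the same idea would presumably amount to replacing the regular expression with something like link.a.get('href'), assuming every matched span really does contain an anchor.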