From href

Time: 2015-09-29 13:33:38

Tags: python selenium web-scraping beautifulsoup

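I am trying to get the postal codes of the DFS stores. My approach is to get each store's href and click it; the next page shows the store location, from which I can read the postal code. I can't get this to work, though. Where am I going wrong? I first select the parent elements td.searchResults, then for each one I try to click the href whose title contains DFS and, after the click, read the postalCode. Eventually I want to iterate over all three result pages. If there is a better way, please let me know.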

driver = webdriver.Firefox()
driver.get('http://www.localstore.co.uk/stores/75061/dfs/')
html = driver.page_source
soup = BeautifulSoup(html)
listings = soup.select('td.searchResults')
for l in listings:
    while True:
        driver.find_element_by_css_selector("a[title*='DFS']").click()
        shops = {}
        #info = soup.find('span', itemprop='postalCode').contents
        html = driver.page_source
        soup = BeautifulSoup(html)
        info = soup.find(itemprop="postalCode").get_text()
        shops.append(info)

Update

driver = webdriver.Firefox()
driver.get('http://www.localstore.co.uk/stores/75061/dfs/')
html = driver.page_source
soup = BeautifulSoup(html)
listings = soup.select('td.searchResults')

for l in listings:
    driver.find_element_by_css_selector("a[title*='DFS']").click()
    shops = []
    html = driver.page_source
    soup = BeautifulSoup(html)
    info = soup.find_all('span', attrs={"itemprop": "postalCode"})
    for m in info:
        if m:
           m_text = m.get_text()
           shops.append(m_text)
    print (shops)

2 answers:

Answer 0 (score: 1)

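After playing with this for a while, I don't think Selenium is the best way to do it. It would require calling driver.back() and waiting for elements to reappear, plus a number of other things. I was able to get what you want using only requests, re and bs4. re is part of the Python standard library, and if you haven't installed requests yet, you can do so with pip: pip install requests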

from bs4 import BeautifulSoup
import re
import requests

base_url = 'http://www.localstore.co.uk'
url = 'http://www.localstore.co.uk/stores/75061/dfs/'
res = requests.get(url)
# Name the parser explicitly to avoid bs4's "no parser specified" warning.
soup = BeautifulSoup(res.text, 'html.parser')

shops = []

# Every anchor whose href contains /store/ points at a single store page.
links = soup.find_all('a', href=re.compile(r'/store/'))

for l in links:
    full_link = base_url + l['href']
    # Assumes the title looks like "DFS, <town>, ..."
    town = l['title'].split(',')[1].strip()
    res = requests.get(full_link)
    soup = BeautifulSoup(res.text, 'html.parser')
    info = soup.find('span', attrs={"itemprop": "postalCode"})
    if info is None:
        continue  # no postal code on this page
    postalcode = info.text
    shops.append(dict(town_name=town, postal_code=postalcode))

print(shops)
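
The link handling inside that loop can be checked on its own, without touching the network. The following is only a minimal sketch and not part of the original answer: it swaps string concatenation for `urllib.parse.urljoin` from the standard library, and the helper names `build_store_url` and `town_from_title`, along with the sample href and title, are made up for illustration. It assumes link titles have the form "DFS, Town, ...", as the `split(',')[1]` in the answer implies.

```python
from urllib.parse import urljoin

BASE_URL = 'http://www.localstore.co.uk'


def build_store_url(href, base_url=BASE_URL):
    """Resolve a (possibly relative) href against the site root."""
    return urljoin(base_url, href)


def town_from_title(title):
    """Return the second comma-separated field of a link title, or None."""
    parts = [p.strip() for p in title.split(',')]
    return parts[1] if len(parts) > 1 else None


# Hypothetical values, only to show the shape of the data.
sample_href = '/store/12345/dfs-example/'
sample_title = 'DFS, Exampletown'
print(build_store_url(sample_href))
print(town_from_title(sample_title))
```

Unlike plain concatenation, `urljoin` leaves an already absolute href unchanged, and returning None for a title without a comma avoids the IndexError that `split(',')[1]` would raise.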

Answer 1 (score: 0)

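There are a few problems with your code. You are using an infinite loop with no break condition. Also, shops = {} is a dict, but you are calling the append method on it. Instead of selenium, you could use python-requests or urllib2.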
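
Both problems are easy to reproduce in isolation. This is a minimal sketch with placeholder strings rather than real scraped values, and `collect` is a hypothetical helper that does not appear in the question.

```python
# A dict has no append method, so the question's `shops = {}` fails on first use.
shops = {}
try:
    shops.append('AB1 2CD')  # placeholder value, not a real result
except AttributeError as exc:
    error_message = str(exc)


def collect(items):
    """Gather items into a list; the for loop ends when the items run out."""
    collected = []
    for item in items:
        collected.append(item)
    return collected


print(error_message)
print(collect(['AB1 2CD', 'EF3 4GH']))
```

A plain for loop over the items terminates by itself, which is why the inner `while True` from the question is not needed.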

Within your own code, though, you could do something like this:

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Firefox()
driver.get('http://www.localstore.co.uk/stores/75061/dfs/')
html = driver.page_source
soup = BeautifulSoup(html)
listings = soup.select('td.searchResults')

shops = []  # created once so results accumulate across iterations
for l in listings:
    # Selenium 4 replaces this call with driver.find_element(By.CSS_SELECTOR, ...)
    driver.find_element_by_css_selector("a[title*='DFS']").click()
    html = driver.page_source
    soup = BeautifulSoup(html)
    info = soup.find('span', attrs={"itemprop": "postalCode"})
    if info:
        info_text = info.get_text()
        shops.append(info_text)
print(shops)

In BeautifulSoup, you can find a tag by its attribute like this:

soup.find('span', attrs={"itemprop": "postalCode"})

If it doesn't find anything, it returns None, and calling .get_text() on that result raises AttributeError. So check the result before applying .get_text().
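
One way to package that check is a small helper. The sketch below is only an illustration: `safe_text` is a hypothetical name, and `FakeTag` is a stand-in for a BeautifulSoup tag so that the example runs without bs4 installed. In real code you would pass in the result of `soup.find(...)`.

```python
def safe_text(tag, default=None):
    """Return the stripped text of a tag, or default when find() gave None."""
    if tag is None:
        return default
    return tag.get_text().strip()


class FakeTag:
    """Stand-in for a bs4 Tag; only get_text() is imitated."""

    def __init__(self, text):
        self._text = text

    def get_text(self):
        return self._text


print(safe_text(FakeTag(' AB1 2CD ')))  # placeholder text, not a real postcode
print(safe_text(None, default='unknown'))
```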