Get each href from the same div in Python

Asked: 2017-12-14 12:45:00

Tags: python web-scraping beautifulsoup href

I have this soup:

(screenshot of the parsed HTML, not reproduced here)

The web page shows company references in a grid view (16 rows x 5 columns), and I want to retrieve the URL and title of each reference. The problem is that all 5 references in a row sit inside a single div with the class "row", and when I scrape the page I only get the first reference of each row instead of all 5. This is my code so far:

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'http://www.slimstock.com/nl/referenties/'
r = requests.get(url)
soup = BeautifulSoup(r.content, "lxml")

info_block = soup.find_all("div", attrs={"class": "row"})
references = pd.DataFrame(columns=['Company Name', 'Web Page'])

for entry in info_block:
    try:
        title = entry.find('img').get('title')
        url = entry.a['href']
        urlcontent = BeautifulSoup(requests.get(url).content, "lxml")

        row = [{'Company Name': title, 'Web Page': url}]
        references = references.append(row, ignore_index=True)

    except:
        pass
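The core issue can be demonstrated in isolation: `entry.a` (shorthand for `entry.find('a')`) returns only the *first* anchor inside the div, while `entry.find_all('a')` returns all of them. A minimal sketch against made-up markup (the example URLs and titles are placeholders, not the real page's content):

```python
from bs4 import BeautifulSoup

# Hypothetical HTML mimicking one grid row that contains several anchors.
html = """
<div class="row">
  <a href="https://example.com/1"><img title="Company 1"/></a>
  <a href="https://example.com/2"><img title="Company 2"/></a>
  <a href="https://example.com/3"><img title="Company 3"/></a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
row = soup.find("div", class_="row")

first_href = row.a["href"]                        # only the first anchor
all_hrefs = [a["href"] for a in row.find_all("a")]  # all three anchors
print(first_href)
print(all_hrefs)
```

This is why the question's loop sees one reference per row: each `div.row` is visited once, and `entry.a` / `entry.find('img')` stop at the first match.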

Is there a way to fix this?

2 answers:

Answer 0 (score: 2)

I think you should iterate over the "img" tags, or over the "a" tags. You could write it like this:

for entry in info_block:
    try:
        for a in entry.find_all("a"):
            title = a.find('img').get('title')
            url = a.get('href')
            urlcontent = BeautifulSoup(requests.get(url).content, "lxml")
            row = [{'Company Name': title, 'Web Page': url}]
            references = references.append(row, ignore_index=True)
    except:
        pass
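As a side note, calling `references.append(...)` inside the loop copies the whole DataFrame on every iteration. A common alternative is to collect plain dicts in a list and build the DataFrame once at the end. A minimal sketch, using made-up (title, href) pairs in place of the scraped values:

```python
import pandas as pd

# Placeholder data standing in for (title, href) pairs scraped from the page.
scraped = [("Acme", "https://example.com/acme"),
           ("Globex", "https://example.com/globex")]

# Build the row dicts first, then construct the DataFrame a single time.
rows = [{"Company Name": title, "Web Page": href} for title, href in scraped]
references = pd.DataFrame(rows, columns=["Company Name", "Web Page"])
print(references)
```

This grows a Python list (cheap) instead of rebuilding a DataFrame on every append.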

Answer 1 (score: 1)

import pandas as pd
from bs4 import BeautifulSoup
import requests
url = 'http://www.slimstock.com/nl/referenties/'
r = requests.get(url)
soup = BeautifulSoup(r.content, "lxml")
info_block = soup.find_all("div", attrs={"class": "row"})
references = pd.DataFrame(columns=['Company Name', 'Web Page'])

for entry in info_block:
    anchors = entry.find_all("a")
    for a in anchors:
        try:
            title = a.find('img').get('title')
            url = a['href']
            # urlcontent = BeautifulSoup(requests.get(url).content, "lxml")
            row = [{'Company Name': title, 'Web Page': url}]
            references = references.append(row, ignore_index=True)

        except (AttributeError, KeyError):
            # anchor without an <img> or without an href -- skip it
            pass
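The same extraction can also be written with a CSS selector, which selects every anchor under any `div.row` in one pass and avoids the nested loops. A sketch against a local HTML snippet (the real page's markup may differ):

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two grid rows; titles and URLs are placeholders.
html = """
<div class="row">
  <a href="https://example.com/a"><img title="A"/></a>
  <a href="https://example.com/b"><img title="B"/></a>
</div>
<div class="row">
  <a href="https://example.com/c"><img title="C"/></a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# "div.row a" matches every anchor inside any div with class "row";
# keep only anchors whose <img> actually carries a title attribute.
pairs = [(a.img["title"], a["href"])
         for a in soup.select("div.row a")
         if a.img is not None and a.img.has_attr("title")]
print(pairs)
```

`soup.select` takes any CSS selector supported by BeautifulSoup, so the filter can be tightened (e.g. restricting to a specific container) once the page's real structure is known.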