Get each href from the same div in Python

Asked: 2017-12-14 12:45:00

Tags: python web-scraping beautifulsoup href

I have this soup:

(screenshot of the parsed HTML, not reproduced here)

The web page shows company references in a grid view (16 rows x 5 columns), and I want to retrieve the URL and title of each reference. The problem is that all 5 references in a row sit inside a single div with the class "row", and when I scrape the page I only get the first reference of each row instead of all 5. This is my code so far:

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'http://www.slimstock.com/nl/referenties/'
r = requests.get(url)
soup = BeautifulSoup(r.content, "lxml")

info_block = soup.find_all("div", attrs={"class": "row"})
references = pd.DataFrame(columns=['Company Name', 'Web Page'])

for entry in info_block:
    try:
        title = entry.find('img').get('title')
        url = entry.a['href']
        urlcontent = BeautifulSoup(requests.get(url).content, "lxml")

        row = [{'Company Name': title, 'Web Page': url}]
        references = references.append(row, ignore_index=True)

    except:
        pass
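The core issue can be demonstrated in isolation: `entry.a` (shorthand for `entry.find('a')`) returns only the *first* anchor inside the div, while `entry.find_all('a')` returns all of them. A minimal sketch against made-up markup (the example URLs and titles are placeholders, not the real page's content):

```python
from bs4 import BeautifulSoup

# Hypothetical HTML mimicking one grid row that contains several anchors.
html = """
<div class="row">
  <a href="https://example.com/1"><img title="Company 1"/></a>
  <a href="https://example.com/2"><img title="Company 2"/></a>
  <a href="https://example.com/3"><img title="Company 3"/></a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
row = soup.find("div", class_="row")

first_href = row.a["href"]                        # only the first anchor
all_hrefs = [a["href"] for a in row.find_all("a")]  # all three anchors
print(first_href)
print(all_hrefs)
```

This is why the question's loop sees one reference per row: each `div.row` is visited once, and `entry.a` / `entry.find('img')` stop at the first match.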

Is there a way to fix this?

2 answers:

Answer 0 (score: 2)

I think you should iterate over the "img" tags, or over the "a" tags. You could write it like this:

for entry in info_block:
    try:
        for a in entry.find_all("a"):
            title = a.find('img').get('title')
            url = a.get('href')
            urlcontent = BeautifulSoup(requests.get(url).content, "lxml")
            row = [{'Company Name': title, 'Web Page': url}]
            references = references.append(row, ignore_index=True)
    except:
        pass
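As a side note, calling `references.append(...)` inside the loop copies the whole DataFrame on every iteration. A common alternative is to collect plain dicts in a list and build the DataFrame once at the end. A minimal sketch, using made-up (title, href) pairs in place of the scraped values:

```python
import pandas as pd

# Placeholder data standing in for (title, href) pairs scraped from the page.
scraped = [("Acme", "https://example.com/acme"),
           ("Globex", "https://example.com/globex")]

# Build the row dicts first, then construct the DataFrame a single time.
rows = [{"Company Name": title, "Web Page": href} for title, href in scraped]
references = pd.DataFrame(rows, columns=["Company Name", "Web Page"])
print(references)
```

This grows a Python list (cheap) instead of rebuilding a DataFrame on every append.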

Answer 1 (score: 1)

import pandas as pd
from bs4 import BeautifulSoup
import requests
url = 'http://www.slimstock.com/nl/referenties/'
r = requests.get(url)
soup = BeautifulSoup(r.content, "lxml")
info_block = soup.find_all("div", attrs={"class": "row"})
references = pd.DataFrame(columns=['Company Name', 'Web Page'])

for entry in info_block:
    anchors = entry.find_all("a")
    for a in anchors:
        try:
            title = a.find('img').get('title')
            url = a['href']
            # urlcontent = BeautifulSoup(requests.get(url).content, "lxml")
            row = [{'Company Name': title, 'Web Page': url}]
            references = references.append(row, ignore_index=True)

        except (AttributeError, KeyError):
            # anchor without an <img> or without an href -- skip it
            pass
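The same extraction can also be written with a CSS selector, which selects every anchor under any `div.row` in one pass and avoids the nested loops. A sketch against a local HTML snippet (the real page's markup may differ):

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two grid rows; titles and URLs are placeholders.
html = """
<div class="row">
  <a href="https://example.com/a"><img title="A"/></a>
  <a href="https://example.com/b"><img title="B"/></a>
</div>
<div class="row">
  <a href="https://example.com/c"><img title="C"/></a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# "div.row a" matches every anchor inside any div with class "row";
# keep only anchors whose <img> actually carries a title attribute.
pairs = [(a.img["title"], a["href"])
         for a in soup.select("div.row a")
         if a.img is not None and a.img.has_attr("title")]
print(pairs)
```

`soup.select` takes any CSS selector supported by BeautifulSoup, so the filter can be tightened (e.g. restricting to a specific container) once the page's real structure is known.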