BeautifulSoup - collect href links and create a list of links

Posted: 2019-09-27 17:19:43

Tags: python web-scraping beautifulsoup

I'm trying to collect all the links from a gun listing (2 pages in this case) and print 1) the length and 2) the links themselves.

I get the error message: 'list' object has no attribute 'select'

from bs4 import BeautifulSoup
import requests
import csv
import pandas
from pandas import DataFrame
import re
import os
import locale
os.environ["PYTHONIOENCODING"] = "utf-8"

page = 1
all_links = []
url="https://www.gunstar.co.uk/view-trader/global-rifle-snipersystems/58782?page={}"

with requests.Session() as session:
    while True:
        print(url.format(page))
        res = session.get(url.format(page))
        soup = BeautifulSoup(res.content, 'html.parser')
        gun_details = soup.select('div.details')
        for link in gun_details.select('a'):
            all_links.append("https://www.gunstar.co.uk" + link['href'])
        if len(soup.select(".nav_next")) == 0:
            break
        page += 1

If I remove .content from the response, I get an error that the response object has no len().

If I build the soup from .text instead, select('div.details') gives the same result as above.

I'm sure I'm going wrong somewhere fairly simple, but I can't seem to see it - is there a reason select and findAll don't work when trying to target a specific part of the HTML?
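For context, soup.select returns a ResultSet (essentially a list of Tag objects), and calling .select on that list, rather than on one of its elements, is what raises this error. A minimal sketch with made-up HTML:

```python
from bs4 import BeautifulSoup

html = """
<div class="details"><a href="/rifles/1">first</a></div>
<div class="details"><a href="/rifles/2">second</a></div>
"""
soup = BeautifulSoup(html, "html.parser")

details = soup.select("div.details")   # a ResultSet -- behaves like a list of Tags
# details.select("a")                  # AttributeError: the list itself has no .select

# .select / .select_one work on individual Tag objects, so iterate first:
links = [tag.select_one("a")["href"] for tag in details]
print(links)  # ['/rifles/1', '/rifles/2']
```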

2 Answers:

Answer 0 (score: 2):

You can get the links from all the pages in different ways. Here is one approach that achieves the same thing using a generator:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

link = "https://www.gunstar.co.uk/view-trader/global-rifle-snipersystems/58782"
base = "https://www.gunstar.co.uk"

def get_links(url):
    res = requests.get(url)
    soup = BeautifulSoup(res.text,'lxml')
    for item in soup.select(".details > a"):
        yield urljoin(base,item['href'])

    next_page = soup.select_one(".gallery_navigation [rel='next']")
    if next_page:
        yield from get_links(next_page['href'])

if __name__ == '__main__':
    list_of_links = list(get_links(link))
    print(list_of_links)
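A note on the urljoin call above: unlike plain string concatenation, it handles both relative and already-absolute hrefs correctly, so it is safe regardless of how the site writes its links. A quick illustration (the paths are made up):

```python
from urllib.parse import urljoin

base = "https://www.gunstar.co.uk"

# Relative href: joined onto the base
print(urljoin(base, "/rifles/1083802"))
# -> https://www.gunstar.co.uk/rifles/1083802

# Already-absolute href: returned unchanged instead of being doubled up
print(urljoin(base, "https://www.gunstar.co.uk/rifles/1083802"))
# -> https://www.gunstar.co.uk/rifles/1083802
```

If the rel='next' href on this site turns out to be relative, the recursive call would need the same treatment, i.e. get_links(urljoin(base, next_page['href'])).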

Answer 1 (score: 1):

Try the following code.

from bs4 import BeautifulSoup
import requests
import csv
import pandas
from pandas import DataFrame
import re
import os
import locale
os.environ["PYTHONIOENCODING"] = "utf-8"

page = 1

url="https://www.gunstar.co.uk/view-trader/global-rifle-snipersystems/58782?page={}"

with requests.Session() as session:
    while True:
        all_links = []
        print(url.format(page))
        res = session.get(url.format(page))
        soup = BeautifulSoup(res.content, 'html.parser')
        gun_details = soup.select('div.details')
        for link in gun_details:
            all_links.append("https://www.gunstar.co.uk" + link.select_one('a')['href'])
        print(all_links)
        if len(soup.select(".nav_next")) == 0:
            break
        page += 1

Output:

https://www.gunstar.co.uk/view-trader/global-rifle-snipersystems/58782?page=1
['https://www.gunstar.co.uk/mauser-m96-lightning-hunter-straight-pull-270-rifles/rifles/1083802', 'https://www.gunstar.co.uk/magtech-586-12-bore-gauge-pump-action/Shotguns/1083784', 'https://www.gunstar.co.uk/merkel-kr1-bolt-action-308-rifles/rifles/1083786', 'https://www.gunstar.co.uk/christensen-arms-r93-carbon-bolt-action-7-mm-rifles/rifles/1083788', 'https://www.gunstar.co.uk/voere-lbw-luxus-bolt-action-308-rifles/rifles/1083792', 'https://www.gunstar.co.uk/voere-2155-bolt-action-243-rifles/rifles/1083797', 'https://www.gunstar.co.uk/voere-2155-2155-synthetic-bolt-action-308-rifles/rifles/1083798', 'https://www.gunstar.co.uk/mauser-m96-lightning-hunter-straight-pull-7-mm-rifles/rifles/1083799', 'https://www.gunstar.co.uk/blaser-lrs2-straight-pull-308-rifles/rifles/1084397', 'https://www.gunstar.co.uk/remington-700-s-s-barrel-only-bolt-action-300-win-mag-rifles/rifles/1084432']
https://www.gunstar.co.uk/view-trader/global-rifle-snipersystems/58782?page=2
['https://www.gunstar.co.uk/pfeiffer-waffen-handy-hunter-sr2-single-shot-300-win-mag-rif/rifles/1084433', 'https://www.gunstar.co.uk/sabatti-10-22-mod-sporter-semi-auto-22-rifles/rifles/1084442', 'https://www.gunstar.co.uk/voere-lbw-m-sniper-rifle-bolt-action-308-rifles/rifles/1084454', 'https://www.gunstar.co.uk/snipersystems-zoom-gun-light-kit-lamping/Accessories/1130763']

Another way to get all the links:

from bs4 import BeautifulSoup
import requests
import csv
import pandas
from pandas import DataFrame
import re
import os
import locale
os.environ["PYTHONIOENCODING"] = "utf-8"

page = 1
all_links = []
url="https://www.gunstar.co.uk/view-trader/global-rifle-snipersystems/58782?page={}"

with requests.Session() as session:
    while True:
        print(url.format(page))
        res = session.get(url.format(page))
        soup = BeautifulSoup(res.content, 'html.parser')
        gun_details = soup.select('div.details > a')
        for link in gun_details:
            all_links.append("https://www.gunstar.co.uk" + link['href'])
        if len(soup.select(".nav_next")) == 0:
            break
        page += 1

print(all_links)
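Both snippets import csv and pandas without using them; since the question also wants the length of the list, here is a hedged sketch of how the collected links could be counted and written out with pandas (the filename is an assumption):

```python
import pandas as pd

# Stand-in for the list the scraper builds
all_links = [
    "https://www.gunstar.co.uk/rifles/1083802",
    "https://www.gunstar.co.uk/rifles/1083784",
]

print(len(all_links))  # the length the question asks for

# One column named "link", one row per collected URL
df = pd.DataFrame({"link": all_links})
df.to_csv("gun_links.csv", index=False)  # hypothetical output file
```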