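How to get an href link inside a selected block using BeautifulSoup

Time: 2019-07-15 21:52:10

Tags: python html beautifulsoup screen-scraping

I am trying to select a specific link inside a block using BeautifulSoup (Python 3.7). How do I select a specific link within the block I have already selected?

This is what I have so far. I have used Selenium before, but I don't think it is necessary here yet.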

 from bs4 import BeautifulSoup
 import requests

 base_url = 'http://www.shop.pr'

 # Relative shoppers path for each shop
 shop_urls = {'econo': '/econo/shoppers',
              'pueblo': '/pueblo/shoppers',
              'costco': '/costco/shoppers'}

 selected_shop = 'econo'
 append_to_url = shop_urls.get(selected_shop)

 url = base_url + append_to_url

 page = requests.get(url)

 soup = BeautifulSoup(page.text, 'html.parser')

 # Call prettify() to get the formatted markup as a string
 to_string = soup.prettify()

 # Save the page for inspection; the with block closes the file
 with open('page.txt', 'w+', encoding='utf-8') as file:
     file.write(to_string)

 wrapper = soup.find("div", {"class": "wrapper"})
 sub_wrapper = wrapper.find('div', {'class': 'breadcrumb-holder'})

 print(sub_wrapper)

After digging into the code, this is what I get:

<div class="breadcrumb-holder">
<div data-react-class="SliderPageLink" data-react-props='{"baseLink":"/econo/shoppers/donde-mejor-se-compra-20190711/4878/product-list-view","page":1,"linkText":"VER PRODUCTOS","sliderSelector":"#shopper-terminal .catalog-view .slider","show":true,"back":false}'></div>
<ul class="breadcrumb">
<li>
<a href="/">Shoppers</a>
</li>
<li>
<a href="/econo/shoppers?clientid=1"><strong>Econo</strong>
</a></li>
</ul>
</div>

I then tried to get "/econo/shoppers/donde-mejor-se-compra-20190711/4878/product-list-view", but it returns None.

2 answers:

Answer 0: (score: 0)

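The data-react-props value you are trying to get looks like a valid Python dictionary. If so, I suggest using ast.literal_eval to convert it to a dictionary and then reading the key you need.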

import ast 
# Your code here
drp = wrapper.find('div' , {'data-react-class': 'SliderPageLink'})['data-react-props']
drp_dict = ast.literal_eval(drp.replace(':true', ':True').replace(':false', ':False'))
base_link = drp_dict['baseLink'] # Your link here

Using ast.literal_eval appears to be safe, as its documentation states:

Help on function literal_eval in module ast:

literal_eval(node_or_string)
    Safely evaluate an expression node or a string containing a Python
    expression.  The string or node provided may only consist of the following
    Python literal structures: strings, numbers, tuples, lists, dicts, booleans,
    and None.

However, the string may need some changes first, because lowercase true and false are not valid Python expressions.
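A minimal sketch of why that change is needed, using only the attribute string copied from the question. The comparison with json.loads is added here for illustration and is not part of this answer's code; the names raw_parse_failed, via_ast and via_json are hypothetical.

```python
import ast
import json

# Attribute value copied from the HTML output in the question.
props = ('{"baseLink":"/econo/shoppers/donde-mejor-se-compra-20190711/4878/product-list-view",'
         '"page":1,"linkText":"VER PRODUCTOS",'
         '"sliderSelector":"#shopper-terminal .catalog-view .slider",'
         '"show":true,"back":false}')

# Without the replacements, literal_eval rejects the bare name `true`.
try:
    ast.literal_eval(props)
    raw_parse_failed = False
except (ValueError, SyntaxError):
    raw_parse_failed = True

# With the replacements from the answer, the string is a Python literal.
patched = props.replace(':true', ':True').replace(':false', ':False')
via_ast = ast.literal_eval(patched)

# The value is already JSON, so json.loads parses it unchanged.
via_json = json.loads(props)

print(raw_parse_failed)
print(via_ast == via_json)
print(via_json['baseLink'])
```

The string replacement only works while ':true' and ':false' never occur inside a string value, which is one reason a JSON parser may be the more robust choice for this attribute.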

Answer 1: (score: 0)

If I understand correctly what you are looking for, this should work:

First:

import json

Then add the following after the wrapper part of your code:

target = sub_wrapper.find('div')
td = json.loads(target['data-react-props'])
print(td['baseLink'])

Output:

'/econo/shoppers/donde-mejor-se-compra-20190711/4878/product-list-view'
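For anyone who wants to try this without fetching the page, here is a self-contained sketch that runs the same idea against a trimmed copy of the markup from the question. The helper name extract_base_link and the choice to look the div up by its data-react-class attribute are assumptions made for illustration; the live page may be structured differently.

```python
import json
from bs4 import BeautifulSoup

# Trimmed copy of the markup shown in the question.
html = '''
<div class="breadcrumb-holder">
<div data-react-class="SliderPageLink" data-react-props='{"baseLink":"/econo/shoppers/donde-mejor-se-compra-20190711/4878/product-list-view","page":1,"show":true,"back":false}'></div>
<ul class="breadcrumb">
<li><a href="/">Shoppers</a></li>
</ul>
</div>
'''

def extract_base_link(markup):
    """Return baseLink from the SliderPageLink div, or None if it is absent."""
    soup = BeautifulSoup(markup, 'html.parser')
    # Match on the attribute instead of relying on the div's position.
    target = soup.find('div', attrs={'data-react-class': 'SliderPageLink'})
    if target is None or not target.has_attr('data-react-props'):
        return None
    # The attribute value is JSON, so it can be parsed directly.
    return json.loads(target['data-react-props']).get('baseLink')

base_link = extract_base_link(html)
print(base_link)
```

Returning None when the div or attribute is missing avoids a TypeError from subscripting a failed find() result, which can happen if the page layout changes.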