How do I use BeautifulSoup to extract text from a <span> inside an <li> nested in a <ul>?

Asked: 2019-08-30 11:22:35

Tags: python html web-scraping beautifulsoup

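I want to extract the content of the "Here’s what’s new" section from this page, starting at "In the coming weeks" and ending with "general enhancements".

Inspecting the markup, I see that the <span> is nested under an <li>, which is in turn nested under <ul id="GUID-8B03C49D-3A98-45F1-9128-392E55823F61__UL_E0490B159DE04E22AD519CE2E7D7A35B">. For the past few days I have tried to extract it with Python 3 and BeautifulSoup, to no avail. The code I tried is pasted below.

Could someone kindly point me in the right direction?

Attempt 1: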

from urllib.request import urlopen # open URLs 
from bs4 import BeautifulSoup # BS

import sys # sys.exit() 

page_url = 'https://www.amazon.com/gp/help/customer/display.html/ref=hp_left_v4_sib?ie=UTF8&nodeId=G54HPVAW86CHYHKS'

try: 
    page = urlopen(page_url)
except: 
    sys.exit("No internet connection. Program exiting...")

soup = BeautifulSoup(page, 'html.parser')

try: 
    for ultag in soup.find_all('ul', {'id': 'GUID-8B03C49D-3A98-45F1-9128-392E55823F61__UL_E0490B159DE04E22AD519CE2E7D7A35B'}):
        print(ultag.text)
        for spantag in ultag.find_all('span'):
            print(spantag)
except:
    print("Couldn't get What's new :(")

Attempt 2:

from urllib.request import urlopen # open URLs 
from bs4 import BeautifulSoup # BS

import sys # sys.exit() 

page_url = 'https://www.amazon.com/gp/help/customer/display.html/ref=hp_left_v4_sib?ie=UTF8&nodeId=G54HPVAW86CHYHKS'

try: 
    page = urlopen(page_url)
except: 
    sys.exit("No internet connection. Program exiting...")

soup = BeautifulSoup(page, 'html.parser')

uls = []
for ul in uls:
    for ul in soup.findAll('ul', {'id': 'GUID-8B03C49D-3A98-45F1-9128-392E55823F61__UL_E0490B159DE04E22AD519CE2E7D7A35B'}):
        if soup.find('ul'):
            break
        uls.append(ul)
    print(uls)
    for li in uls:
        print(li.text)

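Ideally, the code should return:

In the coming weeks, you will be able to read items that you own with a single click from the ‘Before You Go’ dialog.

Performance improvements, bug fixes, and other general enhancements.

But neither attempt gives me anything. It seems the ul with that id cannot be found, yet if you print(soup) everything looks fine: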

<ul id="GUID-8B03C49D-3A98-45F1-9128-392E55823F61__UL_E0490B159DE04E22AD519CE2E7D7A35B">
<li>
<span class="a-list-item"><span><strong>Read Now</strong></span>: In the coming weeks, you will be able to read items that you own with a single click from the ‘Before You Go’ dialog.</span></li>

<li>
<span class="a-list-item">Performance improvements, bug fixes, and other general enhancements.<br></li>


</ul>

2 Answers:

Answer 0 (score: 2)

With bs4 4.7.1+ you can use :contains and :has to isolate the elements:

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://www.amazon.com/gp/help/customer/display.html/ref=hp_left_v4_sib?ie=UTF8&nodeId=G54HPVAW86CHYHKS')
soup = bs(r.content, 'lxml')
text = [i.text.strip() for i in soup.select('p:has(strong:contains("Here’s what’s new:")), p:has(strong:contains("Here’s what’s new:")) + p + ul li')]
print(text)


Currently, you can also drop :contains:

text = [i.text.strip() for i in soup.select('p:has(strong), p:has(strong) + p + ul li')]
print(text)
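The selector logic above can be tried offline on a small snippet. The following is only a sketch: the HTML is a hypothetical stand-in for the help page, not its real markup, and it selects just the list items rather than the heading paragraph as well. It also assumes a recent Soup Sieve release, where :-soup-contains is the preferred spelling and the older :contains is deprecated.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the relevant part of the page (not the real markup).
html = """
<div>
  <p><strong>Here’s what’s new:</strong></p>
  <p>Version notes</p>
  <ul>
    <li><span class="a-list-item">Read Now: one-click reading.</span></li>
    <li><span class="a-list-item">Performance improvements.</span></li>
  </ul>
  <ul>
    <li>Unrelated item</li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# <p> containing a matching <strong>, then the next <p>, then the <ul>
# immediately after it, and finally every <li> inside that <ul>.
selector = 'p:has(strong:-soup-contains("what’s new")) + p + ul li'
items = [li.get_text(strip=True) for li in soup.select(selector)]
print(items)
```

The second <ul> is not matched because it does not directly follow the second <p>.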

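+ is the CSS adjacent sibling combinator. Read more here. Quoting:

Adjacent sibling combinator

The + combinator selects adjacent siblings. This means that the second element directly follows the first, and both share the same parent.

Syntax: A + B

Example: h2 + p will match all <p> elements that directly follow an <h2>.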
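To see the adjacent sibling combinator in isolation, here is a minimal sketch using soup.select on made-up markup (the headings and paragraphs are illustrative only):

```python
from bs4 import BeautifulSoup

# Illustrative markup: one <h2> followed by two sibling paragraphs.
html = """
<div>
  <h2>Title</h2>
  <p>first paragraph</p>
  <p>second paragraph</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Only the <p> that comes directly after the <h2> matches.
adjacent = [p.get_text() for p in soup.select("h2 + p")]

# Chaining + walks one sibling further, as in "p + p + ul" above.
chained = [p.get_text() for p in soup.select("h2 + p + p")]

print(adjacent)
print(chained)
```

Whitespace between the tags does not interfere, since the combinator only considers element siblings.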

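Answer 1 (score: 0)

First, the page is rendered dynamically, so you have to use selenium to get the page content properly.

Second, you can find the p tag that contains the text "Here’s what’s new" and then get the next ul tag.

The code is as follows: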

from bs4 import BeautifulSoup as soup
from selenium import webdriver

url = "https://www.amazon.com/gp/help/customer/display.html/ref=hp_left_v4_sib?ie=UTF8&nodeId=G54HPVAW86CHYHKS"

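driver = webdriver.Firefox()

try:
    driver.get(url)  # get() returns None, so there is nothing to assign
    html = soup(driver.page_source, 'html.parser')
finally:
    driver.quit()  # always close the browser, even if loading fails

for p in html.find_all('p'):
    if "Here’s what’s new" in p.text:
        ul = p.find_next_sibling('ul')
        if ul is None:  # this <p> has no following <ul> sibling
            continue
        for li in ul.find_all('li'):
            print(li.text)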

Output:

Read Now: In the coming weeks, you will be able to read items that you own with a single click from the ‘Before You Go’ dialog.

Performance improvements, bug fixes, and other general enhancements.
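The lookup step of this answer (find the <p>, then take the next <ul> sibling) can also be checked without a browser. The sketch below runs it on a hypothetical snippet; the helper name items_after_heading and the sample markup are illustrative assumptions, not part of the original answer.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the rendered page source.
html = """
<div>
  <p><strong>Here’s what’s new:</strong></p>
  <ul>
    <li><span class="a-list-item">Read Now: one-click reading.</span></li>
    <li><span class="a-list-item">Performance improvements.</span></li>
  </ul>
</div>
"""


def items_after_heading(markup, heading_text):
    """Return the <li> texts of the first <ul> sibling that follows a <p>
    containing heading_text, or an empty list if nothing matches."""
    soup = BeautifulSoup(markup, "html.parser")
    for p in soup.find_all("p"):
        if heading_text in p.get_text():
            ul = p.find_next_sibling("ul")
            if ul is not None:  # find_next_sibling returns None when absent
                return [li.get_text(strip=True) for li in ul.find_all("li")]
    return []


found = items_after_heading(html, "what’s new")
missing = items_after_heading(html, "no such heading")
print(found)
print(missing)
```

Returning an empty list instead of raising keeps the caller simple when the page layout changes and the heading is no longer present.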