Parsing a string efficiently with BeautifulSoup

Time: 2018-01-03 14:29:47

Tags: python beautifulsoup html-parsing

I am trying to parse this HTML to get the item title (e.g. Big Boss Air Fryer - Healthy 1300-Watt Super Sized 16-Quart, Fryer 5 Colors -NEW)

<div style="" class="">
    <h1 class="it-ttl" itemprop="name" id="itemTitle"><span class="g-hdn">Details about  &nbsp;</span>Big Boss Air Fryer - Healthy 1300-Watt Super Sized 16-Quart, Fryer 5 Colors -NEW</h1>
            <h2 id="subTitle" class="it-sttl">
            Brand New + Free Shipping, Satisfaction Guaranteed! </h2>
    <!-- DO NOT change linkToTagId="rwid" as the catalog response has this ID set  -->
    <div class="vi-hdops-three-clmn-fix">           
        <div style="" class="vi-notify-new-bg-wrapper">
                <div class="vi-notify-new-bg-dTop" style=""> </div>
                <div id="vi_notification_new" class="vi-notify-new-bg-dBtm" style="top: -28px;"> 
                    <img src="https://ir.ebaystatic.com/rs/v/tnj4p1myre1mpff12w4j1llndmc.png" width="11" height="12" class="vi-notify-new-img" alt="Popular">
                    <span style="font-weight:bold;">5 sold in last 24 hours</span>
                </div>
            </div>
        </div>      
    </div>

I am using the following code to parse the page:

import requests
from bs4 import BeautifulSoup

url1 = "https://www.ebay.com/itm/Big-Boss-Air-Fryer-Healthy-1300-Watt-Super-Sized-16-Quart-Fryer-5-Colors-NEW/122454150244?epid=2254405949&hash=item1c82d60c64:m:mqfT2XbgveSevmN5MV1iysg"

def get_single_item_data(item_url):
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text)

    for item in soup.findAll('h1', {'class':'it-ttl'}):
        print(item.string)  # prints None for this <h1>

get_single_item_data(url1)

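When I do this, BeautifulSoup returns 'None'.

One solution I found is to use print(item.text) instead, but now I get 'Details about   Big Boss Air Fryer - Healthy 1300-Watt Super Sized 16-Quart, Fryer 5 Colors -NEW' (I do not want the 'Details about' part).

Is there an efficient way to get the item title without getting the whole text and then stripping off 'Details about'?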

2 answers:

Answer 0: (score: 1)

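This happens because of the following caveat of the .string attribute:

"If a tag contains more than one thing, then it's not clear what .string should refer to, so .string is defined to be None"

The h1 element here has more than one child (the hidden span and the title text node), so .string is ambiguous and evaluates to None.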
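To make the caveat concrete, here is a minimal sketch that contrasts a tag with two children against a tag with a single text child. The HTML is a shortened version of the snippet in the question; the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# Shortened version of the markup from the question (illustrative only)
html = (
    '<h1 id="itemTitle"><span class="g-hdn">Details about  &nbsp;</span>'
    'Big Boss Air Fryer</h1>'
    '<h2 id="subTitle">Brand New + Free Shipping</h2>'
)
soup = BeautifulSoup(html, "html.parser")

h1 = soup.find("h1", id="itemTitle")
h2 = soup.find("h2", id="subTitle")

# The <h1> has two children: the <span> tag and a text node
print(len(h1.contents))
# More than one child, so .string is None
print(h1.string)
# The <h2> has exactly one text child, so .string returns it
print(h2.string)
```

Under these assumptions the h1 reports None while the h2 returns its text, which matches the behaviour described in the question.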

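To avoid having to strip off "Details about" afterwards, you can get the first direct text node of the element by searching in non-recursive mode: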

soup.find('h1', {'class':'it-ttl'}).find(text=True, recursive=False)

Demo:

In [3]: soup = BeautifulSoup(data, "html.parser")

In [4]: print(soup.find('h1', {'class':'it-ttl'}).find(text=True, recursive=False))
Big Boss Air Fryer - Healthy 1300-Watt Super Sized 16-Quart, Fryer 5 Colors -NEW
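The same idea can be wrapped in a small helper that works on an HTML string instead of a live request. This is only a sketch: the function name get_item_title and the sample variable are hypothetical, and it uses string=True, the newer name for the text=True argument shown above. It assumes the title is the first text node directly inside the h1, as in the question's markup.

```python
from bs4 import BeautifulSoup

def get_item_title(html):
    """Return the first direct text node of <h1 class="it-ttl">, or None."""
    soup = BeautifulSoup(html, "html.parser")
    h1 = soup.find("h1", class_="it-ttl")
    if h1 is None:
        return None
    # recursive=False limits the search to direct children of the <h1>,
    # so the text inside the nested <span> is skipped
    title = h1.find(string=True, recursive=False)
    return title.strip() if title is not None else None

# The <h1> line from the question, used as sample input
sample = (
    '<h1 class="it-ttl" itemprop="name" id="itemTitle">'
    '<span class="g-hdn">Details about  &nbsp;</span>'
    'Big Boss Air Fryer - Healthy 1300-Watt Super Sized 16-Quart, '
    'Fryer 5 Colors -NEW</h1>'
)
print(get_item_title(sample))
```

Returning None when the heading is missing keeps the caller from hitting an AttributeError if the page layout changes.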

Answer 1: (score: 0)

You should (or could) use .text instead of .string

from bs4 import BeautifulSoup
import requests


url1 = "https://www.ebay.com/itm/Big-Boss-Air-Fryer-Healthy-1300-Watt-Super-Sized-16-Quart-Fryer-5-Colors-NEW/122454150244?epid=2254405949&hash=item1c82d60c64:m:mqfT2XbgveSevmN5MV1iysg"

def get_single_item_data(item_url):
    # Download the page and parse it with the built-in HTML parser
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'html.parser')

    for item in soup.find_all('h1', {'class': 'it-ttl'}):
        print(item.text)  # .text joins the text of all descendants

get_single_item_data(url1)

Output:

Details about   Big Boss Air Fryer - Healthy 1300-Watt Super Sized 16-Quart, Fryer 5 Colors -NEW
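The "Details about" prefix appears because .text concatenates the strings of every descendant, including the hidden span. As a hedged sketch on a shortened copy of the question's markup (the variable names are illustrative), the full text can be compared with only the text nodes that sit directly inside the h1:

```python
from bs4 import BeautifulSoup

# Shortened version of the <h1> from the question (illustrative only)
html = (
    '<h1 class="it-ttl"><span class="g-hdn">Details about  &nbsp;</span>'
    'Big Boss Air Fryer</h1>'
)
h1 = BeautifulSoup(html, "html.parser").find("h1", class_="it-ttl")

# .text walks all descendants, so the <span> text is included;
# the &nbsp; entity becomes the character "\xa0"
full_text = h1.text

# Join only the text nodes that are direct children of the <h1>
own_text = "".join(h1.find_all(string=True, recursive=False)).strip()

print(repr(full_text))
print(repr(own_text))
```

Joining all direct text nodes, rather than taking just the first one, also covers headings where the title is split around a nested tag.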