Web scraping under div tags using Beautiful Soup

Time: 2019-08-26 18:12:19

Tags: python html web-scraping beautifulsoup

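I am trying to scrape a website whose details sit inside various div tags. I have tried several of them, but somehow I cannot scrape anything: every element is inside a div tag, and under the divs there are span tags as well. The code I wrote returns empty strings.

Here is my code: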

    unspsc_link = "https://order.besse.com/Orders/Search/ProductSearch?query=34431"    
    link = requests.get(unspsc_link).text
    soup = BeautifulSoup(link, 'lxml')

    prdItemNumbers = []
    prdTitles = []
    prdSubTitles = []
    prdNDCs = []
    prdUOM = []
    prdForm = []


    for row in soup.select('.row'):
        prdItemNumbers = row.select_one('.font-xs bg-teal')
        if prdItemNumbers is None:
            prdItemNumbers.append('N/A')
        else:
            prdItemNumbers.append(prdItemNumbers.text.strip().replace('\u200b',''))

        prdTitles = row.select_one('.header1')
        if prdTitles is None:
            prdTitles.append('N/A')
        else:
            prdTitles.append(prdTitles.text.strip())

        prdSubTitles = row.select_one('.header2')
        if prdSubTitles is None:
            prdSubTitles.append('N/A')
        else:
            prdSubTitles.append(prdSubTitles.text.strip())    

        prdNDCs = row.select_one('.col-sm-5')
        if prdNDCs is None:
            prdNDCs.append('N/A')
        else:
            prdNDCs.append(prdNDCs.text.strip())

        prdUOM = row.select_one('.col-sm-3')
        if prdUOM is None:
            prdUOM.append('N/A')
        else:
            prdUOM.append(prdUOM.text.strip())

        prdForm = row.select_one('.col-sm-4')
        if prdForm is None:
            prdForm.append('N/A')
        else:
            prdForm.append(prdForm.text.strip())

It raises this error:

    prdItemNumbers.append('N/A')

    AttributeError: 'NoneType' object has no attribute 'append'

1 Answer:

Answer 0: (score: 1)

    for row in soup.select('.row'):
        prdItemNumbers = row.select_one('.font-xs bg-teal')
        if prdItemNumbers is None:
            prdItemNumbers.append('N/A')
        else:
            prdItemNumbers.append(prdItemNumbers.text.strip().replace('\u200b',''))

should be

    for row in soup.select('.list-group-item'):
        # '.font-xs.bg-teal' (no space) assumes one element carries both classes
        prdItemNumber = row.select_one('.font-xs.bg-teal')
        if prdItemNumber is None:
            prdItemNumbers.append('N/A')
        else:
            prdItemNumbers.append(prdItemNumber.text.strip().replace('\u200b',''))

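The None test belongs on prdItemNumber, the element found for the current row, not on the list you are appending to. The same principle applies to the other fields, so give every list variable a plural name and every per-row element a singular one. Also, the parent class to loop over should be list-group-item.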
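To make the cause of the AttributeError concrete, here is a minimal sketch that reproduces it without any network access. `select_one_stub` is a hypothetical stand-in for `row.select_one()` in the case where nothing matches. It is not part of Beautiful Soup.

```python
def select_one_stub(selector):
    """Hypothetical stand-in for row.select_one() when no element matches."""
    return None


# Buggy pattern: the list name is reused for the lookup result.
item_numbers = []
item_numbers = select_one_stub(".missing")  # the name now refers to None, not the list
try:
    item_numbers.append("N/A")
    error_message = ""
except AttributeError as exc:
    error_message = str(exc)

# Fixed pattern: singular name for the element, plural name for the list.
item_numbers = []
item_number = select_one_stub(".missing")
if item_number is None:
    item_numbers.append("N/A")
else:
    item_numbers.append(item_number.text.strip())

print(error_message)
print(item_numbers)
```

The first half fails for the same reason as the question's code: once the name is rebound to the lookup result, the original list is no longer reachable through it.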

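The content also appears to be loaded dynamically by an XHR POST request, so the HTML returned by requests.get may not contain it. You can load the page with selenium and then parse page_source as before: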

    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    # Selenium 4 style; older releases took the driver path as the first positional argument
    d = webdriver.Chrome(service=Service(r'C:\Users\HarrisQ\Documents\chromedriver.exe'))
    d.get('https://order.besse.com/Orders/Search/ProductSearch?query=34431')
    # Wait up to 10 seconds for the dynamically loaded result rows to appear
    WebDriverWait(d, 10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".list-group-item")))
    soup = BeautifulSoup(d.page_source, 'lxml')

    prdItemNumbers = []
    prdTitles = []
    prdSubTitles = []
    prdNDCs = []
    prdUOMs = []
    prdForms = []

    for row in soup.select('.list-group-item'):

        # '.font-xs.bg-teal' (no space) assumes one element carries both classes;
        # '.font-xs bg-teal' would look for a <bg-teal> tag inside .font-xs
        prdItemNumber = row.select_one('.font-xs.bg-teal')
        if prdItemNumber is None:
            prdItemNumbers.append('N/A')
        else:
            prdItemNumbers.append(prdItemNumber.text.strip().replace('\u200b',''))

        prdTitle = row.select_one('.header1')
        if prdTitle is None:
            prdTitles.append('N/A')
        else:
            prdTitles.append(prdTitle.text.strip())

        prdSubTitle = row.select_one('.header2')
        if prdSubTitle is None:
            prdSubTitles.append('N/A')
        else:
            prdSubTitles.append(prdSubTitle.text.strip())

        prdNDC = row.select_one('.col-sm-5')
        if prdNDC is None:
            prdNDCs.append('N/A')
        else:
            prdNDCs.append(prdNDC.text.strip())

        prdUOM = row.select_one('.col-sm-3')
        if prdUOM is None:
            prdUOMs.append('N/A')
        else:
            prdUOMs.append(prdUOM.text.strip())

        prdForm = row.select_one('.col-sm-4')
        if prdForm is None:
            prdForms.append('N/A')
        else:
            prdForms.append(prdForm.text.strip())

    d.quit()
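The six near-identical if/else blocks could also be folded into one helper, which removes the chance of mixing up the element and list names. This is only a sketch: `FIELD_SELECTORS`, `text_or_default` and `collect_fields` are hypothetical names, the selectors are copied from the loop above, and `.font-xs.bg-teal` (no space) is an assumption about the page's markup. Unlike the original, it strips zero-width spaces from every field, not only the item number.

```python
# Hypothetical mapping of output field -> CSS selector, taken from the answer's loop.
# '.font-xs.bg-teal' is an assumption: one element carrying both classes.
FIELD_SELECTORS = {
    "item_number": ".font-xs.bg-teal",
    "title": ".header1",
    "subtitle": ".header2",
    "ndc": ".col-sm-5",
    "uom": ".col-sm-3",
    "form": ".col-sm-4",
}


def text_or_default(row, selector, default="N/A"):
    """Return the cleaned text of the first match inside `row`, or `default`."""
    element = row.select_one(selector)
    if element is None:
        return default
    # Strip surrounding whitespace, then drop zero-width spaces.
    return element.text.strip().replace("\u200b", "")


def collect_fields(rows, selectors=FIELD_SELECTORS):
    """Build one list per field, with exactly one entry per row."""
    columns = {name: [] for name in selectors}
    for row in rows:
        for name, selector in selectors.items():
            columns[name].append(text_or_default(row, selector))
    return columns
```

With the soup built above, `collect_fields(soup.select('.list-group-item'))` would return a dict of equal-length lists, for example `columns["title"]` in place of `prdTitles`.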