Web scraping with Beautiful Soup and saving the results to a DataFrame

Time: 2019-08-25 08:41:07

Tags: python-3.x web-scraping beautifulsoup

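I am trying to scrape a website where the content sits inside a div and the individual details sit in divs with their own classes. I can parse the content with Beautiful Soup, but I cannot get it into a DataFrame in the format I want.

Here is what I have tried: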

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup

    unspsc_link = "https://www.besse.com/pages/products-specialties/productsbyspecialty/urology/eligard"
    link = requests.get(unspsc_link).text
    soup = BeautifulSoup(link, 'lxml')

    div = soup.find('div', {'class': 'prdFormTable'})

    # Lists to store the scraped data in
    prdTitle      = []
    prdSubTitle   = []
    prdDesc       = []
    prdItemNumber = []
    prdNDC        = []
    prdCode       = []

    for links in div.find_all('div', {'class': 'prdFormTableRow'}):

        if links.find('div', class_ = 'prdTitle') is not None:
            name = links.text
            prdTitle.append(name)

        if links.find('div', class_ = 'prdSubTitle') is not None:
            sub = links.text
            prdSubTitle.append(sub)

        if links.find('div', class_ = 'prdDesc') is not None:
            sub = links.text
            prdDesc.append(sub)

        if links.find('div', class_ = 'prdItemNumber') is not None:
            sub = links.text
            prdItemNumber.append(sub)

        if links.find('div', class_ = 'prdNDC') is not None:
            sub = links.text
            prdNDC.append(sub)

        if links.find('div', class_ = 'prdCode') is not None:
            sub = links.text
            prdCode.append(sub)


    test_df = pd.DataFrame({'prdtitle': prdTitle,
    'subTitle': prdSubTitle,
    'prdDesc': prdDesc,
    'prdItemNumber': prdItemNumber,
    'prdNdc': prdNDC,
    'prdcode': prdCode
    })

It does save the values in the lists, but not in the right format.

When I print(prdTitle):

        ['\n\n\n\n\nELIGARD® 7.5mg Kit (1 Month) \nTolmar Pharmaceuticals\nLeuprolide acetate for injectable suppression, 7.5mg every month. ELIGARD is the only LHRH agonist with the innovative ATRIGEL® Delivery System. \n\n\nItem # 44768 \nNDC: 62935-0753-75\nHCPCS CODE: J9217 \n\n\n\n\xa090 Day Terms\r\n           \n            \r\n           Credit Card   Next Day Delivery\r\n         \r\n         \n\nPLACE ORDER\n\n',
         '\n\n\n\n\n\u200bELIGARD® 22.5mg Kit (3 Month)  \nTolmar Pharmaceuticals\nLeuprolide acetate for injectable suppression, 22.5mg every 3 months. ELIGARD is the only LHRH agonist with the innovative ATRIGEL® Delivery System. \n\n\nItem # 44769 \nNDC: 62935-0223-05\nHCPCS CODE: J9217 \n\n\n\n\xa090 Day Terms\r\n           \n            \r\n           Credit Card   Next Day Delivery\r\n         \r\n         \n\nPLACE ORDER\n\n',
         '\n\n\n\n\nELIGARD® 30mg Kit (4 Month) \nTolmar Pharmaceuticals\nLeuprolide acetate for injectable suppression, 30mg every 4 months. ELIGARD is the only LHRH agonist with the innovative ATRIGEL® Delivery System. \n\n\nItem # 44770 \nNDC:  \u200b62935-0303-30\nHCPCS CODE: J9217 \n\n\n\n\xa090 Day Terms\r\n           \n            \r\n           Credit Card   Next Day Delivery\r\n         \r\n         \n\nPLACE ORDER\n\n',
         '\n\n\n\n\nELIGARD® 45mg Kit (6 Month) \nTolmar Pharmaceuticals\nLeuprolide acetate for injectable suppression, 45mg every 6 months. ELIGARD is the only LHRH agonist with the innovative ATRIGEL® Delivery System. \n\n\nItem # 44771 \nNDC: \u200b62935-0453-45\nHCPCS CODE: J9217 \n\n\n\n\xa090 Day Terms\r\n           \n            \r\n           Credit Card   Next Day Delivery\r\n         \r\n         \n\nPLACE ORDER\n\n']

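The prdTitle list ends up containing all the details of each row, but I only want it to hold the product title, with the other lists holding their own respective values.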

1 Answer:

Answer 0 (score: 1)

You want to work with the element that find returns at this level:

if links.find('div', class_ = 'prdTitle') is not None:
    name = links.text
    prdTitle.append(name)

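The code above still reads .text from links (the whole row), not from the result of links.find, so every list receives the text of the entire row.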
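To make the difference concrete, here is a minimal sketch on a hand-written HTML fragment. The markup is an assumption modelled on the class names in the question, not a copy of the real page, and `row` plays the role of `links`. It uses the built-in `html.parser` so it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics one product row; the real page may differ.
html = """
<div class="prdFormTableRow">
  <div class="prdTitle">ELIGARD® 7.5mg Kit (1 Month)</div>
  <div class="prdSubTitle">Tolmar Pharmaceuticals</div>
  <div class="prdItemNumber">Item # 44768</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
row = soup.find("div", class_="prdFormTableRow")

# .text on the row concatenates the text of every descendant element.
whole_row_text = row.text

# .text on the element returned by find() holds only that element's text.
title_tag = row.find("div", class_="prdTitle")
title_text = title_tag.text.strip() if title_tag is not None else "N/A"

print(repr(whole_row_text))
print(title_text)
```

The first print shows the title, subtitle and item number run together, which is the same symptom as in the question; the second shows the title alone.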

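With select_one you can do the following (the same works with find): assign the result to a variable and then work with that variable.

Also consider using a dictionary, which would be more efficient than always appending to separate lists.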

prdTitles = []
prdSubTitles = []
prdDescs = []
prdItemNumbers = []
prdNDCs = []
prdCodes = []

for row in soup.select('.prdFormTableRow'):
    prdTitle = row.select_one('.prdTitle')
    if prdTitle is None:
        prdTitles.append('N/A')
    else:
        prdTitles.append(prdTitle.text.strip().replace('\u200b',''))

    prdSubTitle = row.select_one('.prdSubTitle')
    if prdSubTitle is None:
        prdSubTitles.append('N/A')
    else:
        prdSubTitles.append(prdSubTitle.text.strip())

    prdDesc = row.select_one('.prdDesc')
    if prdDesc is None:
        prdDescs.append('N/A')
    else:
        prdDescs.append(prdDesc.text.strip())    

    prdItemNumber = row.select_one('.prdItemNumber')
    if prdItemNumber is None:
        prdItemNumbers.append('N/A')
    else:
        prdItemNumbers.append(prdItemNumber.text.strip())

    prdNDC = row.select_one('.prdNDC')
    if prdNDC is None:
        prdNDCs.append('N/A')
    else:
        prdNDCs.append(prdNDC.text.strip())

    prdCode = row.select_one('.prdCode')
    if prdCode is None:
        prdCodes.append('N/A')
    else:
        prdCodes.append(prdCode.text.strip())
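The loop above removes the zero-width space (`\u200b`) only from the title, although the output in the question shows it in the NDC values too. As a rough sketch of the dictionary idea mentioned earlier, the version below builds one dict per row and cleans every field the same way. `FIELDS`, `clean_text` and `rows_to_records` are hypothetical names introduced here, and the sample HTML is invented to mimic the class names from the question.

```python
from bs4 import BeautifulSoup

# CSS class names taken from the question; they double as the dict keys.
FIELDS = ["prdTitle", "prdSubTitle", "prdDesc", "prdItemNumber", "prdNDC", "prdCode"]


def clean_text(text):
    """Drop zero-width spaces, replace non-breaking spaces, collapse whitespace."""
    text = text.replace("\u200b", "").replace("\xa0", " ")
    return " ".join(text.split())


def rows_to_records(soup):
    """Return one dict per .prdFormTableRow, using 'N/A' for missing fields."""
    records = []
    for row in soup.select(".prdFormTableRow"):
        record = {}
        for field in FIELDS:
            tag = row.select_one("." + field)
            record[field] = clean_text(tag.text) if tag is not None else "N/A"
        records.append(record)
    return records


# Invented sample markup; the structure of the real page is an assumption.
sample_html = """
<div class="prdFormTable">
  <div class="prdFormTableRow">
    <div class="prdTitle">\u200bELIGARD® 22.5mg Kit (3 Month)  </div>
    <div class="prdSubTitle">Tolmar Pharmaceuticals</div>
    <div class="prdNDC">NDC:  \u200b62935-0223-05</div>
  </div>
</div>
"""

records = rows_to_records(BeautifulSoup(sample_html, "html.parser"))
print(records)
```

Because every record has the same keys, `pd.DataFrame(records)` should give one row per product with the field names as columns. Passing the six lists from the loop above to `pd.DataFrame`, as in the question, works as well, since the 'N/A' placeholders keep all lists the same length.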