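I am trying to scrape a website where the content sits inside a div and the individual details are in child divs with their own classes. I can parse the content with Beautiful Soup, but I want to save it in a data frame.
This is what I tried: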
import requests
import pandas as pd
from bs4 import BeautifulSoup

unspsc_link = "https://www.besse.com/pages/products-specialties/productsbyspecialty/urology/eligard"
link = requests.get(unspsc_link).text
soup = BeautifulSoup(link, 'lxml')
div = soup.find('div', {'class': 'prdFormTable'})

# Lists to store the scraped data in
prdTitle = []
prdSubTitle = []
prdDesc = []
prdItemNumber = []
prdNDC = []
prdCode = []

for links in div.find_all('div', {'class': 'prdFormTableRow'}):
    if links.find('div', class_ = 'prdTitle') is not None:
        name = links.text
        prdTitle.append(name)
    if links.find('div', class_ = 'prdSubTitle') is not None:
        sub = links.text
        prdSubTitle.append(sub)
    if links.find('div', class_ = 'prdDesc') is not None:
        sub = links.text
        prdDesc.append(sub)
    if links.find('div', class_ = 'prdItemNumber') is not None:
        sub = links.text
        prdItemNumber.append(sub)
    if links.find('div', class_ = 'prdNDC') is not None:
        sub = links.text
        prdNDC.append(sub)
    if links.find('div', class_ = 'prdCode') is not None:
        sub = links.text
        prdCode.append(sub)

test_df = pd.DataFrame({'prdtitle': prdTitle,
                        'subTitle': prdSubTitle,
                        'prdDesc': prdDesc,
                        'prdItemNumber': prdItemNumber,
                        'prdNdc': prdNDC,
                        'prdcode': prdCode
                        })
It does save values in the lists, but not in the correct format.
When I print(prdTitle):
['\n\n\n\n\nELIGARD® 7.5mg Kit (1 Month) \nTolmar Pharmaceuticals\nLeuprolide acetate for injectable suppression, 7.5mg every month. ELIGARD is the only LHRH agonist with the innovative ATRIGEL® Delivery System. \n\n\nItem # 44768 \nNDC: 62935-0753-75\nHCPCS CODE: J9217 \n\n\n\n\xa090 Day Terms\r\n \n \r\n Credit Card Next Day Delivery\r\n \r\n \n\nPLACE ORDER\n\n',
'\n\n\n\n\n\u200bELIGARD® 22.5mg Kit (3 Month) \nTolmar Pharmaceuticals\nLeuprolide acetate for injectable suppression, 22.5mg every 3 months. ELIGARD is the only LHRH agonist with the innovative ATRIGEL® Delivery System. \n\n\nItem # 44769 \nNDC: 62935-0223-05\nHCPCS CODE: J9217 \n\n\n\n\xa090 Day Terms\r\n \n \r\n Credit Card Next Day Delivery\r\n \r\n \n\nPLACE ORDER\n\n',
'\n\n\n\n\nELIGARD® 30mg Kit (4 Month) \nTolmar Pharmaceuticals\nLeuprolide acetate for injectable suppression, 30mg every 4 months. ELIGARD is the only LHRH agonist with the innovative ATRIGEL® Delivery System. \n\n\nItem # 44770 \nNDC: \u200b62935-0303-30\nHCPCS CODE: J9217 \n\n\n\n\xa090 Day Terms\r\n \n \r\n Credit Card Next Day Delivery\r\n \r\n \n\nPLACE ORDER\n\n',
'\n\n\n\n\nELIGARD® 45mg Kit (6 Month) \nTolmar Pharmaceuticals\nLeuprolide acetate for injectable suppression, 45mg every 6 months. ELIGARD is the only LHRH agonist with the innovative ATRIGEL® Delivery System. \n\n\nItem # 44771 \nNDC: \u200b62935-0453-45\nHCPCS CODE: J9217 \n\n\n\n\xa090 Day Terms\r\n \n \r\n Credit Card Next Day Delivery\r\n \r\n \n\nPLACE ORDER\n\n']
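The prdTitle list contains all the details of each row, but I only want it to hold the product title, with the other lists holding their own corresponding values.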
Answer 0 (score: 1)
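You want to use the element that is found at this level:
if links.find('div', class_ = 'prdTitle') is not None:
    name = links.text
    prdTitle.append(name)
The code above still reads the text of links (the whole row), not of the result of links.find. Every list therefore receives the full text of the row.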
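A minimal sketch of that fix applied to the original find-based loop. The inline SAMPLE_HTML is a hypothetical, trimmed-down stand-in for the real page (only the class names come from the question), and html.parser is used here simply to avoid depending on lxml:

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the structure described in the question.
SAMPLE_HTML = """
<div class="prdFormTable">
  <div class="prdFormTableRow">
    <div class="prdTitle">ELIGARD 7.5mg Kit (1 Month) </div>
    <div class="prdSubTitle">Tolmar Pharmaceuticals</div>
    <div class="prdItemNumber">Item # 44768 </div>
  </div>
  <div class="prdFormTableRow">
    <div class="prdTitle">ELIGARD 22.5mg Kit (3 Month) </div>
    <div class="prdSubTitle">Tolmar Pharmaceuticals</div>
    <div class="prdItemNumber">Item # 44769 </div>
  </div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find("div", {"class": "prdFormTable"})

titles = []
for row in table.find_all("div", {"class": "prdFormTableRow"}):
    # Keep the matched child in a variable instead of discarding it.
    title_div = row.find("div", class_="prdTitle")
    if title_div is not None:
        # Read the text of the matched child, not of the whole row.
        titles.append(title_div.text.strip())

print(titles)
# ['ELIGARD 7.5mg Kit (1 Month)', 'ELIGARD 22.5mg Kit (3 Month)']
```

The same pattern would apply to the other five fields.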
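With select_one you can do the following (the same approach works with find): assign the result to a variable and then use that variable.
Also consider whether a dictionary would be more efficient than always appending to separate lists.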
prdTitles = []
prdSubTitles = []
prdDescs = []
prdItemNumbers = []
prdNDCs = []
prdCodes = []

for row in soup.select('.prdFormTableRow'):
    prdTitle = row.select_one('.prdTitle')
    if prdTitle is None:
        prdTitles.append('N/A')
    else:
        prdTitles.append(prdTitle.text.strip().replace('\u200b',''))

    prdSubTitle = row.select_one('.prdSubTitle')
    if prdSubTitle is None:
        prdSubTitles.append('N/A')
    else:
        prdSubTitles.append(prdSubTitle.text.strip())

    prdDesc = row.select_one('.prdDesc')
    if prdDesc is None:
        prdDescs.append('N/A')
    else:
        prdDescs.append(prdDesc.text.strip())

    prdItemNumber = row.select_one('.prdItemNumber')
    if prdItemNumber is None:
        prdItemNumbers.append('N/A')
    else:
        prdItemNumbers.append(prdItemNumber.text.strip())

    prdNDC = row.select_one('.prdNDC')
    if prdNDC is None:
        prdNDCs.append('N/A')
    else:
        prdNDCs.append(prdNDC.text.strip())

    prdCode = row.select_one('.prdCode')
    if prdCode is None:
        prdCodes.append('N/A')
    else:
        prdCodes.append(prdCode.text.strip())
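The dictionary suggestion above could be sketched roughly as follows: one dict per row, built by looping over a mapping from column name to CSS class. The names FIELDS, clean, scrape_rows and SAMPLE_HTML are hypothetical helpers introduced for this illustration, and the sample markup is an assumed stand-in for the real page. The clean helper also removes the zero-width space (\u200b) that appears in some NDC values in the question's output.

```python
from bs4 import BeautifulSoup

# Column name -> CSS class (class names taken from the question).
FIELDS = {
    "prdtitle": "prdTitle",
    "subTitle": "prdSubTitle",
    "prdDesc": "prdDesc",
    "prdItemNumber": "prdItemNumber",
    "prdNdc": "prdNDC",
    "prdcode": "prdCode",
}


def clean(text):
    """Drop zero-width spaces, then trim surrounding whitespace."""
    return text.replace("\u200b", "").strip()


def scrape_rows(html):
    """Return one dict per .prdFormTableRow, using 'N/A' for missing fields."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for row in soup.select(".prdFormTableRow"):
        record = {}
        for column, css_class in FIELDS.items():
            node = row.select_one("." + css_class)
            record[column] = clean(node.text) if node is not None else "N/A"
        rows.append(record)
    return rows


# Hypothetical markup; the prdCode div is left out on purpose.
SAMPLE_HTML = """
<div class="prdFormTableRow">
  <div class="prdTitle">\u200bELIGARD 30mg Kit (4 Month) </div>
  <div class="prdSubTitle">Tolmar Pharmaceuticals</div>
  <div class="prdDesc">Leuprolide acetate for injectable suppression.</div>
  <div class="prdItemNumber">Item # 44770 </div>
  <div class="prdNDC">NDC: \u200b62935-0303-30</div>
</div>
"""

rows = scrape_rows(SAMPLE_HTML)
print(rows[0]["prdtitle"])  # ELIGARD 30mg Kit (4 Month)
print(rows[0]["prdNdc"])    # NDC: 62935-0303-30
print(rows[0]["prdcode"])   # N/A
```

Because every row yields a dict with the same keys, the resulting list can be passed straight to pd.DataFrame(rows), and the columns cannot end up with different lengths.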