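I am trying to scrape contact information from a page. I need the name, job title, phone number, and email address.
I am learning Python and trying to write code against data I am already familiar with. I can pull out the div blocks that hold each contact, but I am not sure how to crawl through them once I have them.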
tags = soup.find_all('div', attrs={'class':'tshowcase-inner-box'})
But then I tried to crawl the child divs, with no luck.
fullname = soup.find('div', attrs={'class':'tshowcase-box-title'})
title = soup('div', attrs={'class':'tshowcase-single-position'})
phone = soup('div', attrs={'class':'tshowcase-single-telephone'})
email = soup('div', attrs={'class':'tshowcase-box-social'})
I am not sure what to do next and would appreciate any tips.
Here is the sample HTML:
<div class="tshowcase-inner-box ts-float-left ">
<div class="tshowcase-box-info ts-align-left ">
<div class="tshowcase-box-title">FULL NAME</div>
<div class="tshowcase-box-details">
<div class="tshowcase-single-position"><i class="fa fa-chevron-circle-right"></i>JOB TITLE</div>
<div class="tshowcase-single-telephone"><i class="fa fa-phone-square"></i><a href="tel:PHONE">PHONE</a></div>
</div>
<div class="tshowcase-box-social"><a href="mailto:EMAIL" rel="nofollow" target="_blank"><i class="fa fa-envelope-o fa-lg"></i></a></div>
</div>
</div>
Answer 0 (score: 0)
You can use soup.find_all to find the elements, then access their text and href values:
from bs4 import BeautifulSoup as soup
import re
d = soup(html, 'html.parser')
s = [i.text for i in d.find_all('div', {'class':re.compile('title$|position$|telephone$')})]
result = [*s, d.find('div', {'class':'tshowcase-box-social'}).a['href'][7:]]
Output:
['FULL NAME', 'JOB TITLE', 'PHONE', 'EMAIL']
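To make the two moving parts of that snippet easier to follow, here is a minimal sketch of which class names the pattern selects and what the `[7:]` slice removes. The trimmed `sample_html` and the names `page`, `pattern`, `matched_classes`, and `address` are assumptions made for this illustration, not part of the answer.

```python
import re
from bs4 import BeautifulSoup

# Trimmed-down, hypothetical version of the question's markup.
sample_html = '''<div class="tshowcase-box-title">FULL NAME</div>
<div class="tshowcase-box-details">
<div class="tshowcase-single-position">JOB TITLE</div>
</div>
<div class="tshowcase-box-social"><a href="mailto:EMAIL"></a></div>'''

page = BeautifulSoup(sample_html, "html.parser")

# Each alternative is anchored with $, so only class names that END in
# "title", "position" or "telephone" are selected.
pattern = re.compile(r"title$|position$|telephone$")
matched_classes = [div["class"][0] for div in page.find_all("div", {"class": pattern})]
print(matched_classes)  # ['tshowcase-box-title', 'tshowcase-single-position']

# "mailto:" is 7 characters long, which is what the [7:] slice drops.
href = page.find("div", {"class": "tshowcase-box-social"}).a["href"]
address = href[len("mailto:"):]
print(address)  # EMAIL
```

Note that the matches come back in document order, so the position of each value in the resulting list depends on the page layout rather than on the order of the alternatives in the pattern.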
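If you are trying to scrape multiple contact blocks on a page, you can turn the code above into a function that accepts a bs4 object and scrapes a single listing, then iterate over all of the block divs: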
def get_contact(d):
    s = [i.text for i in d.find_all('div', {'class':re.compile('title$|position$|telephone$')})]
    return [*s, d.find('div', {'class':'tshowcase-box-social'}).a['href'][7:]]

results = [get_contact(i) for i in soup(html, 'html.parser').find_all('div', {'class':'tshowcase-inner-box'})]
Output:
[['FULL NAME', 'JOB TITLE', 'PHONE', 'EMAIL']]
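The reason the function works where the question's attempt did not appears to be that it searches from each block rather than from the whole document. The sketch below illustrates that difference; the two-contact `sample_html` and the names `page`, `blocks`, `global_names`, and `scoped_names` are hypothetical and only reuse the class names from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical two-contact sample reusing the class names from the question.
sample_html = """
<div class="tshowcase-inner-box">
  <div class="tshowcase-box-title">ALICE</div>
  <div class="tshowcase-single-position">ENGINEER</div>
</div>
<div class="tshowcase-inner-box">
  <div class="tshowcase-box-title">BOB</div>
  <div class="tshowcase-single-position">DESIGNER</div>
</div>
"""

page = BeautifulSoup(sample_html, "html.parser")
blocks = page.find_all("div", attrs={"class": "tshowcase-inner-box"})

# Searching from the document root returns the first match on the page
# every time, no matter which block is being processed.
global_names = [
    page.find("div", attrs={"class": "tshowcase-box-title"}).text for _ in blocks
]

# Searching from each block only looks at that block's descendants.
scoped_names = [
    block.find("div", attrs={"class": "tshowcase-box-title"}).text for block in blocks
]

print(global_names)  # ['ALICE', 'ALICE']
print(scoped_names)  # ['ALICE', 'BOB']
```

In other words, once `find_all` has returned the blocks, each later lookup should be called on the block itself, which is what passing each div into `get_contact` does.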
Answer 1 (score: 0)
My version of getting the contact information:
data = '''<div class="tshowcase-inner-box ts-float-left ">
<div class="tshowcase-box-info ts-align-left ">
<div class="tshowcase-box-title">FULL NAME</div>
<div class="tshowcase-box-details">
<div class="tshowcase-single-position"><i class="fa fa-chevron-circle-right"></i>JOB TITLE</div>
<div class="tshowcase-single-telephone"><i class="fa fa-phone-square"></i><a href="tel:PHONE">PHONE</a></div>
</div>
<div class="tshowcase-box-social"><a href="mailto:EMAIL" rel="nofollow" target="_blank"><i class="fa fa-envelope-o fa-lg"></i></a></div>
</div>
</div>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'lxml')
data = []
for div in soup.select('.tshowcase-inner-box'):
    data.append([])
    data[-1].extend(txt.strip() for txt in div.get_text(separator='|').split('|') if txt.strip())
    data[-1].extend(a['href'].replace('mailto:', '') for a in div.select('a[href*="mailto:"]'))
print(data)
Prints:
[['FULL NAME', 'JOB TITLE', 'PHONE', 'EMAIL']]
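One caveat with splitting on `'|'` is that a value containing a literal `|` would be cut in two. A possible variation, sketched here under the assumption that the same class names apply, uses the `stripped_strings` generator that BeautifulSoup provides on every tag. The `sample_html` (with a `|` placed in the name on purpose) and the names `page` and `rows` are made up for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the name deliberately contains the "|" character.
sample_html = '''<div class="tshowcase-inner-box">
<div class="tshowcase-box-title">FULL | NAME</div>
<div class="tshowcase-single-telephone"><i class="fa fa-phone-square"></i><a href="tel:PHONE">PHONE</a></div>
<div class="tshowcase-box-social"><a href="mailto:EMAIL"><i class="fa fa-envelope-o"></i></a></div>
</div>'''

page = BeautifulSoup(sample_html, "html.parser")

rows = []
for div in page.select(".tshowcase-inner-box"):
    # stripped_strings yields every non-empty text node with surrounding
    # whitespace removed, so no separator character is needed.
    row = list(div.stripped_strings)
    # Add the address from each mailto link, dropping only the leading prefix.
    row.extend(
        a["href"].replace("mailto:", "", 1) for a in div.select('a[href^="mailto:"]')
    )
    rows.append(row)

print(rows)  # [['FULL | NAME', 'PHONE', 'EMAIL']]
```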
Answer 2 (score: 0)
If you loop over each listing, you can test whether each element is present and act accordingly:
from bs4 import BeautifulSoup as bs
import requests
html = '''
<div class="tshowcase-inner-box ts-float-left ">
<div class="tshowcase-box-info ts-align-left ">
<div class="tshowcase-box-title">FULL NAME</div>
<div class="tshowcase-box-details">
<div class="tshowcase-single-position"><i class="fa fa-chevron-circle-right"></i>JOB TITLE</div>
<div class="tshowcase-single-telephone"><i class="fa fa-phone-square"></i><a href="tel:PHONE">PHONE</a></div>
</div>
<div class="tshowcase-box-social"><a href="mailto:EMAIL" rel="nofollow" target="_blank"><i class="fa fa-envelope-o fa-lg"></i></a></div>
</div>
</div>
<div class="tshowcase-inner-box ts-float-left ">
<div class="tshowcase-box-info ts-align-left ">
<div class="tshowcase-box-title">FULL NAME2</div>
<div class="tshowcase-box-details">
<div class="tshowcase-single-position"><i class="fa fa-chevron-circle-right"></i>JOB TITLE2</div>
<div class="tshowcase-single-telephone"><i class="fa fa-phone-square"></i><a href="tel:PHONE">PHONE2</a></div>
</div>
<div class="tshowcase-box-social"><a href="mailto:EMAIL2" rel="nofollow" target="_blank"><i class="fa fa-envelope-o fa-lg"></i></a></div>
</div>
</div>
'''
soup = bs(html, 'lxml')
results = []
for listing in soup.select('.tshowcase-inner-box'):
    name = listing.select_one('.tshowcase-box-title')
    job = listing.select_one('.tshowcase-single-position')
    tel = listing.select_one('.tshowcase-single-telephone')
    email = listing.select_one('[href^=mailto]')
    if name is None:
        name = 'Not present'
    else:
        name = name.text
    if job is None:
        job = 'Not present'
    else:
        job = job.text
    if tel is None:
        tel = 'Not present'
    else:
        tel = tel.text
    if email is None:
        email = 'Not present'
    else:
        email = email['href'].replace('mailto:','')
    results.append({ 'name' : name, 'job' : job, 'tel': tel, 'email': email })
print(results)
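The four if/else pairs above all follow the same pattern, so they could be folded into a small helper. This is only a sketch of one way to do that: `text_or_default`, `parse_listing`, `MISSING`, and the shortened `sample_html` (which omits the telephone div on purpose) are hypothetical names and data introduced for this example.

```python
from bs4 import BeautifulSoup

MISSING = "Not present"

def text_or_default(tag, default=MISSING):
    """Return the tag's stripped text, or `default` when the tag was not found."""
    return tag.get_text(strip=True) if tag is not None else default

def parse_listing(listing):
    """Build one contact dict from a single tshowcase-inner-box element."""
    email_link = listing.select_one('a[href^="mailto:"]')
    return {
        "name": text_or_default(listing.select_one(".tshowcase-box-title")),
        "job": text_or_default(listing.select_one(".tshowcase-single-position")),
        "tel": text_or_default(listing.select_one(".tshowcase-single-telephone")),
        "email": (
            email_link["href"].replace("mailto:", "", 1)
            if email_link is not None
            else MISSING
        ),
    }

# Hypothetical listing with no telephone div, to exercise the fallback.
sample_html = '''
<div class="tshowcase-inner-box">
<div class="tshowcase-box-title">FULL NAME</div>
<div class="tshowcase-single-position"><i class="fa fa-chevron-circle-right"></i>JOB TITLE</div>
<div class="tshowcase-box-social"><a href="mailto:EMAIL"><i class="fa fa-envelope-o"></i></a></div>
</div>
'''

page = BeautifulSoup(sample_html, "html.parser")
contacts = [parse_listing(listing) for listing in page.select(".tshowcase-inner-box")]
print(contacts)
# [{'name': 'FULL NAME', 'job': 'JOB TITLE', 'tel': 'Not present', 'email': 'EMAIL'}]
```

The sketch uses the built-in `html.parser` so that it does not depend on lxml being installed; passing `'lxml'` as in the answer should work the same way for this markup.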