How do I get specific data inside a P tag while web scraping?

Time: 2020-01-12 09:39:53

Tags: python html web-scraping beautifulsoup

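I am trying to scrape data from a website that keeps its information inside a P tag. The only data I am interested in is the contact details, which all sit in that same P tag. How can I get only the data I need?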

Here is a screenshot of the website. How can I get the text from "Company" to the "Tel" number?

2 answers:

Answer 0 (score: 1)

You need to use a regular expression to parse the text of the <P> block you get from BeautifulSoup:

import re

text_from_p = """
some text
some more
Tel: 0234-234345-45

some more text
"""

match = re.search(r"Tel: (?P<tel>[0-9\- ]*)", text_from_p)
if match:
    print(match.group("tel"))
else:
    print("Tel not found")

This prints:

0234-234345-45
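Since the question asks for everything from "Company" to the "Tel" number, the same idea can be stretched to capture the whole span rather than one field. The following is only a minimal sketch: the sample text and the helper name `extract_contact_block` are made up for illustration, and it assumes the labels appear literally as `Company:` and `Tel:` in the text taken from the P tag.

```python
import re

# Hypothetical text standing in for what get_text() returns for the <p> tag.
text_from_p = """Press release body text.
Company: Example Co., Ltd
Address: 1 Example Road
Contact: Jane Doe
Tel: 0234-234345-45
Some trailing text
"""


def extract_contact_block(text):
    """Return the lines from 'Company:' through the 'Tel:' line, or None."""
    # re.S lets '.' cross line breaks; the lazy '.*?' stops at the first 'Tel:',
    # and '[^\n]*' then consumes only the remainder of that line.
    match = re.search(r"Company:.*?Tel:[^\n]*", text, flags=re.S)
    return match.group(0) if match else None


block = extract_contact_block(text_from_p)
print(block)
```

If either label is missing, the helper returns None instead of raising, so the caller can decide how to handle pages with a different layout.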

Answer 1 (score: 1)

You can use the re module to parse the text.

For example:

import re
import requests
from bs4 import BeautifulSoup

url = 'https://www.forpressrelease.com/forpressrelease/553538/4/china-leading-cabinet-handles-supplier-rochehandle-celebrates-success-of-entering-european-market'

soup = BeautifulSoup(requests.get(url).content, 'html.parser')

txt = soup.select_one('.single_page_content').get_text(strip=True, separator='\n')

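# Each findall(...)[0] takes the first match and raises IndexError if the label is missing.
company = re.findall(r'Company:\s*(.*)', txt)[0]
address = re.findall(r'Address:\s*(.*)', txt)[0]
contact = re.findall(r'Contact:\s*(.*)', txt)[0]
# re.S lets the lazy match cross line breaks; it stops just before the next "Label:".
email = re.findall(r'Email:\s*(.*?)\s*(?=\w+:)', txt, flags=re.S)[0]
tel = re.findall(r'Tel:\s*(.*)', txt)[0]
mob = re.findall(r'Mob:\s*(.*)', txt)[0]
# With re.S this captures everything after "Url : -" up to the end of the text.
url = re.findall(r'Url\s*:\s*-\s*(.*)', txt, flags=re.S)[0]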

print('{:<15}: {}'.format('Company', company))
print('{:<15}: {}'.format('Address', address))
print('{:<15}: {}'.format('Contact', contact))
print('{:<15}: {}'.format('Email', email))
print('{:<15}: {}'.format('Tel', tel))
print('{:<15}: {}'.format('Mob', mob))
print('{:<15}: {}'.format('Url', url))

Prints:

Company        : Dongguan Roche Industrial Co., Ltd
Address        : No.83, XiZheng 1st Road, Shajiao Community, Humen Town, Dongguan City, Guangdong Province, China 523936
Contact        : Robin Luo
Email          : info@rochehandle.com
Tel            : 0769-89366747
Mob            : +86-13392706499
Url            : https://www.rochehandle.com
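Indexing `re.findall(...)[0]` fails with an IndexError on any page where a label is absent. As a hedged variation, the lookups could be driven by a list of labels and return None for missing ones. In this sketch the sample text, the `FIELDS` list, and the `parse_fields` helper are illustrative assumptions, and it only handles values that sit on the same line as their label.

```python
import re

# Hypothetical sample text; the labels mirror those used in the answer above.
txt = """Company: Example Co., Ltd
Contact: Jane Doe
Tel: 0769-0000000
"""

FIELDS = ["Company", "Address", "Contact", "Tel", "Mob"]


def parse_fields(text, fields=FIELDS):
    """Map each label to the rest of its line, or None if the label is absent."""
    result = {}
    for name in fields:
        # re.M anchors ^ and $ at line boundaries, so a label only matches
        # at the start of a line and the value stops at the end of that line.
        match = re.search(rf"^{re.escape(name)}:[ \t]*(.*)$", text, flags=re.M)
        result[name] = match.group(1).strip() if match else None
    return result


details = parse_fields(txt)
for name, value in details.items():
    print('{:<15}: {}'.format(name, value))
```

Values that wrap onto the following line, as the Email and Url fields appear to on the original page, would still need the multi-line patterns shown in the answer.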