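I want to extract the profile information for this care home from the URL https://www.carehome.co.uk/carehome.cfm/searchazref/10001005FITA. The information is presented on the website in the following format:
Group: Excelcare Holdings
Person in charge: Denise Marks (Registered Manager)
Local Authority / Social Services: London Borough of Tower Hamlets Council (click for contact details)
etc.
My get_deets function only outputs the first element of each of the lists "tag" and "sibling". I need the complete list of tag texts together with the corresponding information.
Script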
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup as soup
from selenium import webdriver

driver = webdriver.Chrome(executable_path=r'C:\Users\Main\Documents\Work\Projects\chromedriver')
my_url = "https://www.carehome.co.uk/carehome.cfm/searchazref/10001005FITA"

def make_soup(url):
    driver.get(url)
    m_soup = soup(driver.page_source, features='html.parser')
    return m_soup

main_page = make_soup(my_url)
strongs = main_page.select(".blue")

def get_deets(strongs):
    tag = []
    sibling = []
    for strong_tag in strongs:
        if strong_tag.next_sibling == '\n':
            tag.append(strong_tag.text)
            sibling.append(strong_tag.next_sibling.next_sibling.text)
        else:
            tag.append(strong_tag.text)
            sibling.append(strong_tag.next_sibling.strip())
        return tag, sibling
My current output:
get_deets(strongs)
(['Group:'], ['Excelcare Holdings'])
Desired output
tag
['Group:', 'Person in charge:', 'Local Authority / Social Services:']
sibling
['Excelcare Holdings', 'Denise Marks (Registered Manager)', 'London Borough of Tower Hamlets Council (click for contact details)']
Using this HTML:
<div class="profile-group-description col-xs-12 col-sm-8">
<p><strong class="blue">Group:</strong>
<a href="https://www.carehome.co.uk/care_search_results.cfm/searchgroup/36151505EXCA">Excelcare Holdings</a>
</p>
<p><strong class="blue">Person in charge:</strong>
Denise Marks (Registered Manager)</p>
<p><strong class="blue">Local Authority / Social Services:</strong>
London Borough of Tower Hamlets Council (<a href="https://www.carehome.co.uk/local-authorities/profile.cfm/id/Tower-Hamlets">click for contact details</a>)</p>
<p>
<strong class="blue">Type of Service:</strong>
Care Home only (Residential Care) – Privately Owned , Registered for a maximum of 44 Service Users
</p>
<p>
<strong class="blue">Registered Care Categories*:</strong>
Dementia • Learning Disability • Mental Health Condition • Old Age
</p>
Answer 0 (score: 0)
Given the HTML in your question, this can be simplified a bit:
from bs4 import BeautifulSoup as bs

care = """[your HTML]"""
soup = bs(care, 'lxml')
headers = []
rows = []
tags = soup.select('p')
for tag in tags:
    # Drop the newlines, then split on the first colon only,
    # so a colon inside the value does not truncate it.
    items = tag.text.replace('\n', '').split(':', 1)
    headers.append(items[0].strip())
    rows.append(items[1].strip())
for h, r in zip(headers, rows):
    print(h, ':', r)
Output:
Group : Excelcare Holdings
Person in charge : Denise Marks (Registered Manager)
Local Authority / Social Services : London Borough of Tower Hamlets Council (click for contact details)
Type of Service : Care Home only (Residential Care) – Privately Owned , Registered for a maximum of 44 Service Users
Registered Care Categories* : Dementia • Learning Disability • Mental Health Condition • Old Age
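If you would rather keep the get_deets approach from the question, here is a minimal sketch of how it might look. The one-element result in the question is consistent with the return statement sitting inside the for loop, so the function exits after the first strong tag; returning after the loop should give the full lists. The sketch also joins every sibling that follows the strong tag, so values that mix plain text and links come out whole. The html variable is an abbreviated copy of the snippet in the question with a closing </div> added by me, and the Tag import is my addition; I have not run this against the live page.

```python
from bs4 import BeautifulSoup, Tag

# Abbreviated copy of the HTML from the question (closing </div> added).
html = """
<div class="profile-group-description col-xs-12 col-sm-8">
<p><strong class="blue">Group:</strong>
<a href="https://www.carehome.co.uk/care_search_results.cfm/searchgroup/36151505EXCA">Excelcare Holdings</a>
</p>
<p><strong class="blue">Person in charge:</strong>
Denise Marks (Registered Manager)</p>
<p><strong class="blue">Local Authority / Social Services:</strong>
London Borough of Tower Hamlets Council (<a href="https://www.carehome.co.uk/local-authorities/profile.cfm/id/Tower-Hamlets">click for contact details</a>)</p>
</div>
"""


def get_deets(strongs):
    """Return (labels, values) for every <strong class="blue"> tag."""
    tags = []
    siblings = []
    for strong_tag in strongs:
        parts = []
        # Collect everything after the label inside the same parent:
        # nested tags contribute their text, plain strings are used as-is.
        for node in strong_tag.next_siblings:
            parts.append(node.get_text() if isinstance(node, Tag) else str(node))
        tags.append(strong_tag.get_text(strip=True))
        # Collapse newlines and repeated spaces into single spaces.
        siblings.append(" ".join("".join(parts).split()))
    # Return once, after the loop has visited every tag.
    return tags, siblings


soup = BeautifulSoup(html, "html.parser")
tags, siblings = get_deets(soup.select("strong.blue"))
for label, value in zip(tags, siblings):
    print(label, value)
```

With the real page you would pass main_page.select("strong.blue") instead of parsing the html string. Selecting "strong.blue" rather than ".blue" is a guess on my part that other elements on the page may also use the blue class.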