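I am trying to detect strings that contain certain words ("shared" OR "amenities") inside the HTML tags <p><strong class="title"> </strong></p>, and to append the word "shared" to every comma-separated substring that appears after that tag. Is there a simple way to do this?
Example input: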
</strong></p> swimming pool, barbecue <hr /> <p><strong class="title">SHARED CLUB AMENITIES</strong></p> beach, tennis courts <hr /> <p><strong class="title">
Example output:
swimming pool, barbecue, beach shared, tennis courts shared
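Answer 0: (score: 0)
You can use a few different libraries for this; common choices are Beautiful Soup or lxml. I prefer lxml because, like regex, most languages have a similar implementation, so I feel I get more out of the investment. The snippet below locates the header text that contains "AMENITIES" or "SHARED":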
from lxml import html

stuff = '</strong></p> swimming pool, barbecue <hr /> <p><strong class="title">SHARED CLUB AMENITIES</strong></p> beach, tennis courts <hr /> <p><strong class="title">'
# Parse the (malformed) fragment into an element tree
tree = html.fromstring(stuff)
# Select the text of any child of <p> whose text contains "AMENITIES" or "SHARED"
ptag = tree.xpath('//p/*[contains(text(),"AMENITIES") or contains(text(), "SHARED")]//text()')
print(ptag)
Answer 1: (score: 0)
I used the code below to get the job done. Any comments and suggestions are welcome!
import pandas as pd
from bs4 import BeautifulSoup

html_to_parse = '</strong></p> swimming pool, barbecue <hr /> <p><strong class="title">SHARED CLUB AMENITIES</strong></p> beach, tennis courts <hr /> <p><strong class="title">'
# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(html_to_parse, 'html.parser')
# Text of the first <strong class="title"> header, here "SHARED CLUB AMENITIES"
shared_indicator = soup.find('strong', 'title').get_text()
# Everything before the header text is not shared
non_shared_amenities = html_to_parse.split(shared_indicator, 1)[0]
non_shared_amenities = (BeautifulSoup(non_shared_amenities, 'html.parser')
                        .get_text()
                        .strip()
                        )
# Everything after the header text is shared
shared_amenities = html_to_parse.split(shared_indicator, 1)[1]
shared_amenities_array = (pd.Series(BeautifulSoup(shared_amenities, 'html.parser')
                                    .get_text()
                                    .split(','))
                          .replace("[^A-Za-z0-9'`]+", " ", regex=True)  # collapse stray punctuation and whitespace
                          .str.strip()
                          .apply(lambda x: "{}{}".format(x, ' shared'))
                          )
shared_amenities_tagged = ", ".join(shared_amenities_array)
print(non_shared_amenities + ', ' + shared_amenities_tagged)
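One suggestion, offered as a rough sketch rather than a drop-in replacement: the same tagging can probably be done in a single pass with the standard library's html.parser instead of Beautiful Soup and pandas. The names AmenityTagger and tag_shared are made up for this illustration, and the rule that a later header without "shared" or "amenities" switches tagging off again is my own assumption, since the question does not say what should happen there.

```python
from html.parser import HTMLParser


class AmenityTagger(HTMLParser):
    """Hypothetical helper: collects comma-separated items and tags the
    ones that follow a <strong class="title"> header containing a keyword."""

    KEYWORDS = ("shared", "amenities")  # header words taken from the question

    def __init__(self):
        super().__init__()
        self.in_header = False  # True while inside <strong class="title">
        self.shared = False     # True after a header that contains a keyword
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "strong" and ("class", "title") in attrs:
            self.in_header = True

    def handle_endtag(self, tag):
        if tag == "strong":
            self.in_header = False

    def handle_data(self, data):
        if self.in_header:
            # Assumption: a header without a keyword ends the shared section.
            self.shared = any(word in data.lower() for word in self.KEYWORDS)
            return
        for item in data.split(","):
            item = item.strip()
            if item:  # skip whitespace-only text between tags
                self.items.append((item + " shared") if self.shared else item)


def tag_shared(html_text):
    """Return the amenities as one comma-separated string."""
    parser = AmenityTagger()
    parser.feed(html_text)
    parser.close()
    return ", ".join(parser.items)


sample = ('</strong></p> swimming pool, barbecue <hr /> '
          '<p><strong class="title">SHARED CLUB AMENITIES</strong></p> '
          'beach, tennis courts <hr /> <p><strong class="title">')
print(tag_shared(sample))
# prints: swimming pool, barbecue, beach shared, tennis courts shared
```

Unlike splitting the raw string on the header text, this keeps working when the input has more than one header, although it does assume each header and each item arrives as a single piece of text.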