Unable to parse names properly from some elements

Date: 2017-11-08 11:51:44

Tags: python string python-3.x web-scraping css-selectors

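I've written a script in Python to parse some names out of certain elements. When I run it, the names are parsed, but the output looks odd: they run together so that they read like one big name. The names are separated by br tags. How can I get each name individually?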

The elements the names are in:

html_content='''
<div class="second-child"><div class="richText"> <p></p>
<p><strong>D<br></strong>Daiwa House Industry<br>Danske Bank<br>DaVita HealthCare Partners<br>Delphi Automotive<br>Denso<br>Dentsply International<br>Deutsche Boerse<br>Deutsche Post<br>Deutsche Telekom<br>Diageo<br>Dialight<br>Digital Realty Trust<br>Donaldson Company<br>DSM<br>DS Smith </p>
<p><strong>E<br></strong>East Japan Railway Company<br>eBay<br>EDP Renováveis<br>Edwards Lifesciences<br>Elekta<br>EnerNOC<br>Enphase Energy<br>Essilor<br>Etsy<br>Eurazeo<br>European Investment Bank (EIB)<br>Evonik Industries<br>Express Scripts&nbsp;<br><br><strong>F<br></strong>Fielmann<br>First Solar<br>FMO<br>Ford Motor<br>Fresenius Medical Care<br><br></p></div></div>
'''

The script I used to parse the names:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_content,"lxml")
for items in soup.select(".second-child"):
    name = ' '.join([item.text for item in items.select("p")])
    print(name)

The output I'm getting (partial):

DDaiwa House IndustryDanske BankDaVita HealthCare PartnersDelphi AutomotiveDensoDentsply InternationalDeutsche

The output I'd like to get:

DDaiwa House Industry
Danske Bank
DaVita HealthCare Partners
Delphi Automotive
Denso
Dentsply International

FYI, when I looked closely at the result, I found that each individual name is joined directly to the next, with no gap in between.

2 answers:

Answer 0 (score: 1)

Rather than stripping all the markup with item.text, you need to replace each <br> with '\n' first. This uses the answer Ian Mackinnon gave to the question: Convert </br> to end line

Your script should be:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_content,"lxml")

for br in soup.find_all("br"):
    br.replace_with("\n")

for items in soup.select(".second-child"):
    name = ' '.join([item.text for item in items.select("p")])
    print(name)
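
As a rough, self-contained illustration of the same idea, the sketch below runs the <br> replacement on a shortened copy of the markup. The `sample` string and the `names` list are made up for this example, and "html.parser" is used only so the snippet does not depend on lxml being installed.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical copy of the markup from the question.
sample = (
    "<div class='second-child'><p><strong>D<br></strong>"
    "Daiwa House Industry<br>Danske Bank<br>Denso </p></div>"
)

soup = BeautifulSoup(sample, "html.parser")

# Turn every <br> into a real newline so the text keeps its line breaks.
for br in soup.find_all("br"):
    br.replace_with("\n")

# Split each paragraph on those newlines and drop blank lines.
names = [
    line.strip()
    for p in soup.select(".second-child p")
    for line in p.get_text().splitlines()
    if line.strip()
]

print("\n".join(names))
```

With this sample, each name ends up on its own line, and the group letter "D" appears as a separate first line.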

Answer 1 (score: 1)

Check the following solution and let me know if it needs any improvement:

for items in soup.select(".second-child"):
    for text_nodes in items.select("p"):
        name = " \n".join([item for item in text_nodes.strings if item])
        print(name)

Output:

D 
Daiwa House Industry 
Danske Bank 
DaVita HealthCare Partners 
Delphi Automotive 
Denso 
Dentsply International 
Deutsche Boerse 
Deutsche Post 
Deutsche Telekom 
Diageo 
Dialight 
Digital Realty Trust 
Donaldson Company 
DSM 
DS Smith 
E 
East Japan Railway Company 
eBay 
EDP Renováveis 
Edwards Lifesciences 
Elekta 
EnerNOC 
Enphase Energy 
Essilor 
Etsy 
Eurazeo 
European Investment Bank (EIB) 
Evonik Industries 
Express Scripts  
F 
Fielmann 
First Solar 
FMO 
Ford Motor 
Fresenius Medical Care
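
A possible variation, offered only as a sketch: Beautiful Soup's `stripped_strings` generator yields each text node with surrounding whitespace removed and skips nodes that are empty after stripping, which avoids the trailing spaces and the blank line produced by the empty <p></p>. The `sample` string and `names` list are hypothetical, and "html.parser" is used here instead of lxml to keep the snippet dependency-free.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical copy of the markup, including the empty <p>
# and the trailing &nbsp; that appear in the question.
sample = (
    "<div class='second-child'><p></p>"
    "<p><strong>E<br></strong>eBay<br>Express Scripts&nbsp;<br><br>"
    "<strong>F<br></strong>Fielmann<br><br></p></div>"
)

soup = BeautifulSoup(sample, "html.parser")

names = []
for p in soup.select(".second-child p"):
    # stripped_strings trims each text node and drops whitespace-only ones.
    names.extend(p.stripped_strings)

print("\n".join(names))
```

The group letters ("E", "F") still come through as separate entries; filtering them out would need an extra step, such as skipping strings whose parent is a <strong> tag.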