Preserving a multi-line address separated by "<br/>"

Time: 2019-06-16 12:17:23

Tags: python beautifulsoup web-scraping

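  • How can I remove the extra blank lines between address lines? I am scraping a web page with BeautifulSoup.
  • I know that <br/> produces a line break. But if I replace it with a space or use strip(), the separate address lines collapse into a single line. How can I keep the address lines separate, as shown in the expected output below?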

Input from the HTML:

<span class="c2">1233/B, LAC II, St. 37/B, Mehmoodabad # 6, (Behind United Bakery),<br />Karachi - 75640<br />Pakistan</span><br />

My code is as follows:

# `item` is a tag obtained earlier in the scraper (not shown here)
if item.find('span', class_='c2') is not None:
    address = item.find_all('span', class_='c2')
    for a in item.find_all('span', {"class": "c2"}):
        for addr in address:
            print('Before', addr)
            if addr.find_all("br"):
                # iterate over the span's children: text nodes and <br/> tags
                for a in addr:
                    print('a', a)
                    if '<br/>' in a:
                        print('a loop', a)

My output for the class (c2) is as follows:

<span class="c2">1233/B, LAC II, St. 37/B, Mehmoodabad # 6, (Behind United Bakery),<br />Karachi - 75640<br />Pakistan</span><br />

The test output from inside the loop is as follows:

Before <span class="c2">1233/B, LAC II, St. 37/B, Mehmoodabad # 6, (Behind United Bakery),<br/>Karachi - 75640<br/>Pakistan</span>
a 1233/B, LAC II, St. 37/B, Mehmoodabad # 6, (Behind United Bakery),
a <br/>
a Karachi - 75640
a <br/>
a Pakistan      

This produces my current, undesired output:
1233/B, LAC II, St. 37/B, Mehmoodabad # 6, (Behind United Bakery),

Karachi - 75640

Pakistan

Expected output:
 1233/B, LAC II, St. 37/B, Mehmoodabad # 6, (Behind United Bakery),
 Karachi - 75640
 Pakistan

2 Answers:

Answer 0: (score: 0)

You can use the replace_with() method of the Tag object to swap each <br/> for a newline:

from bs4 import BeautifulSoup

data = '''<span class="c2">1233/B, LAC II, St. 37/B, Mehmoodabad # 6, (Behind United Bakery),<br />Karachi - 75640<br />Pakistan</span><br />'''

soup = BeautifulSoup(data, 'lxml')

for br in soup.select('br'):
    br.replace_with('\n')

print(soup.text.strip())

Prints:

1233/B, LAC II, St. 37/B, Mehmoodabad # 6, (Behind United Bakery),
Karachi - 75640
Pakistan
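
If the page holds several such spans, the same replace_with() idea can be scoped to each span instead of the whole document. The following is only a sketch: the helper name address_text is made up for illustration, and it uses the built-in 'html.parser' rather than 'lxml' so that no extra parser needs to be installed. Note that it modifies the parsed tree in place.

```python
from bs4 import BeautifulSoup

html = '''<span class="c2">1233/B, LAC II, St. 37/B, Mehmoodabad # 6, (Behind United Bakery),<br />Karachi - 75640<br />Pakistan</span><br />'''


def address_text(span):
    """Hypothetical helper: turn every <br> inside `span` into a newline
    and return the resulting text (this mutates the parsed tree)."""
    for br in span.find_all("br"):
        br.replace_with("\n")
    return span.get_text().strip()


soup = BeautifulSoup(html, "html.parser")

# One multi-line string per <span class="c2"> found in the document
addresses = [address_text(span) for span in soup.find_all("span", class_="c2")]

for address in addresses:
    print(address)
```

Because only the <br/> tags inside each span are replaced, the trailing <br /> after the closing </span> does not add an empty line.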

Answer 1: (score: 0)

You can use stripped_strings and join the pieces with a newline:

from bs4 import BeautifulSoup as bs

html = '''
<span class="c2">1233/B, LAC II, St. 37/B, Mehmoodabad # 6, (Behind United Bakery),<br />Karachi - 75640<br />Pakistan</span><br />
'''

soup = bs(html, 'lxml')
for item in soup.select('.c2'):
    strings = '\n'.join([string for string in item.stripped_strings])
    print(strings)
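
As a possible shorthand for the join above, get_text() accepts separator and strip arguments and should give the same result here. This is a sketch using the built-in 'html.parser' (an assumption, chosen to avoid the lxml dependency); the variable names joined and via_get_text are illustrative only.

```python
from bs4 import BeautifulSoup

html = '''
<span class="c2">1233/B, LAC II, St. 37/B, Mehmoodabad # 6, (Behind United Bakery),<br />Karachi - 75640<br />Pakistan</span><br />
'''

soup = BeautifulSoup(html, "html.parser")
span = soup.find("span", class_="c2")

# Approach from the answer: join the whitespace-stripped text nodes
joined = "\n".join(span.stripped_strings)

# Shorthand: strip each text node and put a newline between them
via_get_text = span.get_text(separator="\n", strip=True)

print(via_get_text)
```

Neither form touches the <br/> tags themselves. They only collect the text nodes between them, so no blank lines are produced.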