使用python&替换为逗号美丽的汤

时间:2017-03-13 14:25:20

标签: python html beautifulsoup

enter image description here

Html代码行如上。

我设法从this url

获得了它
import requests
from bs4 import BeautifulSoup as soup

url = 'https://www.saa.gov.uk/search/?SEARCHED=1&ST=&SEARCH_TERM=city+of+edinburgh%2C+BOSWALL+PARKWAY%2C+EDINBURGH&ASSESSOR_ID=&SEARCH_TABLE=valuation_roll_cpsplit&DISPLAY_COUNT=10&TYPE_FLAG=CP&ORDER_BY=PROPERTY_ADDRESS&H_ORDER_BY=SET+DESC&DRILL_SEARCH_TERM=BOSWALL+PARKWAY%2C+EDINBURGH&DD_TOWN=EDINBURGH&DD_STREET=BOSWALL+PARKWAY&UARN=110B60329&PPRN=000000000001745&ASSESSOR_IDX=10&DISPLAY_MODE=FULL#results'

baseurl = 'https://www.saa.gov.uk'
session = requests.session()
response = session.get(url)

# content of search page in soup 
html = soup(response.content,"lxml")

Address = LeftBlockData[3].get_text().strip()
print (Address)

然而,它打印得像'29 BOSWALL PARKWAYEDINBURGHEH5 2BR'

文本之间有<br />,其中'no space'替换为<br />

我想在目前有public FSC(string CPS, FSCServerLocator fscLoc) { if (string.IsNullOrWhiteSpace(CPS)) { throw new Exception("No vin provided at initialization"); } //parameter filtering cps = CPS; } 的地方加逗号。

有人可以推荐一种方法吗?

3 个答案:

答案 0 :(得分:5)

您可以在获取节点文本时设置分隔符。

from bs4 import BeautifulSoup

example = """<td rowspan="1">29 BOSWALL PARKWAY<br />EDINBURGH<br />EHS 2BR</td>"""

soup = BeautifulSoup(example, "xml")

print(soup.find("td").get_text(strip=True, separator=','))

输出:

  

29 BOSWALL PARKWAY,EDINBURGH,EHS 2BR

答案 1 :(得分:0)

from bs4 import BeautifulSoup
soup = BeautifulSoup('<td rowspan="1">29 BOSWALL PARKWAY<br />EDINBURGH<br />EHS 2BR</td>')
text_list = soup.td.contents
text_you_want = '{}, {}, {}'.format(text_list[0], text_list[2], text_list[4])

// similarly

text_you_want = ", ".join([item for item in text_list if isinstance(item, str)])

答案 2 :(得分:0)

我想到了可能的方法:

  1. 使用.get_text(separator=" ")。部件将由空格分隔。
  2. 在BeautifulSoup4中,您可以使用strings" ".join(LeftBlockData[3].strings)