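I've written a Python script to scrape addresses from a large number of HTML elements. Each address is spread over several lines separated by br
tags. However, when I run the script, I get [<br/>, <br/>, <br/>, <br/>]
as the output.
How can I get the full address?
The HTML element I want to collect the address from: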
<div class="ACA_TabRow ACA_FLeft">
Mailing
<br/>
1961 MAIN ST #186
<br/>
WATSONVILLE, CA, 95076
<br/>
United States
<br/>
</div>
What I've tried so far:
from bs4 import BeautifulSoup
import re
html = """
<div class="ACA_TabRow ACA_FLeft">
Mailing
<br/>
1961 MAIN ST #186
<br/>
WATSONVILLE, CA, 95076
<br/>
United States
<br/>
</div>
"""
soup = BeautifulSoup(html,"lxml")
items = soup.find(class_="ACA_TabRow").find(string=re.compile("Mailing")).find_next_siblings()
print(items)
Answer 0: (score: 2)
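I would go ahead and check whether the first string in the div starts with Mailing, then print the strings that follow it:
# Assumes html is the string defined in the question.
soup = BeautifulSoup(html, "lxml")
items = soup.find(class_="ACA_TabRow")
for i, item in enumerate(items.stripped_strings):
    # The first string is the label; stop if this is not a Mailing row.
    if i == 0 and not item.startswith('Mailing'):
        break
    # Every later string is one line of the address.
    if i != 0:
        print(item)
Output:
1961 MAIN ST #186
WATSONVILLE, CA, 95076
United States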
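As background on why the attempt in the question printed only br tags: find_next_siblings() called with no arguments returns tag siblings only, so the text nodes between the br tags are skipped. A minimal sketch that walks .next_siblings and keeps the text nodes instead is shown below. It assumes the built-in "html.parser" backend so that lxml is not required, and the names label and address_lines are illustrative only.

```python
import re

from bs4 import BeautifulSoup, NavigableString

html = """
<div class="ACA_TabRow ACA_FLeft">
Mailing
<br/>
1961 MAIN ST #186
<br/>
WATSONVILLE, CA, 95076
<br/>
United States
<br/>
</div>
"""

# Assumption: "html.parser" is used here; "lxml" should behave the same for this markup.
soup = BeautifulSoup(html, "html.parser")

# Locate the text node that holds the "Mailing" label.
label = soup.find(class_="ACA_TabRow").find(string=re.compile("Mailing"))

# .next_siblings yields both tags and text nodes; keep only non-empty text nodes.
address_lines = [
    sibling.strip()
    for sibling in label.next_siblings
    if isinstance(sibling, NavigableString) and sibling.strip()
]
print(address_lines)
```

Each entry of address_lines is one line of the address, so ", ".join(address_lines) would give it as a single string.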
Answer 1: (score: 0)
from bs4 import BeautifulSoup
import re
html = """
<div class="ACA_TabRow ACA_FLeft">
Mailing
<br/>
1961 MAIN ST #186
<br/>
WATSONVILLE, CA, 95076
<br/>
United States
<br/>
</div>
"""
soup = BeautifulSoup(html,"lxml")
items = soup.find(class_="ACA_TabRow")
items_list = items.text.split('\n')
results = [ x.strip() for x in items_list if x.strip() != '' ]
print(results)
Output:
['Mailing', '1961 MAIN ST #186', 'WATSONVILLE, CA, 95076', 'United States']
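The split-and-strip steps above can probably be condensed with get_text(), which accepts a separator and a strip flag. The sketch below is one possible variant, not the answerer's code: it assumes the "html.parser" backend, and it drops the leading Mailing label, which the list above still contains. The names div, lines and address_lines are illustrative.

```python
from bs4 import BeautifulSoup

html = """
<div class="ACA_TabRow ACA_FLeft">
Mailing
<br/>
1961 MAIN ST #186
<br/>
WATSONVILLE, CA, 95076
<br/>
United States
<br/>
</div>
"""

# Assumption: "html.parser" is used so the sketch runs without lxml installed.
soup = BeautifulSoup(html, "html.parser")
div = soup.find(class_="ACA_TabRow")

# strip=True trims each text node and skips the empty ones;
# the separator puts every remaining node on its own line.
lines = div.get_text(separator="\n", strip=True).split("\n")

# The first line is the "Mailing" label, the rest is the address.
address_lines = lines[1:]
print(address_lines)
```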
Answer 2: (score: 0)
It looks like I've found a better solution:
from bs4 import BeautifulSoup
import re
html = """
<div class="ACA_TabRow ACA_FLeft">
Mailing
<br/>
1961 MAIN ST #186
<br/>
WATSONVILLE, CA, 95076
<br/>
United States
<br/>
</div>
"""
soup = BeautifulSoup(html,"lxml")
items = soup.find(class_="ACA_TabRow").find(string=re.compile("Mailing")).find_parent()
find_text = ' '.join([item.strip() for item in items.strings])
print(find_text)
Output:
Mailing 1961 MAIN ST #186 WATSONVILLE, CA, 95076 United States
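Since the question mentions a large number of elements, the same idea can be wrapped in a small helper and applied to every matching div. This is only a sketch under stated assumptions: extract_address is a hypothetical helper name, the second div (a Phone row) is invented purely to show a non-matching element, and "html.parser" is used instead of lxml.

```python
from bs4 import BeautifulSoup


def extract_address(div, label="Mailing"):
    """Return the address lines joined by ', ', or None if the div is not a label row."""
    parts = list(div.stripped_strings)
    if not parts or not parts[0].startswith(label):
        return None
    # Everything after the label is treated as the address.
    return ", ".join(parts[1:])


# Hypothetical page fragment: the Phone row is made up for illustration.
html = """
<div class="ACA_TabRow ACA_FLeft">
Mailing
<br/>
1961 MAIN ST #186
<br/>
WATSONVILLE, CA, 95076
<br/>
United States
<br/>
</div>
<div class="ACA_TabRow ACA_FLeft">
Phone
<br/>
555-0100
<br/>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Collect one address per div whose first string is the Mailing label.
addresses = []
for div in soup.find_all(class_="ACA_TabRow"):
    address = extract_address(div)
    if address is not None:
        addresses.append(address)

print(addresses)
```

Returning None for non-matching rows keeps the filtering decision with the caller, which may be convenient if other labels need to be handled later.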