Unable to extract an address from certain HTML elements

Time: 2019-02-01 14:32:41

Tags: python python-3.x web-scraping beautifulsoup

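I have written a Python script to scrape an address from a chunk of HTML elements. The address lines are separated by br tags. However, when I run the script, I get [<br/>, <br/>, <br/>, <br/>] as output.

How can I get the full address?

The HTML elements I want to collect the address from: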

<div class="ACA_TabRow ACA_FLeft">
 Mailing
 <br/>
 1961 MAIN ST #186
 <br/>
 WATSONVILLE, CA, 95076
 <br/>
 United States
 <br/>
</div>

What I have tried so far:

from bs4 import BeautifulSoup
import re

html = """
<div class="ACA_TabRow ACA_FLeft">
 Mailing
 <br/>
 1961 MAIN ST #186
 <br/>
 WATSONVILLE, CA, 95076
 <br/>
 United States
 <br/>
</div>
"""
soup = BeautifulSoup(html,"lxml")
items = soup.find(class_="ACA_TabRow").find(string=re.compile("Mailing")).find_next_siblings()
print(items)

3 answers:

Answer 0: (score: 2)

I would check whether the first string in the div starts with Mailing, and only then print the strings that follow it:

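from bs4 import BeautifulSoup

# html is the string defined in the question
soup = BeautifulSoup(html,"lxml")
items = soup.find(class_="ACA_TabRow")

# stripped_strings yields each text node with surrounding whitespace removed
for i,item in enumerate(items.stripped_strings):
    # stop if the first string is not the "Mailing" label
    if i==0 and not item.startswith('Mailing'):
        break
    # skip the label itself and print the address lines
    if i!=0:
        print(item)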
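
The check in the loop above can also be wrapped in a small helper that returns the address lines instead of printing them. This is only a minimal sketch: the function name extract_address and its label parameter are hypothetical, and it uses the built-in "html.parser" rather than "lxml" purely to avoid the extra dependency.

```python
from bs4 import BeautifulSoup

html = """
<div class="ACA_TabRow ACA_FLeft">
 Mailing
 <br/>
 1961 MAIN ST #186
 <br/>
 WATSONVILLE, CA, 95076
 <br/>
 United States
 <br/>
</div>
"""

def extract_address(markup, label="Mailing"):
    """Return the strings that follow the label inside the row, or [] if absent.

    extract_address is an illustrative name, not part of BeautifulSoup.
    """
    soup = BeautifulSoup(markup, "html.parser")  # assumption: html.parser instead of lxml
    row = soup.find(class_="ACA_TabRow")
    if row is None:
        return []
    # stripped_strings skips whitespace-only text nodes and strips the rest
    strings = list(row.stripped_strings)
    if not strings or not strings[0].startswith(label):
        return []
    return strings[1:]

address_lines = extract_address(html)
print(address_lines)
```

Returning a list keeps the label check in one place and lets the caller decide how to join or store the lines.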

Answer 1: (score: 0)

from bs4 import BeautifulSoup

html = """
<div class="ACA_TabRow ACA_FLeft">
 Mailing
 <br/>
 1961 MAIN ST #186
 <br/>
 WATSONVILLE, CA, 95076
 <br/>
 United States
 <br/>
</div>
"""
soup = BeautifulSoup(html,"lxml")
items = soup.find(class_="ACA_TabRow")

items_list = items.text.split('\n')

results = [ x.strip() for x in items_list if x.strip() != '' ]

Output:

print(results)
['Mailing', '1961 MAIN ST #186', 'WATSONVILLE, CA, 95076', 'United States']
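
As a possible variation on the split-and-strip approach above, get_text() accepts separator and strip arguments, so the stripping can be left to BeautifulSoup. The sketch below assumes that no single text node contains an internal newline (true for this snippet); the variable names are illustrative, and "html.parser" is used instead of "lxml" only to avoid the extra dependency.

```python
from bs4 import BeautifulSoup

html = """
<div class="ACA_TabRow ACA_FLeft">
 Mailing
 <br/>
 1961 MAIN ST #186
 <br/>
 WATSONVILLE, CA, 95076
 <br/>
 United States
 <br/>
</div>
"""

soup = BeautifulSoup(html, "html.parser")  # assumption: html.parser instead of lxml
row = soup.find(class_="ACA_TabRow")

# strip=True strips every text node and drops the empty ones,
# separator="\n" joins the remaining pieces so they can be split again
lines = row.get_text(separator="\n", strip=True).split("\n")

# the first entry is the "Mailing" label, the rest is the address
label, address_lines = lines[0], lines[1:]
full_address = ", ".join(address_lines)

print(label)
print(full_address)
```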

Answer 2: (score: 0)

It seems I have found a better solution:

from bs4 import BeautifulSoup
import re

html = """
<div class="ACA_TabRow ACA_FLeft">
 Mailing
 <br/>
 1961 MAIN ST #186
 <br/>
 WATSONVILLE, CA, 95076
 <br/>
 United States
 <br/>
</div>
"""
soup = BeautifulSoup(html,"lxml")
items = soup.find(class_="ACA_TabRow").find(string=re.compile("Mailing")).find_parent()
# stripped_strings skips whitespace-only nodes, so no stray trailing space is joined in
find_text = ' '.join(items.stripped_strings)
print(find_text)

Output:

Mailing 1961 MAIN ST #186 WATSONVILLE, CA, 95076 United States
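
For completeness, the attempt in the question can be kept close to its original shape. As the output in the question shows, find_next_siblings() called without arguments yielded only the <br/> tags and none of the text nodes between them. One way around that, sketched below, is to iterate over .next_siblings and keep only the NavigableString objects. The variable names are illustrative, and "html.parser" is used instead of "lxml" only to avoid the extra dependency.

```python
import re

from bs4 import BeautifulSoup, NavigableString

html = """
<div class="ACA_TabRow ACA_FLeft">
 Mailing
 <br/>
 1961 MAIN ST #186
 <br/>
 WATSONVILLE, CA, 95076
 <br/>
 United States
 <br/>
</div>
"""

soup = BeautifulSoup(html, "html.parser")  # assumption: html.parser instead of lxml

# the text node that contains the "Mailing" label
label = soup.find(class_="ACA_TabRow").find(string=re.compile("Mailing"))

# .next_siblings yields both tags and text nodes; keep the non-empty text nodes
parts = [
    sibling.strip()
    for sibling in label.next_siblings
    if isinstance(sibling, NavigableString) and sibling.strip()
]

print(parts)
print(' '.join(parts))
```

Unlike joining every string in the div, this leaves the Mailing label out of the result.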