Finding a string in BeautifulSoup

Date: 2013-12-02 12:00:34

Tags: python beautifulsoup findall

I am searching for the text City: that comes just before the tag I want, which holds the city and state string. Here is the HTML:

<b>City:</b>
  <a href="/city/New-York-New-York.html">New York, NY</a>

Here is the code:

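import requests
from bs4 import BeautifulSoup

zipCode = str(11021)
url = "http://www.city-data.com/zips/" + zipCode + ".html"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data)
# Look for text nodes that are exactly "City:"
main_body = soup.findAll(text="City:")
print(main_body)

However, all I get is empty brackets. How can I search for the City: text and then get the string of the next tag?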

2 Answers:

Answer 0: (score: 0)

You can iterate over next_elements from the text node until you reach an <a> tag, then extract its text:

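from bs4 import BeautifulSoup
import sys

# Parse the HTML file named on the command line.
with open(sys.argv[1], 'r') as f:
    soup = BeautifulSoup(f, 'html.parser')

for t in soup.find_all(text="City:"):
    print(t)
    # Walk forward through the document until the first <a> tag.
    for e in t.next_elements:
        if e.name == 'a':
            print(e.string)
            break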

Run it (assuming htmlfile contains the test data from the question):

python3 script.py htmlfile

Output:

City:
New York, NY
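
If you want to try this without a file on disk, here is a minimal sketch that inlines the HTML fragment from the question. The helper name first_link_text_after is made up for illustration, and the string= argument assumes a reasonably recent bs4 (older releases only accept text=):

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# The HTML fragment from the question, inlined so no file is needed.
HTML = """<b>City:</b>
  <a href="/city/New-York-New-York.html">New York, NY</a>"""

soup = BeautifulSoup(HTML, "html.parser")


def first_link_text_after(label):
    """Return the text of the first <a> after the text node equal to label.

    Hypothetical helper for illustration; returns None when nothing matches.
    """
    node = soup.find(string=label)
    if node is None:
        return None
    # next_elements yields every following node, both tags and text nodes.
    for element in node.next_elements:
        if isinstance(element, Tag) and element.name == "a":
            return element.get_text(strip=True)
    return None


city = first_link_text_after("City:")
print(city)  # New York, NY
```

The isinstance check just makes it explicit that only tags are compared against 'a'; text nodes are skipped.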

Answer 1: (score: 0)

The answers from @Birei and @JohnClements got me most of the way there, but here is the code that worked for me:

import requests
from bs4 import BeautifulSoup

zipCode = "07928"
url = "http://www.city-data.com/zips/" + zipCode + ".html"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "html.parser")
# Some pages use the label "Cities:" instead of "City:", so fall back to it.
cityNeeded = soup.findAll(text="City:") or soup.findAll(text="Cities:")
for t in cityNeeded:
    # The city and state string is in the first <a> after the label.
    print(t.find_next('a').string)
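
The same City:/Cities: fallback can probably be folded into a single search with a compiled regular expression. This is only a sketch: cities_from_html is a hypothetical helper, the "Cities:" sample markup is invented for the demo (it is not copied from city-data.com), and string= assumes a reasonably recent bs4.

```python
import re

from bs4 import BeautifulSoup

# Matches a text node that is exactly "City:" or "Cities:",
# allowing surrounding whitespace.
LABEL = re.compile(r"^\s*Cit(?:y|ies):\s*$")


def cities_from_html(html):
    """Return the text of the first <a> following each matching label.

    Hypothetical helper for illustration only.
    """
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for label in soup.find_all(string=LABEL):
        link = label.find_next("a")
        if link is not None:
            results.append(link.get_text(strip=True))
    return results


# Fragment from the question.
single = cities_from_html(
    '<b>City:</b>\n  <a href="/city/New-York-New-York.html">New York, NY</a>'
)
# Invented fragment using the plural label.
plural = cities_from_html(
    '<b>Cities:</b> <a href="/city/example.html">Example City, ST</a>'
)
print(single)
print(plural)
```

To run it against a live page, pass r.text from requests.get(url) to cities_from_html instead of the inline strings.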