Question

I am searching for the text "City:" that comes just before the tag I want, which holds the city and state string. Here is the HTML:

<b>City:</b>
  <a href="/city/New-York-New-York.html">New York, NY</a>

Here is the code:

import requests
from bs4 import BeautifulSoup

zipCode = str(11021)
url = "http://www.city-data.com/zips/" + zipCode + ".html"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "html.parser")
main_body = soup.findAll(text="City:")
print(main_body)

However, all I get back is empty brackets. How can I search for the "City:" text and then get the string of the next tag?

Answer 1

You can iterate over next_elements from the text node until you reach an <a> tag, then extract its text:

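from bs4 import BeautifulSoup
import sys

# Parse the HTML file given as the first command-line argument.
with open(sys.argv[1], 'r') as f:
    soup = BeautifulSoup(f, 'html.parser')

for t in soup.find_all(text="City:"):
    print(t)
    # Walk forward through the document until the first <a> tag.
    for e in t.next_elements:
        if e.name == 'a':
            print(e.string)
            break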

Run it (assuming htmlfile contains the test data from the question):

python3 script.py htmlfile

This yields:

City:
New York, NY
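
For a version that runs without a separate file, here is a minimal sketch that parses the question's HTML snippet inline. The helper name city_after_label is made up for illustration. It relies on find_next, which walks forward through the same sequence as next_elements and stops at the first match, and on string=, the newer spelling of the text= argument (bs4 4.4 and later).

```python
from bs4 import BeautifulSoup

# Test data copied from the question.
HTML = '''<b>City:</b>
  <a href="/city/New-York-New-York.html">New York, NY</a>'''


def city_after_label(html, label="City:"):
    """Return the text of the first <a> following the label, or None.

    Hypothetical helper, not part of BeautifulSoup.
    """
    soup = BeautifulSoup(html, "html.parser")
    node = soup.find(string=label)  # exact match on a text node
    if node is None:
        return None
    link = node.find_next("a")  # first <a> after the label in document order
    return link.string if link is not None else None


if __name__ == "__main__":
    print(city_after_label(HTML))
```

Returning None when the label is missing lets the caller tell "no match" apart from an error without catching an exception.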

Answer 2

The answers from @Birei and @JohnClements got me most of the way there, but here is the code that worked for me:

import requests
from bs4 import BeautifulSoup

zipCode = "07928"
url = "http://www.city-data.com/zips/" + zipCode + ".html"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "html.parser")
# Some pages use the label "Cities:" instead of "City:", so fall back to it.
cityNeeded = soup.find_all(text="City:")
if not cityNeeded:
    cityNeeded = soup.find_all(text="Cities:")
for t in cityNeeded:
    print(t.find_next('a').string)
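
Since the two cases differ only in the label, one possible simplification is a single regular expression that accepts either "City:" or "Cities:" and tolerates surrounding whitespace. This is a sketch, not the answerer's code: the function name find_city_links and the sample HTML are assumptions, and it has not been checked against the live city-data.com markup.

```python
import re

from bs4 import BeautifulSoup

# Matches a text node that is "City:" or "Cities:" plus optional whitespace.
LABEL = re.compile(r"^\s*Cit(?:y|ies):\s*$")

# Made-up sample based on the question's snippet, with a "Cities:" label.
SAMPLE = '<b> Cities: </b>\n<a href="/city/New-York-New-York.html">New York, NY</a>'


def find_city_links(html):
    """Return the text of the first <a> after each matching label.

    Hypothetical helper, not part of BeautifulSoup.
    """
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for node in soup.find_all(string=LABEL):
        link = node.find_next("a")
        if link is not None:  # skip labels that have no following link
            results.append(link.get_text(strip=True))
    return results


if __name__ == "__main__":
    print(find_city_links(SAMPLE))
```

Checking for None before reading the link text avoids the AttributeError that t.find_next('a').string would raise if a label had no following <a>.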

Finding a string with BeautifulSoup
