Question

The question has been updated; see below.

I am trying to look up the city and state for a ZIP code. This code works:

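import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.city-data.com/zips/11021.html")
data = r.text
soup = BeautifulSoup(data)
# Take the sixth link inside the element with id="main_body"
main_body = soup.find(id="main_body").findAll('a')[5].string
print(main_body)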

I get the following, which is the correct string:

Great Neck Plaza, NY

The following code does not work (it prints the wrong string):

zipCode = str(10023)
url = "http://www.city-data.com/zips/" + zipCode + ".html"
print(url)
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data)
# Same lookup as before: the sixth link inside id="main_body"
main_body = soup.find(id="main_body").findAll('a')[5].string
print(main_body)

This is the wrong string:

Recent home sales, real estate maps, and home value estimator for zip code 10023

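Why can't I use a string for the ZIP code? What else can I do, given that I am trying to write a function that looks up the city and state?

Update

Following some suggestions, I am now searching for the text that comes just before the tag I want. This is the text I am searching for, followed by the information I actually want: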

<b>City:</b>
 <a href="/city/New-York-New-York.html">New York, NY</a>

This is the code I am trying:

zipCode = str(11021)
url = "http://www.city-data.com/zips/" + zipCode + ".html"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data)
# Look for text nodes that are exactly "City:"
main_body = soup.findAll(text="City:")
print(main_body)

However, all I get is empty brackets. How can I search for the City: text and then get the string of the next tag?

Answer 1

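Your code runs correctly, but the premise of the solution is wrong. Your code (findAll('a')[5]) assumes that the data you want sits at the same position on every ZIP code page. However, if you look at the pages for ZIP codes 11021 and 10023, you will see that they do not contain the same number of hyperlinks. You need another way to locate the data, rather than simply grabbing index 5 of the page's array of hyperlinks.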
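To make the point concrete, here is a minimal sketch (Python 3 with bs4 assumed). The two HTML snippets are made-up, simplified stand-ins for the real pages, and the helper names link_by_index and link_after_label are hypothetical. It only illustrates why a fixed index breaks when the number of links changes, while a lookup anchored on the label does not.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-ins for two ZIP code pages.
# The real pages are much larger; only the differing link count matters here.
PAGE_A = """
<div id="main_body">
  <a href="/">Home</a>
  <b>City:</b> <a href="/city/Great-Neck-Plaza-New-York.html">Great Neck Plaza, NY</a>
</div>
"""

PAGE_B = """
<div id="main_body">
  <a href="/">Home</a>
  <a href="/sales/">Recent home sales</a>
  <b>City:</b> <a href="/city/New-York-New-York.html">New York, NY</a>
</div>
"""


def link_by_index(html, index):
    """Positional lookup: breaks as soon as a page has an extra link."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.find(id="main_body").find_all("a")[index].string


def link_after_label(html, label):
    """Label-based lookup: find the label text, then the next <a> tag."""
    soup = BeautifulSoup(html, "html.parser")
    label_node = soup.find(string=label)
    return label_node.find_next("a").string


print(link_by_index(PAGE_A, 1))           # Great Neck Plaza, NY
print(link_by_index(PAGE_B, 1))           # Recent home sales
print(link_after_label(PAGE_A, "City:"))  # Great Neck Plaza, NY
print(link_after_label(PAGE_B, "City:"))  # New York, NY
```

The same index returns the city on one page and an unrelated link on the other, which matches the symptom described in the question.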

Answer 2

This code works for me:

import requests
from bs4 import BeautifulSoup

zipCode = "07928"  # keep it a string so the leading zero survives
url = "http://www.city-data.com/zips/" + zipCode + ".html"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data)
# Some pages label the field "City:", others "Cities:"; fall back to the plural
cityNeeded = soup.findAll(text="City:")
if cityNeeded == []:
    cityNeeded = soup.findAll(text="Cities:")
for t in cityNeeded:
    # Print the text of the first link that follows the label
    print(t.find_next('a').string)
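
If you want this as a reusable function, one possible sketch is below (Python 3 with bs4 assumed). The name cities_from_html, the CITY_LABEL pattern, and the sample HTML strings are my own assumptions for illustration, not anything taken from the site. It matches both labels with a single regular expression instead of two separate searches, and it works on an HTML string so it can be tried without a network request.

```python
import re

from bs4 import BeautifulSoup

# Matches either label the code above checks for: "City:" or "Cities:".
CITY_LABEL = re.compile(r"^\s*Cit(?:y|ies):\s*$")


def cities_from_html(html):
    """Return the text of the first link following each city label."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for label in soup.find_all(string=CITY_LABEL):
        link = label.find_next("a")
        if link is not None:  # skip a label with no link after it
            results.append(link.get_text(strip=True))
    return results


# Hypothetical markup modelled on the snippet quoted in the question.
SAMPLE = '<b>City:</b> <a href="/city/New-York-New-York.html">New York, NY</a>'
SAMPLE_PLURAL = (
    '<b>Cities:</b> '
    '<a href="/city/Great-Neck-Plaza-New-York.html">Great Neck Plaza, NY</a>'
)

print(cities_from_html(SAMPLE))         # ['New York, NY']
print(cities_from_html(SAMPLE_PLURAL))  # ['Great Neck Plaza, NY']
```

For a live page you would pass requests.get(url).text to the function. Like the code above, it only takes the first link after each label, so a "Cities:" label followed by several links yields just the first one.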
