Question

I am trying to get the town and state for a given zip code using the following website:

http://www.zip-info.com/cgi-local/zipsrch.exe?zip=10023&Go=Go

With the following code I get all of the tr tags:

import sys
import os
from bs4 import BeautifulSoup
import requests

r = requests.get("http://www.zip-info.com/cgi-local/zipsrch.exe?zip=10023&Go=Go")
data = r.text
soup = BeautifulSoup(data)
print soup.find_all('tr')

How do I find one specific tr? In examples such as "How to find tag with particular text with Beautiful Soup?" you already know the text you are looking for. What should I use if I don't know the text in advance?

Edit

I have now added the following and I am getting nowhere:

for tag in soup.find_all(re.compile("^td align=")):
    print (tag.name)

Answer 1

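I would navigate through the HTML source with a mix of find() and find_all() calls, because I can't tell the target cells apart from the other <td> elements by position, attributes, or anything else: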

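import sys
import os
from bs4 import BeautifulSoup
import requests

l = list()

r = requests.get("http://www.zip-info.com/cgi-local/zipsrch.exe?zip=10023&Go=Go")
data = r.text
soup = BeautifulSoup(data)

# soup.find('table') returns the first <table>; iterating over it yields its children
for table in soup.find('table'):
    # take the fourth <center> element found below this child
    center = table.find_all('center')[3]
    # iterate over the children (the cells) of the last <tr> in that block
    for tr in center.find_all('tr')[-1]:
        l.append(tr.string)

# print everything except the last collected item
print(l[0:-1])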

Run it like this:

python script.py

That yields:

[u'New York', u'NY']
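
As a side note on the attempt in the question's edit: a regular expression passed as the first argument to find_all() is matched against tag names only, so a pattern like "^td align=" cannot match anything. Attributes are filtered with keyword arguments instead. Below is a minimal sketch of the difference. SAMPLE_HTML is made-up markup for illustration, not the real zip-info.com page, and "html.parser" is just the parser I picked for the example.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup (an assumption, not the real page).
SAMPLE_HTML = """
<table>
  <tr><td align="center">New York</td><td align="center">NY</td><td>10023</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# The regex is tested against tag names ("table", "tr", "td"),
# never against attributes, so nothing matches here.
by_name = soup.find_all(re.compile("^td align="))

# Filter on the attribute with a keyword argument.
# True means "the tag has this attribute, whatever its value".
aligned = soup.find_all("td", align=True)

# A string value means "the attribute equals this value".
centered = soup.find_all("td", align="center")

print(by_name)
print([td.get_text() for td in aligned])
```

On this sample the first call returns an empty list, while the keyword filters return the two cells that carry an align attribute.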

Answer 2

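After looking at the HTML of the site you provided, I would say the best way to locate the data is by text rather than by class, id, and so on.

First, you can easily identify the header row by its text using the keyword "Mailing"; from there you can easily get the row that contains the content you want.

Here is my code: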

import urllib2, re, bs4
soup = bs4.BeautifulSoup(urllib2.urlopen("http://www.zip-info.com/cgi-local/zipsrch.exe?zip=10023&Go=Go"))
# find the header, then find the next tr, which contains your data
tr = soup.find(text=re.compile("Mailing")).find_next("tr")
name, code, zip = [ td.text.strip() for td in tr.find_all("td")]
print name
print code
print zip

When printed, they look like this:

New York
NY
10023
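
The same idea can be tried offline, without fetching the page. This is only a sketch under assumptions: SAMPLE_HTML is a made-up fragment shaped roughly like a header row followed by a data row, row_after_header is a hypothetical helper name, and I use the string= argument (the newer spelling of text= in Beautiful Soup) with "html.parser".

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup (an assumption, not the real page).
SAMPLE_HTML = """
<table>
  <tr><th>Mailing Name</th><th>State Code</th><th>ZIP Code</th></tr>
  <tr><td>New York</td><td>NY</td><td>10023</td></tr>
</table>
"""


def row_after_header(html, keyword="Mailing"):
    """Return the cell texts of the first <tr> after the header text, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # Locate the text node that contains the keyword.
    header = soup.find(string=re.compile(keyword))
    if header is None:
        return None
    # find_next() walks forward in document order to the next <tr>.
    row = header.find_next("tr")
    if row is None:
        return None
    return [td.get_text(strip=True) for td in row.find_all("td")]


print(row_after_header(SAMPLE_HTML))
```

Returning None when the keyword or the following row is missing avoids an AttributeError if the page layout ever changes.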

Find a specific tag with BeautifulSoup
