Question

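Currently, my code parses the link and prints all of the information from the website. I only want to print one specific line from the website. How can I do this?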

Here is my code:

from bs4 import BeautifulSoup
import urllib.request

r = urllib.request.urlopen("Link goes here").read()
soup = BeautifulSoup(r, "html.parser")

# This is what I want to change. I currently have it printing everything.
# I just want a specific line from the website
print(soup.prettify())

Answer 1

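Don't use prettify to try to parse the tds; select the tag you want directly. If an attribute is unique, use it; if the class name is unique, just use that: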

td = soup.select_one("td.content")
td = soup.select_one('td[colspan="3"]')
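As a rough illustration of those two selectors, here is a minimal sketch run against a made-up HTML snippet (the markup and cell contents are assumptions, not the real page). The attribute value is quoted because newer Beautiful Soup releases hand CSS selection to the soupsieve package, which, as far as I know, rejects an unquoted value that starts with a digit.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only; the real page will differ.
html = """
<table>
  <tr><td class="label">Name</td><td class="content">Alice</td></tr>
  <tr><td colspan="3">Spans three columns</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Select by a unique class name.
by_class = soup.select_one("td.content")

# Select by a unique attribute value (quoted, see the note above).
by_attr = soup.select_one('td[colspan="3"]')

# select_one returns None when nothing matches, so check before using .text.
missing = soup.select_one("td.no-such-class")

print(by_class.get_text(strip=True))
print(by_attr.get_text(strip=True))
print(missing)
```

Because select_one returns None on a miss, it is worth guarding the result before reading .text on a live page whose layout may change.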

If it is the fourth td:

td = soup.select_one("td:nth-of-type(4)")
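One thing worth knowing about that selector: nth-of-type counts among siblings, so it matches the fourth td of every row that has one, and select_one simply returns the first such match in document order. A small sketch on invented markup (the cell names are placeholders):

```python
from bs4 import BeautifulSoup

# Invented table; cell text encodes row and column for readability.
html = """
<table>
  <tr><td>r1c1</td><td>r1c2</td><td>r1c3</td><td>r1c4</td><td>r1c5</td></tr>
  <tr><td>r2c1</td><td>r2c2</td><td>r2c3</td><td>r2c4</td></tr>
  <tr><td>r3c1</td><td>r3c2</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# First td that is the fourth td within its own row.
first_fourth = soup.select_one("td:nth-of-type(4)")

# Every row's fourth td; the third row has only two cells, so it is skipped.
all_fourth = [td.get_text() for td in soup.select("td:nth-of-type(4)")]

print(first_fourth.get_text())
print(all_fourth)
```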

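If it is in a specific table, select the table first and then find the td inside it. Splitting the HTML into lines and indexing into them is actually worse than using a regex to parse HTML.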
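A sketch of the "table first, then td" approach, assuming the tables can be told apart by an id. The ids, class name, and owner name below are hypothetical; on a real page you would use whatever distinguishes the table you need.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two tables whose cells share the same class.
html = """
<html><body>
<table id="owners"><tr><td class="value">Jane Doe</td></tr></table>
<table id="building"><tr><td class="value">O6-OFFICE BUILDINGS</td></tr></table>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Unscoped: returns the first matching td anywhere in the document.
unscoped = soup.select_one("td.value")

# Scoped: narrow to the table, then search only inside it.
table = soup.select_one("table#building")
cell = table.select_one("td.value") if table is not None else None

print(unscoped.get_text(strip=True))
print(cell.get_text(strip=True) if cell is not None else "not found")
```

Scoping the search this way keeps the lookup tied to the document structure rather than to how the markup happens to be split across lines.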

You can get the specific td using the text of the bold tag that precedes it, i.e. Department of Finance Building Classification:

In [19]: from bs4 import BeautifulSoup

In [20]: import urllib.request

In [21]: url = "http://a810-bisweb.nyc.gov/bisweb/PropertyProfileOverviewServlet?boro=1&houseno=1&street=park+ave&go2=+GO+&requestid=0"

In [22]: r = urllib.request.urlopen(url).read()

In [23]: soup = BeautifulSoup(r, "html.parser")

In [24]: print(soup.find("b",text="Department of Finance Building Classification:").find_next("td").text)
O6-OFFICE BUILDINGS
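The same label lookup can be wrapped in a small helper that copes with a missing label. This is only a sketch: value_after_label is a hypothetical name, and the inline HTML is an assumed approximation of the page layout rather than a copy of it. It uses string=, the argument name Beautiful Soup has accepted since 4.4.0 in place of text=.

```python
from bs4 import BeautifulSoup

# Assumed layout: a bold label in one cell, the value in the next cell.
html = """
<table>
  <tr><td><b>Some Other Field:</b></td><td>N/A</td></tr>
  <tr>
    <td><b>Department of Finance Building Classification:</b></td>
    <td colspan="3">O6-OFFICE BUILDINGS</td>
  </tr>
</table>
"""


def value_after_label(soup, label):
    """Return the text of the first td after the <b> whose text equals label."""
    label_tag = soup.find("b", string=label)
    if label_tag is None:
        return None  # label not present on the page
    cell = label_tag.find_next("td")
    return cell.get_text(strip=True) if cell is not None else None


soup = BeautifulSoup(html, "html.parser")
found = value_after_label(soup, "Department of Finance Building Classification:")
absent = value_after_label(soup, "No Such Label:")

print(found)
print(absent)
```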

Or select the nth table and row:

In [25]: print(soup.select_one('table:nth-of-type(8) tr:nth-of-type(5) td[colspan="3"]').text)
O6-OFFICE BUILDINGS
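To see how the positional selector chains together without hitting the live site, here is a reduced sketch with two invented sibling tables. The positions (second table, second row) are chosen for this toy markup only; a positional selector like this will silently pick a different cell if the page layout changes.

```python
from bs4 import BeautifulSoup

# Invented markup: two sibling tables, two rows each.
html = """
<html><body>
<table><tr><td>t1r1</td></tr><tr><td>t1r2</td></tr></table>
<table><tr><td>t2r1</td></tr><tr><td colspan="3">t2r2</td></tr></table>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Second table among its siblings, second row inside it, then the td
# carrying colspan="3".
cell = soup.select_one('table:nth-of-type(2) tr:nth-of-type(2) td[colspan="3"]')

print(cell.get_text(strip=True) if cell is not None else "not found")
```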

Answer 2


Print a specific line (BeautifulSoup)
