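Right now my code parses the link and prints everything on the page. I only want to print one specific line from the site. How can I do that?
Here is my code: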
from bs4 import BeautifulSoup
import urllib.request
r = urllib.request.urlopen("Link goes here").read()
soup = BeautifulSoup(r, "html.parser")
# This is what I want to change. I currently have it printing everything.
# I just want a specific line from the website
print (soup.prettify())
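Answer 0 (score: 3)
Don't use pretty-printing to try to parse the tds. Select the tag you want specifically: if the class name is unique, use that, and if an attribute is unique, use that instead: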
td = soup.select_one("td.content")
td = soup.select_one('td[colspan="3"]')
If it is the fourth td:
td = soup.select_one("td:nth-of-type(4)")
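To make the three selectors concrete, here is a minimal sketch that runs them against a small inline document instead of a live page. The HTML is invented for illustration and is not taken from the site in the question. The attribute value is quoted because recent Beautiful Soup releases (which delegate CSS selection to Soup Sieve) reject an unquoted value that starts with a digit.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for illustration only.
html = """
<table>
  <tr>
    <td>first</td>
    <td class="content">second</td>
    <td>third</td>
    <td>fourth</td>
  </tr>
  <tr>
    <td colspan="3">wide cell</td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# select_one returns the first match in document order, or None if nothing matches.
by_class = soup.select_one("td.content")            # unique class name
by_attr = soup.select_one('td[colspan="3"]')        # unique attribute value
by_position = soup.select_one("td:nth-of-type(4)")  # fourth td among its siblings

for cell in (by_class, by_attr, by_position):
    print(cell.get_text(strip=True))
```

Because select_one returns None when nothing matches, it is worth checking the result before reading .text on a real page.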
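If it is in a specific table, select the table first and then find the td inside it. Splitting the HTML into lines and indexing into them is actually worse than using regex to parse HTML.
You can get a specific td from the text in the bold tag that precedes it, i.e. Department of Finance Building Classification: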
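The "table first, then td" idea could look like the sketch below. The table ids and cell values are hypothetical and chosen only so the example runs offline. On a real page you would pick whatever distinguishes the table you care about.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two tables, only the second one is of interest.
html = """
<table id="summary">
  <tr><td>ignore me</td><td>ignore me too</td></tr>
</table>
<table id="details">
  <tr><td>Owner</td><td>ACME</td></tr>
  <tr><td>Status</td><td>Active</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Step 1: narrow the search to a single table.
table = soup.find("table", id="details")

# Step 2: search only inside that table (second row, second cell).
cell = table.select_one("tr:nth-of-type(2) td:nth-of-type(2)")
print(cell.get_text(strip=True))
```

Scoping the second lookup to the table means cells in other tables cannot match by accident, which is the main weakness of indexing into raw lines of HTML.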
In [19]: from bs4 import BeautifulSoup
In [20]: import urllib.request
In [21]: url = "http://a810-bisweb.nyc.gov/bisweb/PropertyProfileOverviewServlet?boro=1&houseno=1&street=park+ave&go2=+GO+&requestid=0"
In [22]: r = urllib.request.urlopen(url).read()
In [23]: soup = BeautifulSoup(r, "html.parser")
In [24]: print(soup.find("b",text="Department of Finance Building Classification:").find_next("td").text)
O6-OFFICE BUILDINGS
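The same label-based lookup can be wrapped in a small helper that returns None instead of raising when the label is missing. This is only a sketch: value_after_label is a hypothetical name, and the HTML is invented so the example runs without network access. It uses string=, the current name for the text= argument used in the session above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating a "bold label, then value cell" layout.
html = """
<table>
  <tr><td><b>Owner:</b></td><td>ACME</td></tr>
  <tr><td><b>Status:</b></td><td colspan="3">Active</td></tr>
</table>
"""


def value_after_label(soup, label):
    """Return the text of the first <td> after the <b> whose text is exactly
    `label`, or None if the label or the following cell is not found."""
    bold = soup.find("b", string=label)
    if bold is None:
        return None
    cell = bold.find_next("td")
    return cell.get_text(strip=True) if cell is not None else None


soup = BeautifulSoup(html, "html.parser")
print(value_after_label(soup, "Status:"))
print(value_after_label(soup, "No such label:"))
```

The match on the label is exact, so any trailing colon or whitespace in the page's bold text has to be included in the label you pass.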
Selecting the nth table and row:
In [25]: print(soup.select_one("table:nth-of-type(8) tr:nth-of-type(5) td[colspan='3']").text)
O6-OFFICE BUILDINGS
Answer 1 (score: 1)
{u'id': u'[redacted]', u'name': u'[redacted]'}