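Hello, Stack Overflow folks,
I'm having trouble parsing information from a website using BeautifulSoup and lxml.
I'm trying to get address data from the site "https://www1.nyc.gov/events/events-filter.html#page-1".
From what I have found by searching, I need to:
1. Find the specific class of the information using the browser's "Inspect" tool.
2. Write something like g_data = soup.find_all("div", {"class": "event-data-detail"})
So I wrote the code below.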
import requests
from bs4 import BeautifulSoup
url = "https://www1.nyc.gov/events/events-filter.html#page-1"
r=requests.get("https://www1.nyc.gov/events/events-filter.html#page-1")
soup = BeautifulSoup(r.content)
links = soup.find_all("a")
g_data = soup.find_all("div", {"class": "event-data-detail"})
print(g_data)
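It shows this error message:
Warning (from warnings module):
  File "C:/Users/jotna/Desktop/Portfolio/1.py", line 7
    soup = BeautifulSoup(r.content)
UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
The code that caused this warning is on line 7 of the file C:/Users/jotna/Desktop/Portfolio/1.py. To get rid of this warning, pass the additional argument 'features="lxml"' to the BeautifulSoup constructor.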
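For reference, a minimal sketch of what the warning is asking for: naming the parser when constructing the soup. The inline HTML string is hypothetical and only there so the snippet runs without a network call. It uses "html.parser", which ships with Python; "lxml", the parser named in the warning, should work the same way provided the lxml package is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML, used so the example runs without a network call.
html = '<html><body><a href="https://www1.nyc.gov/">NYC</a></body></html>'

# Naming the parser explicitly avoids the UserWarning. "html.parser" ships
# with Python; "lxml" (the parser named in the warning) needs the lxml package.
soup = BeautifulSoup(html, features="html.parser")

# Collect the href attribute of every <a> tag in the parsed document.
hrefs = [a.get("href") for a in soup.find_all("a")]
print(hrefs)
```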
So I changed the code as follows (because a Stack Overflow post suggested adding the lxml code at the end).
import lxml
import requests
from bs4 import BeautifulSoup
url = "https://www1.nyc.gov/events/events-filter.html#page-1"
r=requests.get("https://www1.nyc.gov/events/events-filter.html#page-1")
soup = BeautifulSoup(r.content)
links = soup.find_all("a")
for link in links:
    if "http" in link.get("href"):
        print ("<a href='%s'>%s</a>" %(link.get("href"), link.text))
g_data = soup.find_all("div", {"span class": "address"})
print(g_data)
But it only shows empty brackets: []
How can I actually get the address data from the website?
For reference, I have also uploaded a screenshot of the page source.
Answer 0 (score: 0)
Use the site's JSON API instead of bs4; see the code below.
import requests

# The site's JSON API; pageNumber selects the page of results
base = 'https://www1.nyc.gov/calendar/api/json/search.htm?&sort=DATE&pageNumber='
for page in range(1, 186):  # pages 1 through 185
    req = requests.get(base + str(page))
    # Each page holds a list of events under 'items'
    for item in req.json()['items']:
        print('Address:', item['address'])
Output
Address: Mulberry Street, Little Italy, New York
Address: Various locations Citywide
Address: SECOND AVENUE between EAST 32 STREET and EAST 33 STREET Manhattan
Address: FIRST AVENUE between EAST 92 STREET and EAST 93 STREET Manhattan
Address: CARROLL STREET between SMITH STREET and COURT STREET Brooklyn
Address: BROADWAY between WEST 114 STREET and WEST 116 STREET Manhattan
Address: CORTELYOU ROAD between RUGBY ROAD and ARGYLE ROAD Brooklyn
Address: QUEENS BOULEVARD between 70 AVENUE and 69 ROAD Queens
Address: 79 STREET between NORTHERN BOULEVARD and 34 AVENUE Queens
Address: PRINCE STREET between MOTT STREET and MULBERRY STREET Manhattan
Address: BUSHWICK AVENUE between NOLL STREET and ARION PLACE Brooklyn
Address: Alley Pond Park Adventure Center
Address: Atlantic Avenue between 4th Avenue and Hicks Street
Address: Alexander von Humboldt statue - Central Park West and 77th Street
Address: SEVENTH AVENUE between WEST 110 STREET and WEST 111 STREET Manhattan
Address: Wave Hill House - West 249th Street and Independence Avenue
Address: Broadway between Liberty Street and Rector Street
Address: Anibal Aviles Playground
Address: Myrtle Avenue between Fresh Pond Road and Wyckoff Avenue
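If an item in the response has no 'address' key, indexing it directly raises a KeyError. Below is a minimal sketch of a more defensive extraction step. It assumes the response has the shape the code above relies on (a top-level 'items' list whose entries may carry an 'address' key). The helper name extract_addresses and the sample payload are hypothetical, made up for illustration, and not part of the API.

```python
def extract_addresses(payload):
    """Return the non-empty 'address' values from a decoded JSON payload."""
    addresses = []
    for item in payload.get("items", []):
        address = item.get("address")
        if address:  # skip items with a missing or empty address
            addresses.append(address)
    return addresses


# Hypothetical payload mimicking the assumed shape of one API page.
sample = {
    "items": [
        {"address": "Mulberry Street, Little Italy, New York"},
        {"name": "No address here"},
        {"address": "Anibal Aviles Playground"},
    ]
}

for address in extract_addresses(sample):
    print("Address:", address)
```

With a live response, this could be called as extract_addresses(req.json()) inside the page loop.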