Data scraped from a website comes back empty (bs4, Python, lxml)

Time: 2018-09-23 05:43:04

Tags: beautifulsoup lxml

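Hello, Stack Overflow people,

I am having a hard time parsing information from a website using BeautifulSoup and lxml.

I am trying to get the address data from the site "https://www1.nyc.gov/events/events-filter.html#page-1".

From what I have found by searching, I need to:

1. Find the specific class of the information through the web page's "Inspect" tool.
2. Write code like g_data = soup.find_all("div", {"class": "event-data-detail"})

So I wrote the code below.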

import requests
from bs4 import BeautifulSoup

url = "https://www1.nyc.gov/events/events-filter.html#page-1"
r=requests.get("https://www1.nyc.gov/events/events-filter.html#page-1")

soup = BeautifulSoup(r.content)


links = soup.find_all("a")

g_data = soup.find_all("div", {"class": "event-data-detail"})

print(g_data)

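It showed the following error message:

Warning (from warnings module):
  File "C:/Users/jotna/Desktop/Portfolio/1.py", line 7
    soup = BeautifulSoup(r.content)
UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 7 of the file C:/Users/jotna/Desktop/Portfolio/1.py. To get rid of this warning, pass the additional argument 'features="lxml"' to the BeautifulSoup constructor.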
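The warning above asks for the parser to be named explicitly. Below is a minimal sketch of what that looks like, run against a small inline HTML string (made up for illustration) rather than the live page. It uses the standard-library-backed "html.parser" so that it runs without extra packages; the warning's own suggestion, features="lxml", is passed the same way when lxml is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, standing in for r.content from the request above.
sample_html = "<html><body><a href='https://example.com'>Example</a></body></html>"

# Naming the parser explicitly avoids the "No parser was explicitly specified" warning.
# Swap in features="lxml" if the lxml package is installed.
soup = BeautifulSoup(sample_html, features="html.parser")

links = soup.find_all("a")
print(links[0].get("href"))
```

Note that this only removes the warning; it does not change which elements are present in the downloaded HTML.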

So I changed the code as follows (because a Stack Overflow post suggested adding the lxml code at the end).

import lxml
import requests
from bs4 import BeautifulSoup

url = "https://www1.nyc.gov/events/events-filter.html#page-1"
r=requests.get("https://www1.nyc.gov/events/events-filter.html#page-1")

soup = BeautifulSoup(r.content)


links = soup.find_all("a")

for link in links:
   if "http" in link.get("href"):
       print ("<a href='%s'>%s</a>" %(link.get("href"), link.text))

g_data = soup.find_all("div", {"span class": "address"})

print(g_data)

But it only shows empty brackets: []

How do I actually get the address data from the website?

For your reference, I have also uploaded a screenshot of the page source.
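A side note on the second snippet: the dictionary passed to find_all maps attribute names to values, so {"span class": "address"} asks for a div carrying an attribute literally named "span class", and that would come back empty even on static HTML. The sketch below shows the difference on a made-up fragment; the markup is an assumption and is not copied from the NYC page. If the addresses are filled in by JavaScript after the page loads, they would still be missing from what requests downloads, whichever selector is used.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real page's structure may differ.
sample_html = """
<div class="event-data-detail">
  <span class="address">Mulberry Street, Little Italy, New York</span>
</div>
"""

soup = BeautifulSoup(sample_html, features="html.parser")

# Asks for a <div> whose attribute named "span class" equals "address";
# no tag has such an attribute, so nothing matches.
wrong = soup.find_all("div", {"span class": "address"})

# Asks for a <span> whose class attribute is "address".
right = soup.find_all("span", {"class": "address"})

print(wrong)
print([tag.get_text(strip=True) for tag in right])
```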

1 Answer:

Answer 0: (score: 0)

Use the site's JSON API instead of bs4; see the code below.

import requests

# The page count (185) comes from the original answer and may have changed since.
for page in range(1, 186):
    link = 'https://www1.nyc.gov/calendar/api/json/search.htm?&sort=DATE&pageNumber=' + str(page)
    req = requests.get(link)
    # Each page holds a list of events under 'items'.
    for item in req.json()['items']:
        address = item['address']
        print('Address:', address)

Output

Address: Mulberry Street, Little Italy, New York
Address: Various locations Citywide
Address:  SECOND AVENUE between EAST   32 STREET and EAST   33 STREET  Manhattan
Address:  FIRST AVENUE between EAST   92 STREET and EAST   93 STREET  Manhattan
Address:  CARROLL STREET between SMITH STREET and COURT STREET  Brooklyn
Address:  BROADWAY between WEST  114 STREET and WEST  116 STREET  Manhattan
Address:  CORTELYOU ROAD between RUGBY ROAD and ARGYLE ROAD  Brooklyn
Address:  QUEENS BOULEVARD between 70 AVENUE and 69 ROAD  Queens
Address:  79 STREET between NORTHERN BOULEVARD and 34 AVENUE  Queens
Address:  PRINCE STREET between MOTT STREET and MULBERRY STREET  Manhattan
Address:  BUSHWICK AVENUE between NOLL STREET and ARION PLACE  Brooklyn
Address: Alley Pond Park Adventure Center
Address: Atlantic Avenue between 4th Avenue and Hicks Street
Address: Alexander von Humboldt statue - Central Park West and 77th Street
Address:  SEVENTH AVENUE between WEST  110 STREET and WEST  111 STREET  Manhattan
Address: Wave Hill House - West 249th Street and Independence Avenue
Address: Broadway between Liberty Street and Rector Street
Address: Anibal Aviles Playground
Address: Myrtle Avenue between Fresh Pond Road and Wyckoff Avenue
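
The loop in the answer hardcodes 185 pages, and that count will drift as events are added and removed. Here is a minimal sketch of an alternative that stops at the first page with no items. It assumes the endpoint returns an empty or missing "items" list past the last page, which the answer does not confirm. extract_addresses and iter_addresses are hypothetical helper names, and max_pages is a safety cap added for illustration.

```python
import requests

API_URL = "https://www1.nyc.gov/calendar/api/json/search.htm"


def extract_addresses(payload):
    """Return the 'address' of every item in one decoded JSON page."""
    return [item.get("address") for item in payload.get("items", [])]


def iter_addresses(max_pages=500):
    """Yield addresses page by page until a page comes back empty.

    Assumption: pages past the end contain no 'items'.
    """
    for page in range(1, max_pages + 1):
        resp = requests.get(
            API_URL,
            params={"sort": "DATE", "pageNumber": page},
            timeout=10,
        )
        resp.raise_for_status()  # fail loudly on HTTP errors
        addresses = extract_addresses(resp.json())
        if not addresses:
            break
        yield from addresses


# Offline check of the parsing step, using sample data shaped like
# the 'items' / 'address' structure the answer's code relies on.
sample_page = {
    "items": [
        {"address": "Anibal Aviles Playground"},
        {"address": "Various locations Citywide"},
    ]
}
print(extract_addresses(sample_page))
```

To print every address from the live API, iterate over the generator: for address in iter_addresses(): print('Address:', address).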