Question

I am trying to scrape this website with Python to build a database of blood donation camps.

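First, when I tried to fetch the site's HTML source with requests or urllib, I got an SSL certificate verification error. I bypassed it (as a quick fix) by setting the verify parameter of requests.get() to False, or by creating an unverified context for urllib. That got me past the error, but when I look at the retrieved HTML, the table content I need is empty. In the website's source the rows sit inside tbody tags, but my requests.get() call only returns the tags themselves, not the content between them. I am fairly new to scraping, so any guidance is appreciated. Thanks.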

from urllib.request import urlopen as uReq
import ssl
from bs4 import BeautifulSoup as soup

my_url = 'https://www.eraktkosh.in/BLDAHIMS/bloodbank/campSchedule.cnt'
sp_context = ssl._create_unverified_context()
uClient = uReq(my_url,context=sp_context)
page_html = uClient.read()
uClient.close()
page_soup=soup(page_html,"html.parser")
table = page_soup.find('tbody')
print (table) #this outputs "<tbody></tbody>"
trow = table.find('tr')
print (trow) #this outputs "None"

The first print statement outputs

<tbody>
</tbody>

and the second outputs

None

Answer 1

This happens because the HTML returned by the first request is an almost empty skeleton.

The data you see on the page is filled in by a subsequent AJAX request. Specifically, https://www.eraktkosh.in/BLDAHIMS/bloodbank/nearbyBB.cnt?hmode=GETNEARBYCAMPS&stateCode=-1&districtCode=-1&_=1560150852947

You can find this request by right-clicking -> Inspect -> Network tab and then reloading the page.

Opinion: you do not need BeautifulSoup to extract information from this page. The data is readily available in JSON format from the API above.
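A minimal sketch of calling that endpoint with only the standard library, reusing the unverified SSL context from the question. The helper names (build_camps_url, extract_rows, fetch_rows) are made up for illustration. It assumes the `_` query parameter is just a cache-busting timestamp that can be left out, and that the rows live under a top-level "data" key, as the other answers suggest.

```python
import json
import ssl
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://www.eraktkosh.in/BLDAHIMS/bloodbank/nearbyBB.cnt"


def build_camps_url(state_code=-1, district_code=-1):
    """Rebuild the AJAX URL from the network tab, minus the '_' timestamp (assumed optional)."""
    query = urlencode({
        "hmode": "GETNEARBYCAMPS",
        "stateCode": state_code,
        "districtCode": district_code,
    })
    return f"{BASE_URL}?{query}"


def extract_rows(payload):
    """Parse the JSON body and return the list under 'data' (empty list if absent)."""
    return json.loads(payload).get("data", [])


def fetch_rows(timeout=30):
    """Fetch the camp rows over HTTPS without certificate verification."""
    # Same quick fix as in the question: this disables certificate checks.
    context = ssl._create_unverified_context()
    with urlopen(build_camps_url(), context=context, timeout=timeout) as response:
        return extract_rows(response.read())


if __name__ == "__main__":
    for row in fetch_rows()[:3]:
        print(row)
```

If the server rejects the request without the `_` parameter, appending the current time in milliseconds to the query would be the first thing to try.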

Hope this helps.

Answer 2

Look at this HTTP call:

https://www.eraktkosh.in/BLDAHIMS/bloodbank/nearbyBB.cnt?hmode=GETNEARBYCAMPS&stateCode=-1&districtCode=-1&_=1560150750074

This is where the data comes from.

You have 2 options:

Make the HTTP call yourself and parse the response.
Scrape the site with a headless browser. See here.

Answer 3

Use the pandas library to save the data to a CSV file.

In the browser's Network tab you will see the JSON response that holds the data for the campSchedule table.

import requests
import pandas as pd

url = 'https://www.eraktkosh.in/BLDAHIMS/bloodbank/nearbyBB.cnt?hmode=GETNEARBYCAMPS&stateCode=-1&districtCode=-1&_=1560150855565'
jsonData = requests.get(url, verify=False).json()

campScheduleData = []

for data in jsonData['data']:
    campSchedule = {"Date":"","Time":"","Camp Name":"","Address":"","State":"","District":"",\
                    "Contact":"","Conducted By":"","Organized by":"","Register":""}
    if "<br/>" in data[1]:
        campSchedule['Date'] = data[1].split("<br/>")[0]

    if "href" in data[10]:
        # strip the quotes that surround the href value before building the full URL
        campSchedule['Register'] = "https://www.eraktkosh.in" + data[10].split("href=")[1].split(" ")[0].strip("'\"")

    campSchedule['Time'] = data[2]
    campSchedule['Camp Name'] = data[3]
    campSchedule['Address'] = data[4]
    campSchedule['State'] = data[5]
    campSchedule['District'] = data[6]
    campSchedule['Contact'] = data[7]
    campSchedule['Conducted By'] = data[8]
    campSchedule['Organized by'] = data[9]
    campScheduleData.append(campSchedule)

df = pd.DataFrame(campScheduleData)
# saves the CSV file as campSchedule.csv in the current working directory
df.to_csv("campSchedule.csv")

If pandas is not installed, install it:

pip3 install pandas
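If installing pandas is not an option, the same CSV could probably be written with the standard csv module instead. This is only a sketch: the column order follows the index mapping used in the code above (indices 1 to 10), the helper names are invented, and the Register cell is left as the raw value rather than turned into a full URL.

```python
import csv

# Column names in the order of row indices 1..10, as mapped in the answer above.
COLUMNS = ["Date", "Time", "Camp Name", "Address", "State", "District",
           "Contact", "Conducted By", "Organized by", "Register"]


def row_to_record(row):
    """Turn one row of the 'data' list into a dict keyed by column name."""
    values = [str(cell) for cell in row[1:11]]
    # Keep only the part of the date cell before the first <br/>, if there is one.
    values[0] = values[0].split("<br/>")[0]
    return dict(zip(COLUMNS, values))


def write_csv(rows, handle):
    """Write a header line followed by one CSV line per row to an open text file."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    for row in rows:
        writer.writerow(row_to_record(row))
```

Usage would look like `with open("campSchedule.csv", "w", newline="") as f: write_csv(jsonData['data'], f)`.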

Answer 4

Using pandas and re

import requests
import pandas as pd
import urllib3; urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
import re

p1 = re.compile(r"(.*?)<br/>")
p2 = re.compile(r"href='(.*?)'")

def get_url(html, p): 
    if html == 'NA':
        url = html
    else:
        url = 'https://www.eraktkosh.in' + p.findall(html)[0]
    return url

def get_date(html, p): 
    if html == 'NA':
        date_string = html
    else:
        date_string = p.findall(html)[0]
    return date_string

r = requests.get('https://www.eraktkosh.in/BLDAHIMS/bloodbank/nearbyBB.cnt?hmode=GETNEARBYCAMPS&stateCode=-1&districtCode=-1&_=1560150750074', verify = False).json()
df = pd.DataFrame(r['data'])
df[1] = df[1].apply(lambda x: get_date(x, p1))
df[10] = df[10].apply(lambda x: get_url(x, p2))
print(df)
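One thing to watch in the helpers above: p.findall(html)[0] raises IndexError for any cell that is neither 'NA' nor a match for the pattern. A slightly more defensive variant, as a sketch (safe_url, safe_date and first_group are names I made up, and the sample link in the usage note is hypothetical):

```python
import re

DATE_PATTERN = re.compile(r"(.*?)<br/>")
HREF_PATTERN = re.compile(r"href='(.*?)'")
SITE_ROOT = "https://www.eraktkosh.in"


def first_group(pattern, text, default):
    """Return the first captured group of the first match, or default when nothing matches."""
    match = pattern.search(text)
    return match.group(1) if match else default


def safe_url(html):
    """Build the absolute registration URL, or return 'NA' when the cell has no href."""
    path = first_group(HREF_PATTERN, html, "")
    return SITE_ROOT + path if path else "NA"


def safe_date(html):
    """Return the text before the first <br/>, or the cell unchanged when there is none."""
    return first_group(DATE_PATTERN, html, html)
```

With the DataFrame from the answer these could be applied as df[1].apply(safe_date) and df[10].apply(safe_url). For a cell like "<a href='/some/path'>Register</a>", safe_url would return the site root followed by /some/path.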

Web scraper cannot get the complete data from a website