A friend and I are building this web scraper for the Michigan campaign finance website. We want to implement pagination on this tool but don't know how to go about it. Right now, the code successfully scrapes and writes to csv, but only for the page specified in the url (see the url link below). Can anyone help us implement pagination on this tool? I've tried the .format() and for-loop approaches without success. My code is below.
import requests
import requests_cache
import lxml.html as lh
import pandas as pd
import sqlite3
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup
from urllib.request import urlopen
base_url = 'https://cfrsearch.nictusa.com/documents/473261/details/filing/contributions?schedule=1A&changes=0&page=11'
#requests_cache.install_cache(cache_name='whitmer_donor_cache', backend='sqlite', expire_after=180)
#Scrape Table Cells
page = requests.get(base_url)
doc = lh.fromstring(page.content)
tr_elements = doc.xpath('//tr')
#print([len(T) for T in tr_elements[:12]])

#Parse Table Header
tr_elements = doc.xpath('//tr')
col = []
i = 0
for t in tr_elements[0]:
    i += 1
    name = t.text_content()
    print('%d:"%s"'%(i,name))
    col.append((name,[]))

###Create Pandas Dataframe###
for j in range(1,len(tr_elements)):
    T = tr_elements[j]
    if len(T)!=9:
        break
    i = 0
    for t in T.iterchildren():
        data = t.text_content()
        if i>0:
            try:
                data = int(data)
            except:
                pass
        col[i][1].append(data)
        i+=1
#print([len(C) for (title,C) in col])

###Format Dataframe###
Dict = {title:column for (title,column) in col}
df = pd.DataFrame(Dict)
df = df.replace('\n','', regex=True)
df = df.replace(' ', ' ', regex=True)
df['Receiving Committee'] = df['Receiving Committee'].apply(lambda x : x.strip().capitalize())

###Print Dataframe###
with pd.option_context('display.max_rows', 10, 'display.max_columns', 10): # more options can be specified also
    print(df)
df.to_csv('Whitmer_Donors.csv', mode='a', header=False)

#create excel writer
#writer = pd.ExcelWriter("Whitmer_Donors.xlsx")
#write dataframe to excel#
#df.to_excel(writer)
#writer.save()
print("Dataframe is written successfully to excel")
Any suggestions on how to proceed?
Answer 0 (score: 0)
You mentioned using .format(), but I don't see it in the code you provided. The given URL has a page parameter, which you can use with str.format():
# note the braces at the end
base_url = 'https://cfrsearch.nictusa.com/documents/473261/details/filing/contributions?schedule=1A&changes=0&page={}'
for page_num in range(1, 100):
    url = base_url.format(page_num)
    page = requests.get(url)  # use `url` here, not `base_url`
    ...  # rest of your code
Ideally, you should keep increasing page_num without setting an upper limit, and break if you get a 404 result or any other error.
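To see the substitution concretely: str.format just drops the page number into the braces, so you can build page urls as pure string work, before ever touching the network:

```python
base_url = 'https://cfrsearch.nictusa.com/documents/473261/details/filing/contributions?schedule=1A&changes=0&page={}'

# Build the first three page urls; no network calls involved
urls = [base_url.format(n) for n in range(1, 4)]
for u in urls:
    print(u)
```

Each resulting url ends in page=1, page=2, page=3, and so on.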
I would strongly suggest putting the various parts of your script into reusable functions that you can call with different arguments. Split it into smaller, more manageable pieces for easier use and debugging.
Answer 1 (score: 0)
I suggest using the params argument of requests.get, like so:
params = {"schedule": "1A", "changes": '0', "page": "1"}
page = requests.get(base_url, params=params)
It creates the correct URL for you automatically.
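Under the hood, requests URL-encodes that dict and appends it to the url as a query string; a stdlib sketch of the same idea:

```python
from urllib.parse import urlencode

base_url = 'https://cfrsearch.nictusa.com/documents/473261/details/filing/contributions'
params = {"schedule": "1A", "changes": "0", "page": "1"}

# urlencode turns the dict into a query string; this is roughly
# what requests.get(base_url, params=params) does for you
url = base_url + '?' + urlencode(params)
print(url)
```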
Also, to get all the pages you can just loop over them. When you hit an empty dataframe, you assume all the data has been downloaded and exit the loop. I implemented a for loop with 41 iterations because I know how many pages there are, but if you don't - you can set a very large number. If you don't want a "magic" number in your code, just use a while loop instead. But be careful not to get stuck in an endless one...
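A sketch of that while-loop shape with a safety cap (fetch_page here is a hypothetical stand-in faked with a list, so the stopping logic is visible without any network calls):

```python
import pandas as pd

# Three fake pages of data; anything past them comes back empty,
# just like a paginated site past its last page
fake_pages = [pd.DataFrame({'amount': [100]}),
              pd.DataFrame({'amount': [200]}),
              pd.DataFrame({'amount': [300]})]

def fetch_page(page_num):
    # stand-in for a real get_page(base_url, params) call
    if page_num <= len(fake_pages):
        return fake_pages[page_num - 1]
    return pd.DataFrame()  # empty frame means we ran off the end

df_list = []
page_num = 1
MAX_PAGES = 1000  # safety cap so a site change can't trap you in an endless loop
while page_num <= MAX_PAGES:
    df = fetch_page(page_num)
    if df.empty:
        break  # empty page: assume all data is downloaded
    df_list.append(df)
    page_num += 1

print(len(df_list))
```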
I took the liberty of changing your code to a more functional approach. Going forward, you may want to modularize it further.
import requests
import requests_cache
import lxml.html as lh
import pandas as pd
import sqlite3
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup
from urllib.request import urlopen
base_url = 'https://cfrsearch.nictusa.com/documents/473261/details/filing/contributions'
#requests_cache.install_cache(cache_name='whitmer_donor_cache', backend='sqlite', expire_after=180)
def get_page(page_url, params):
    #Scrape Table Cells
    page = requests.get(page_url, params=params)
    print(page.text)
    doc = lh.fromstring(page.content)
    tr_elements = doc.xpath('//tr')
    #print([len(T) for T in tr_elements[:12]])

    #Parse Table Header
    tr_elements = doc.xpath('//tr')
    col = []
    i = 0
    for t in tr_elements[0]:
        i += 1
        name = t.text_content()
        print('%d:"%s"' % (i, name))
        col.append((name, []))

    ###Create Pandas Dataframe###
    for j in range(1, len(tr_elements)):
        T = tr_elements[j]
        if len(T) != 9:
            break
        i = 0
        for t in T.iterchildren():
            data = t.text_content().strip()
            if i > 0:
                try:
                    data = int(data)
                except:
                    pass
            col[i][1].append(data)
            i += 1
    #print([len(C) for (title,C) in col])

    ###Format Dataframe###
    Dict = {title: column for (title, column) in col}
    df = pd.DataFrame(Dict)
    df = df.replace('\n', '', regex=True)
    df = df.replace(' ', ' ', regex=True)
    df['Receiving Committee'] = df['Receiving Committee'].apply(
        lambda x: x.strip().capitalize())

    ###Print Dataframe###
    with pd.option_context('display.max_rows', 10, 'display.max_columns',
                           10):  # more options can be specified also
        print(df)
    return df

def get_all_pages(base_url):
    df_list = []
    for i in range(1, 42):
        params = {"schedule": "1A", "changes": '0', "page": str(i)}
        df = get_page(base_url, params)
        if df.empty:
            print("Empty dataframe! All done.")
            break
        df_list.append(df)
        print(df)
        print('====================================')
    return df_list

df_list = get_all_pages(base_url)
pd.concat(df_list).to_csv('Whitmer_Donors.csv', mode='w', header=False)
#create excel writer
#writer = pd.ExcelWriter("Whitmer_Donors.xlsx")
#write dataframe to excel#
#df.to_excel(writer)
#writer.save()
print("Dataframe is written successfully to excel")
Answer 2 (score: 0)
Here's a slightly different implementation. Use read_html() to feed the tables directly into pandas, then use soup to find the next page. If there is no next page, the program exits. This page you're scraping has 40 pages, so starting at page 38, for example, it will exit and print a df of 300 rows. You can make any modifications to the dataframe at the end.
import json  # needed to parse the :pagination attribute

# this function looks for the next page url; returns None if it isn't there
def parse(soup):
    try:
        return json.loads(soup.find('search-results').get(':pagination'))['next_page_url']
    except:
        return None

start_urls = ['https://cfrsearch.nictusa.com/documents/473261/details/filing/contributions?schedule=1A&changes=0&page=38'] # change to 1 for the full run
df_hold_list = [] # collect your dataframes to concat later

for url in start_urls: # you can iterate through different urls or just the one
    page = requests.get(url)
    soup = BeautifulSoup(page.text, "html.parser")
    df = pd.read_html(url)[0]
    df_hold_list.append(df)
    next_url = parse(soup)
    while next_url:
        print(next_url)
        page = requests.get(next_url)
        soup = BeautifulSoup(page.text, "html.parser")
        df = pd.read_html(next_url)[0]  # read the next page, not the original url
        df_hold_list.append(df)
        next_url = parse(soup)

df_final = pd.concat(df_hold_list)
df_final.shape
(300, 9) # 300 rows, 9 columns
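The parse helper above depends on the search-results element carrying its pagination state as JSON in a :pagination attribute. This minimal snippet shows the extraction in isolation (the markup here is a simplified assumption about the real page, not a copy of it):

```python
import json
from bs4 import BeautifulSoup

# Simplified stand-in for the real page's markup
html = '''<search-results :pagination='{"next_page_url": "https://example.com/?page=39"}'></search-results>'''
soup = BeautifulSoup(html, "html.parser")

# Pull the attribute's JSON out and read the next page url from it
pagination = json.loads(soup.find('search-results').get(':pagination'))
print(pagination['next_page_url'])
```

On the last page the real site presumably sets next_page_url to null (or omits it), which is why the helper's except/None fallback ends the loop.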