I'm having trouble scraping a webpage whose URL parameter starts at 1 and increases in steps of 30. The site has many pages containing lists of Kenya's secondary schools, with 30 schools listed per page. I want to scrape all of the data with the code below, but it only returns one page of content, i.e. 30 schools. I have string-formatted the page number into the URL, yet it still returns a single page of data. My code:
#IMPORTING RELEVANT PACKAGES FOR THE WORK
import requests
from bs4 import BeautifulSoup
import time

#DEFINING THE FIRST WEBPAGE
num = 1
#STRING FORMATTING THE URL TO CAPTURE DIFFERENT PAGES
url = 'https://www.kenyaplex.com/schools/?start={}&SchoolType=private-secondary-schools'.format(num)
#DEFINING THE BROWSER HEADERS SO THAT IT CAN WORK WITHOUT ERRORS
headers = {'User-Agent':'Mozilla'}

#GOING THROUGH ALL THE PAGES AND THE LINKS
while num < 452:
    url = 'https://www.kenyaplex.com/schools/?start={}&SchoolType=private-secondary-schools'.format(num)
    time.sleep(1)
    num += 30
    response = requests.get(url, headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    school_info = soup.find_all('div', attrs={'class':'c-detail'})

    #EXTRACTING SPECIFIC RECORDS
    records = []
    for name in school_info:
        Name_of_The_School = name.find('a').text
        Location_of_The_School = name.contents[2][2:]
        Contact_of_The_School = name.contents[4]
        Information_Link = name.find('a')['href']
        #converting the records to a tuple
        records.append((Name_of_The_School,
                        Location_of_The_School,
                        Contact_of_The_School,
                        Information_Link))

#EXPORTING TO A PANDAS FILE
import pandas as pd
df = pd.DataFrame(records, columns = ['Name of The School',
                                      'Location of The School',
                                      'Contact of The School',
                                      'Information_Link'])
df.to_csv('PRIVATE_SECONDARY.csv', index = False, encoding = 'utf-8')
Answer 0 (score: 0)
Move records = [] outside the while loop:
records = []
while num < 452:
    url = 'https://www.kenyaplex.com/schools/?start={}&SchoolType=private-secondary-schools'.format(num)
    time.sleep(1)
    num += 30
    response = requests.get(url, headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    school_info = soup.find_all('div', attrs={'class':'c-detail'})

    #EXTRACTING SPECIFIC RECORDS
    for name in school_info:
        Name_of_The_School = name.find('a').text
        Location_of_The_School = name.contents[2][2:]
        Contact_of_The_School = name.contents[4]
        Information_Link = name.find('a')['href']
        #converting the records to a tuple
        records.append((Name_of_The_School,
                        Location_of_The_School,
                        Contact_of_The_School,
                        Information_Link))
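Why this fixes it: records was re-created on every pass through the while loop, so each new page wiped out the rows collected from the previous page and only the final page's 30 schools survived to the export. A minimal, self-contained illustration (a toy loop, not the scraper itself) of what re-initializing an accumulator inside a loop does:

# Toy example: create the accumulator once, before the loop.
results = []
for page in range(3):
    # results = []      # <- re-creating it here would discard pages 0 and 1
    results.append(page)
print(results)          # [0, 1, 2]; with the reset inside the loop, only [2]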
Answer 1 (score: 0)
The logic is flawed: every iteration of the while loop overwrites the local variable school_info, so all that is left for the for loop afterwards is the value from the last while iteration. I took the liberty of restructuring it:
import requests
from bs4 import BeautifulSoup
import time
import pandas as pd

headers = {'User-Agent': 'Mozilla'}

def get_url(batch):
    return 'https://www.kenyaplex.com/schools/?start={}&SchoolType=private-secondary-schools'.format(batch)

school_data = []
records = []

for batch in range(1, 453, 30):  # the scraper saves the results per iteration
    response = requests.get(get_url(batch), headers=headers)  # headers passed by keyword; the second positional argument of requests.get is params
    soup = BeautifulSoup(response.text, 'html.parser')
    school_info = soup.find_all('div', attrs={'class':'c-detail'})
    school_data.extend(school_info)  # keep the raw divs from every page
    for name in school_info:  # further parsing and records collection; iterate only the current page to avoid duplicate rows
        Name_of_The_School = name.find('a').text
        Location_of_The_School = name.contents[2][2:]
        Contact_of_The_School = name.contents[4]
        Information_Link = name.find('a')['href']
        records.append((Name_of_The_School, Location_of_The_School, Contact_of_The_School, Information_Link))
    time.sleep(1)

df = pd.DataFrame(records, columns=['Name of The School', 'Location of The School', 'Contact of The School', 'Information_Link'])
df.to_csv('PRIVATE_SECONDARY.csv', index=False, encoding='utf-8')
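One side note on the requests call: the second positional parameter of requests.get is params, not headers, which is why the version above passes headers=headers by keyword. If you also let requests build the query string, the manual string formatting disappears entirely. A small sketch of that variant (same URL and parameters as above, just assembled by the library):

import requests

BASE_URL = 'https://www.kenyaplex.com/schools/'
headers = {'User-Agent': 'Mozilla'}

# requests URL-encodes the params dict into ?start=...&SchoolType=...
response = requests.get(
    BASE_URL,
    params={'start': 1, 'SchoolType': 'private-secondary-schools'},
    headers=headers,
)
print(response.url)  # the fully constructed request URL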