I can write code that extracts the first and last page numbers, but my CSV only ends up with the data from page 1. I need all 10 pages of data in the CSV. Where did I go wrong?
# Import the installed modules
import requests
from bs4 import BeautifulSoup
import csv

# To fetch the data from the web page we use the requests get() method
url = "https://www.lookup.pk/dynamic/search.aspx?searchtype=kl&k=gym&l=lahore"
page = requests.get(url)

# Check the HTTP response status code
print(page.status_code)

# Now that the page data has been collected, let's look at what we got
print(page.text)

# The data above can be viewed in a readable format with BeautifulSoup's
# prettify() method; for that we create a bs4 object and call prettify on it
soup = BeautifulSoup(page.text, 'html.parser')
print(soup.prettify())

outfile = open('gymlookup.csv', 'w', newline='')
writer = csv.writer(outfile)
writer.writerow(["Name", "Address", "Phone"])

# Find all DIVs that contain the company info
product_name_list = soup.findAll("div", {"class": "CompanyInfo"})

# Extract the first and last page numbers from the pagination
paging = soup.find("div", {"class": "pg-full-width me-pagination"}).find("ul", {"class": "pagination"}).find_all("a")
start_page = paging[1].text
last_page = paging[len(paging)-2].text

# Now loop through those elements
for element in product_name_list:
    # Take one "div", {"class": "CompanyInfo"} block and find/store its name, address, phone
    name = element.find('h2').text
    address = element.find('address').text.strip()
    phone = element.find("ul", {"class": "submenu"}).text.strip()
    # Write name, address, phone to the CSV
    writer.writerow([name, address, phone])
    # Then move on to the next "div", {"class": "CompanyInfo"} block and repeat

outfile.close()
Answer 0 (score: 1)

You just need more loops. You now need to loop through the URL of each page: see below.
import requests
from bs4 import BeautifulSoup
import csv
root_url = "https://www.lookup.pk/dynamic/search.aspx?searchtype=kl&k=gym&l=lahore"
html = requests.get(root_url)
soup = BeautifulSoup(html.text, 'html.parser')
paging = soup.find("div",{"class":"pg-full-width me-pagination"}).find("ul",{"class":"pagination"}).find_all("a")
start_page = paging[1].text
last_page = paging[len(paging)-2].text
outfile = open('gymlookup.csv','w', newline='')
writer = csv.writer(outfile)
writer.writerow(["Name", "Address", "Phone"])
pages = list(range(1,int(last_page)+1))
for page in pages:
    url = 'https://www.lookup.pk/dynamic/search.aspx?searchtype=kl&k=gym&l=lahore&page=%s' %(page)
    html = requests.get(url)
    soup = BeautifulSoup(html.text, 'html.parser')
    #print(soup.prettify())
    print('Processing page: %s' %(page))
    product_name_list = soup.findAll("div", {"class": "CompanyInfo"})
    for element in product_name_list:
        name = element.find('h2').text
        address = element.find('address').text.strip()
        phone = element.find("ul", {"class": "submenu"}).text.strip()
        writer.writerow([name, address, phone])
outfile.close()
print('Done')
Answer 1 (score: 0)

You should use the page attribute, as in https://www.lookup.pk/dynamic/search.aspx?searchtype=kl&k=gym&l=lahore&page=2

Sample code for 10 pages:
url = "https://www.lookup.pk/dynamic/search.aspx?searchtype=kl&k=gym&l=lahore&page={}"
for page_num in range(1, 11):  # range(1, 10) would stop at page 9; use 11 to cover all 10 pages
    page = requests.get(url.format(page_num))  # the original was missing the closing parenthesis
    # further processing
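As a sanity check on the URL pattern both answers rely on, here is a minimal sketch that builds the full list of page URLs from the scraped last page number, without any network calls. It assumes the site paginates via an `&page=N` query parameter as shown above; the helper name `page_urls` is illustrative, not part of the original code.

```python
# Hypothetical helper: build one URL per results page, assuming the
# site paginates with an &page=N query parameter.
def page_urls(root_url, last_page):
    # last_page arrives as text scraped from the pagination links, e.g. "10"
    return [f"{root_url}&page={n}" for n in range(1, int(last_page) + 1)]

urls = page_urls(
    "https://www.lookup.pk/dynamic/search.aspx?searchtype=kl&k=gym&l=lahore",
    "10",
)
print(len(urls))   # 10 pages in total
print(urls[0])     # ...&page=1
print(urls[-1])    # ...&page=10
```

Looping over `urls` and writing rows inside that loop (as Answer 0 does) is what gets every page into the CSV; building the list first just makes the off-by-one easy to spot.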