I am a beginner in Python programming. I am practicing web scraping with the bs4 module in Python.
I have extracted some fields from a web page, but it only extracts 13 items even though the page contains more than 13. I don't understand why the remaining items are not extracted.
Another thing is that I want to extract the contact number and email address of every item on the page, but they are only available on each item's own linked page. Frankly, as a beginner I am confused about how to visit and scrape the separate page linked from each item on the given page. Please tell me where I am going wrong and, if possible, suggest what to do.
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

res = requests.post('https://www.nelsonalexander.com.au/real-estate-agents/?office=&agent=A')
soup = bs(res.content, 'lxml')
data = soup.find_all("div", {"class": "agent-card large large-3 medium-4 small-12 columns text-center end"})

records = []
for item in data:
    name = item.find('h2').text.strip()
    position = item.find('h3').text.strip()
    records.append({'Names': name, 'Position': position})

df = pd.DataFrame(records, columns=['Names', 'Position'])
df = df.drop_duplicates()
df.to_excel(r'C:\Users\laptop\Desktop\NelsonAlexander.xls', sheet_name='MyData2', index=False, header=True)
With the code above I have only extracted the name and position of each item, but it scrapes just 13 records while the page has more than that. I could not write any code for extracting the contact number and email address of each record, because they appear on each item's separate page, and that is where I got stuck.
The Excel sheet looks like this: (screenshot not reproduced here)
Answer 0 (score: 0)
The site loads the listings dynamically as you scroll, but you can track the AJAX requests and parse the data directly:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0', 'Referer': 'https://www.nelsonalexander.com.au/real-estate-agents/?office=&agent=A'}

records = []
with requests.Session() as s:
    for i in range(1, 22):
        res = s.get(f'https://www.nelsonalexander.com.au/real-estate-agents/page/{i}/?ajax=1&agent=A', headers=headers)
        soup = bs(res.content, 'lxml')
        data = soup.find_all("div", {"class": "agent-card large large-3 medium-4 small-12 columns text-center end"})
        for item in data:
            name = item.find('h2').text.strip()
            position = item.find('h3').text.strip()
            phone = item.find("div", {"class": "small-6 columns text-left"}).find("a").get('href').replace("tel:", "")
            records.append({'Names': name, 'Position': position, 'Phone': phone})

df = pd.DataFrame(records, columns=['Names', 'Position', 'Phone'])
df = df.drop_duplicates()
df.to_excel(r'C:\Users\laptop\Desktop\NelsonAlexander.xls', sheet_name='MyData2', index=False, header=True)
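One portability note on the export line above: newer pandas releases no longer ship the xlwt engine (it was removed in pandas 2.0), so writing a legacy .xls file fails on a current install. A minimal sketch of a safer export, using made-up rows in place of the scraped records:

```python
import pandas as pd

# hypothetical rows standing in for the scraped records (not real data)
records = [{'Names': 'Example Agent', 'Position': 'Sales Consultant', 'Phone': '03 9000 0000'},
           {'Names': 'Example Agent', 'Position': 'Sales Consultant', 'Phone': '03 9000 0000'}]
df = pd.DataFrame(records, columns=['Names', 'Position', 'Phone']).drop_duplicates()

# CSV needs no extra engine; for Excel output, target .xlsx (openpyxl) instead of .xls
df.to_csv('NelsonAlexander.csv', index=False)
```

Switching the filename to `NelsonAlexander.xlsx` in the original `to_excel` call works the same way, provided openpyxl is installed.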
Answer 1 (score: 0)
I am fairly convinced the email addresses are nowhere in the DOM. I modified @drec4s's code slightly so that instead of a fixed page count it keeps going (dynamically) until a page returns no entries.
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
import itertools

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0', 'Referer': 'https://www.nelsonalexander.com.au/real-estate-agents/?office=&agent=A'}

records = []
with requests.Session() as s:
    for i in itertools.count(1):  # page numbers start at 1, not 0
        res = s.get('https://www.nelsonalexander.com.au/real-estate-agents/page/{}/?ajax=1&agent=A'.format(i), headers=headers)
        soup = bs(res.content, 'lxml')
        data = soup.find_all("div", {"class": "agent-card large large-3 medium-4 small-12 columns text-center end"})
        if len(data) > 0:
            for item in data:
                name = item.find('h2').text.strip()
                position = item.find('h3').text.strip()
                phone = item.find("div", {"class": "small-6 columns text-left"}).find("a").get('href').replace("tel:", "")
                records.append({'Names': name, 'Position': position, 'Phone': phone})
                print({'Names': name, 'Position': position, 'Phone': phone})
        else:
            break
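For the second part of the question (visiting each item's own page), neither answer shows the link-following step. Here is a minimal sketch of how it could look, assuming each agent card contains an `<a href>` pointing to the agent's detail page and that phone numbers appear there as `tel:` links; both are assumptions about the markup, not verified against the live site:

```python
from bs4 import BeautifulSoup as bs

def agent_links(listing_html):
    """Pull the per-agent page URLs out of one AJAX listing page."""
    soup = bs(listing_html, 'html.parser')
    cards = soup.find_all("div", {"class": "agent-card large large-3 medium-4 small-12 columns text-center end"})
    # assumed: the first link inside each card leads to the agent's own page
    return [card.find("a", href=True)["href"] for card in cards if card.find("a", href=True)]

def extract_phones(agent_html):
    """Collect any tel: links from an agent's detail page."""
    soup = bs(agent_html, 'html.parser')
    return [a["href"].replace("tel:", "")
            for a in soup.find_all("a", href=True)
            if a["href"].startswith("tel:")]

# Usage sketch (network calls, so not run here; needs `import requests`):
# with requests.Session() as s:
#     listing = s.get('https://www.nelsonalexander.com.au/real-estate-agents/page/1/?ajax=1&agent=A').content
#     for url in agent_links(listing):
#         print(url, extract_phones(s.get(url).content))
```

If the emails really are absent from the DOM (as Answer 1 suggests, e.g. because the site uses a contact form instead), no amount of link-following will recover them; the sketch above only helps for details that do appear on the agent pages.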