I've created a Python script using requests and BeautifulSoup to parse profile names, and the links to those profiles, from a webpage. The content appears to be generated dynamically, but it is present in the page source. So I tried the following, but unfortunately I get nothing.

My attempt so far:
import requests
from bs4 import BeautifulSoup

URL = 'https://www.century21.com/real-estate-agents/Dallas,TX'

headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9,bn;q=0.8',
    'cache-control': 'max-age=0',
    'cookie': 'JSESSIONID=8BF2F6FB5603A416DCFBAB8A3BB5A79E.app09-c21-id8; website_user_id=1255553501;',
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
}

def get_info(link):
    res = requests.get(link, headers=headers)
    soup = BeautifulSoup(res.text, "lxml")
    for item in soup.select(".media__content"):
        profileUrl = item.get("href")
        profileName = item.select_one("[itemprop='name']").get_text()
        print(profileUrl, profileName)

if __name__ == '__main__':
    get_info(URL)
How can I get the content from that page?
Answer 0 (score: 1)
The content you're after is indeed available in the page source. The site is very good at dropping requests when the same user-agent is used over and over, so I used fake_useragent to supply a random one with each request. It works as long as you don't use it too frequently.

Working solution:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
from fake_useragent import UserAgent

URL = 'https://www.century21.com/real-estate-agents/Dallas,TX'

def get_info(s, link):
    # Rotate to a random User-Agent on every request so the site
    # doesn't keep seeing the same one.
    s.headers["User-Agent"] = ua.random
    res = s.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    for item in soup.select(".media__content a[itemprop='url']"):
        profileUrl = urljoin(link, item.get("href"))  # make the relative href absolute
        profileName = item.select_one("span[itemprop='name']").get_text()
        print(profileUrl, profileName)

if __name__ == '__main__':
    ua = UserAgent()
    with requests.Session() as s:
        get_info(s, URL)
Partial output:
https://www.century21.com/CENTURY-21-Judge-Fite-Company-14501c/Stewart-Kipness-2657107a Stewart Kipness
https://www.century21.com/CENTURY-21-Judge-Fite-Company-14501c/Andrea-Anglin-Bulin-2631495a Andrea Anglin Bulin
https://www.century21.com/CENTURY-21-Judge-Fite-Company-14501c/Betty-DeVinney-2631507a Betty DeVinney
https://www.century21.com/CENTURY-21-Judge-Fite-Company-14501c/Sabra-Waldman-2657945a Sabra Waldman
https://www.century21.com/CENTURY-21-Judge-Fite-Company-14501c/Russell-Berry-2631447a Russell Berry
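Since the site may still drop an occasional request, you can wrap the fetch in a retry loop that draws a fresh random User-Agent on each attempt. This is a minimal sketch building on the code above; MAX_RETRIES and the empty-selection check used to detect a blocked page are my own assumptions, not documented behavior of the site:

import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent

MAX_RETRIES = 5  # hypothetical cap; tune as needed

def fetch_soup(s, link, ua):
    # Retry with a new random User-Agent whenever the page comes back without results.
    for _ in range(MAX_RETRIES):
        s.headers["User-Agent"] = ua.random
        res = s.get(link)
        soup = BeautifulSoup(res.text, "lxml")
        # Assumption: an empty selection means we were served a blocked/empty page.
        if soup.select(".media__content a[itemprop='url']"):
            return soup
    return None

with requests.Session() as s:
    soup = fetch_soup(s, 'https://www.century21.com/real-estate-agents/Dallas,TX', UserAgent())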
Answer 1 (score: 1)
The page content is not rendered by JavaScript; as far as I can tell, your code is fine. You only have two problems: locating profileUrl, and handling the NoneType exception it causes. You have to target the a tag to get the data.

You should try the following:
import requests
from bs4 import BeautifulSoup

URL = 'https://www.century21.com/real-estate-agents/Dallas,TX'

headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9,bn;q=0.8',
    'cache-control': 'max-age=0',
    'cookie': 'JSESSIONID=8BF2F6FB5603A416DCFBAB8A3BB5A79E.app09-c21-id8; website_user_id=1255553501;',
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
}

def get_info(link):
    res = requests.get(link, headers=headers)
    soup = BeautifulSoup(res.text, "lxml")
    results = []
    for item in soup.select(".media__content"):
        a_link = item.find('a')
        if a_link:  # skip items with no link, avoiding the NoneType error
            result = {
                'profileUrl': a_link.get('href'),
                'profileName': a_link.get_text()
            }
            results.append(result)
    return results

if __name__ == '__main__':
    info = get_info(URL)
    print(info)
    print(len(info))
Output:
[{'profileName': 'Stewart Kipness',
'profileUrl': '/CENTURY-21-Judge-Fite-Company-14501c/Stewart-Kipness-2657107a'},
....,
{'profileName': 'Courtney Melkus',
'profileUrl': '/CENTURY-21-Realty-Advisors-47551c/Courtney-Melkus-7389925a'}]
941
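Since get_info() returns a list of dicts, it drops straight into csv.DictWriter if you want to persist the results. A small sketch; the agents.csv filename is just an example:

import csv

def save_results(results, path='agents.csv'):
    # Write the list of dicts returned by get_info() to a CSV file.
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=['profileName', 'profileUrl'])
        writer.writeheader()
        writer.writerows(results)

save_results(info)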
Answer 2 (score: 1)
It looks like you can also construct the URL yourself (although just scraping it seems easier):
import requests
from bs4 import BeautifulSoup as bs

URL = 'https://www.century21.com/real-estate-agents/Dallas,TX'

headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9,bn;q=0.8',
    'cache-control': 'max-age=0',
    'cookie': 'JSESSIONID=8BF2F6FB5603A416DCFBAB8A3BB5A79E.app09-c21-id8; website_user_id=1255553501;',
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
}

r = requests.get(URL, headers=headers)
soup = bs(r.content, 'lxml')
items = soup.select('.media')
ids = []
names = []
urls = []

for item in items:
    if item.select_one('[data-agent-id]') is not None:
        anId = item.select_one('[data-agent-id]')['data-agent-id']
        ids.append(anId)
        name = item.select_one('[itemprop=name]').text.replace(' ', '-')
        names.append(name)
        # Rebuild the profile URL from the agent id and name. Note the office
        # slug is hardcoded, so this only fits agents of that one office.
        url = 'https://www.century21.com/CENTURY-21-Judge-Fite-Company-14501c/' + name + '-' + anId + 'a'
        urls.append(url)

results = list(zip(names, urls))
print(results)
Answer 3 (score: 0)

Please try:
profileUrl = "https://www.century21.com/" + item.select("a")[0].get("href")
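As written, that line raises an IndexError for any item without an a tag, and the concatenation produces a double slash since the href already starts with one. A slightly safer variant (my own sketch, not part of this answer) guards the lookup and lets urljoin handle the joining:

from urllib.parse import urljoin

# 'item' is one of the .media__content tags from the loops in the answers above.
a_tag = item.select_one("a")
if a_tag and a_tag.get("href"):
    profileUrl = urljoin("https://www.century21.com/", a_tag["href"])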