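This code fetches the course links from the Udacity website and searches each linked page for a search term ("computer vision"). If the term is found, it prints the link. However, my code also prints links whose pages do not contain the search term, and for some other search terms (e.g. python) it omits some URLs that do contain the term. What could be the reason?
Examples of links without the search term: https://in.udacity.com/course/advanced-android-app-development--ud855
https://in.udacity.com/course/engagement-monetization-mobile-games--ud407 etc.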
import requests
from bs4 import BeautifulSoup
import re
from urllib.parse import urlencode
from urllib.request import urlopen
page = requests.get("https://in.udacity.com/courses/all")
soup = BeautifulSoup(page.content, 'html.parser')
courses = soup.find_all("a", class_='capitalize')
search_term = "computer vision"
i = 1
for link in courses:
    site = urlopen("https://in.udacity.com" + link.get("href")).read()
    if search_term in site.decode():
        print("https://in.udacity.com" + link.get("href"))
Answer 0: (score: 0)
I think the problem occurs because the JavaScript code on the page contains the search_term.
You can try replacing urlopen().read().decode() with requests.get().text.
site = urlopen("https://in.udacity.com" + link.get("href")).read()
if search_term in site.decode():
    print("https://in.udacity.com" + link.get("href"))
# to
site = requests.get("https://in.udacity.com" + link.get("href"))
if search_term in site.text:
    print("https://in.udacity.com" + link.get("href"))
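The idea is that requests.get().text would contain only the characters displayed in the browser. Note, however, that .text is the decoded response body, so it still includes the HTML markup and the contents of script tags.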
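To illustrate the suspected cause, here is a minimal sketch that uses only the standard library. It shows that a term appearing only inside a script block is found in the raw HTML but not in the text outside script and style tags. The sample HTML, the class name VisibleTextParser and the helper visible_text are made up for illustration and are not taken from the Udacity pages.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text nodes, skipping anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the text outside script/style elements, joined by spaces."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page: the term occurs only inside a script block.
sample_html = """
<html><body>
<h1>Advanced Android App Development</h1>
<script>var related = ["computer vision", "robotics"];</script>
</body></html>
"""

print("computer vision" in sample_html)                # search in raw markup
print("computer vision" in visible_text(sample_html))  # search in visible text
```

Searching the extracted text instead of the raw response avoids matches that come only from embedded scripts, which is the behaviour the question describes as unwanted.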
Answer 1: (score: 0)
You can use the following, but keep in mind that it also picks up "Computer Vision" from the navigation sidebar.
import requests
from bs4 import BeautifulSoup as bs
page = requests.get("https://in.udacity.com/courses/all")
# BeautifulSoup is imported under the alias bs, so use that name here
soup = bs(page.content, 'html.parser')
courses = soup.select('a.track-link')
search_term = "computer vision"
for link in courses:
    page = requests.get("https://in.udacity.com" + link['href'])
    soup = bs(page.content, 'lxml')
    # Search the text of the whole document, not the raw markup
    if search_term in soup.select_one('html').text:
        print("https://in.udacity.com" + link.get("href"))
Answer 2: (score: 0)
In the end, this code worked. It extracts all the visible text from the web page and then searches within that text.
import requests
from bs4 import BeautifulSoup
from bs4.element import Comment
import urllib.request
def tag_visible(element):
    # Text inside these tags is not rendered as page content
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    # HTML comments are not visible either
    if isinstance(element, Comment):
        return False
    return True
def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    texts = soup.find_all(string=True)
    visible_texts = filter(tag_visible, texts)
    return " ".join(t.strip() for t in visible_texts)
page = requests.get("https://in.udacity.com/courses/all")
soup = BeautifulSoup(page.content, 'lxml')
courses = soup.select('a.capitalize')
search_term = "computer vision"
for link in courses:
    html = urllib.request.urlopen("https://in.udacity.com" + link['href']).read()
    if search_term in text_from_html(html):
        print("https://in.udacity.com" + link.get("href"))
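One possible reason for the missed matches mentioned in the question (for example with python) is that the in operator is case-sensitive and sensitive to line breaks inside the text; the thread does not confirm this, so treat it as an assumption. A minimal sketch of a more forgiving comparison, where the helper names normalize and contains_term are hypothetical:

```python
import re


def normalize(text):
    """Collapse runs of whitespace to one space and ignore letter case."""
    return re.sub(r"\s+", " ", text).strip().casefold()


def contains_term(text, term):
    """Case-insensitive, whitespace-tolerant substring test."""
    return normalize(term) in normalize(text)


print(contains_term("Intro to Computer\n  Vision", "computer vision"))
print(contains_term("Learn Python basics", "python"))
print(contains_term("Android development", "python"))
```

In the loop above, the condition could then read contains_term(text_from_html(html), search_term) instead of the plain in test.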