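Finding a specific term in a set of web pages

Time: 2019-02-12 10:42:36

Tags: python web-scraping

This code collects the course links from the Udacity site and searches each linked page for the search term ("computer vision"). If the term is found, it prints the link. However, my code also prints links whose pages do not contain the search term. For some other search terms (e.g. python), it omits some URLs whose pages do contain the term. What could be the reason?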

Examples of printed links whose pages do not contain the search term:

e.g.: https://in.udacity.com/course/advanced-android-app-development--ud855

https://in.udacity.com/course/engagement-monetization-mobile-games--ud407 etc.

import requests
from bs4 import BeautifulSoup
from urllib.request import urlopen

page = requests.get("https://in.udacity.com/courses/all")
soup = BeautifulSoup(page.content, 'html.parser')
courses = soup.find_all("a", class_='capitalize')
search_term = "computer vision"

for link in courses:
    # Download the raw HTML of each course page and search it as-is
    site = urlopen("https://in.udacity.com" + link.get("href")).read()
    if search_term in site.decode():
        print("https://in.udacity.com" + link.get("href"))

3 Answers:

Answer 0 (score: 0)

I think the problem occurs because the page's JavaScript code contains search_term.

You can try replacing urlopen().read().decode() with requests.get().text:

site = urlopen("https://in.udacity.com" + link.get("href")).read()
if search_term in site.decode():
    print("https://in.udacity.com" + link.get("href"))
# becomes
site = requests.get("https://in.udacity.com" + link.get("href"))
if search_term in site.text:
    print("https://in.udacity.com" + link.get("href"))

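Note that requests.get().text is the decoded response body, so it still contains the HTML markup and any inline scripts, not only the characters displayed in the browser. If the term appears inside a script, the visible text has to be extracted from the HTML before searching.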
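
To illustrate the point about script content, here is a minimal offline sketch. It uses the standard library's html.parser instead of requests or BeautifulSoup so that it runs without network access; VisibleTextParser, visible_text and sample_html are hypothetical names and data made up for this example, not something taken from the Udacity pages.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes, skipping anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page: the term appears only inside a <script> block.
sample_html = """<html><head><title>Android course</title>
<script>var related = ["computer vision", "robotics"];</script></head>
<body><h1>Advanced Android App Development</h1></body></html>"""

search_term = "computer vision"
print(search_term in sample_html)                # raw HTML: matches the script
print(search_term in visible_text(sample_html))  # text outside script/style only
```

Searching the raw body reports a match for a page that never shows the term to the reader, which is consistent with the false positives described in the question.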

Answer 1 (score: 0)

You can use the following, but bear in mind that it also picks up Computer Vision from the navigation sidebar.

import requests
from bs4 import BeautifulSoup as bs

page = requests.get("https://in.udacity.com/courses/all")
soup = bs(page.content, 'html.parser')  # BeautifulSoup is imported under the alias bs
courses = soup.select('a.track-link')
search_term = "computer vision"

for link in courses:
    page = requests.get("https://in.udacity.com" + link['href'])
    soup = bs(page.content, 'lxml')
    if search_term in soup.select_one('html').text:
        print("https://in.udacity.com" + link.get("href"))
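
One more thing that may explain the missed pages for terms such as python: the in operator on strings is case-sensitive, so "python" does not match "Python", and "computer vision" does not match "Computer Vision". A minimal sketch of a more forgiving comparison follows; contains_term and page_text are hypothetical names and data for illustration, and whether case really is the cause on the Udacity pages is an assumption.

```python
import re


def contains_term(text, term):
    """Case-insensitive substring test that ignores runs of whitespace."""

    def normalize(s):
        # Collapse newlines/tabs/multiple spaces, then fold case
        return re.sub(r"\s+", " ", s).strip().casefold()

    return normalize(term) in normalize(text)


# Hypothetical extracted page text with odd spacing and capitalisation
page_text = "Intro to\n   Computer   Vision with Python"

print("computer vision" in page_text)                # plain substring test
print(contains_term(page_text, "computer vision"))   # normalized test
print(contains_term(page_text, "python"))
```

In the loop above, the condition would then read contains_term(soup.select_one('html').text, search_term).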

Answer 2 (score: 0)

In the end, this code worked. It extracts all the visible text from the web page and then searches within that text.

import requests
from bs4 import BeautifulSoup
from bs4.element import Comment
import urllib.request


def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True


def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    texts = soup.find_all(string=True)  # every text node in the document
    visible_texts = filter(tag_visible, texts)
    return " ".join(t.strip() for t in visible_texts)

page = requests.get("https://in.udacity.com/courses/all")
soup = BeautifulSoup(page.content, 'lxml')
courses = soup.select('a.capitalize')

search_term = "computer vision"

for link in courses:
    html = urllib.request.urlopen("https://in.udacity.com" + link['href']).read()
    if search_term in text_from_html(html):
        print("https://in.udacity.com" + link.get("href"))
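
A side note on how the URLs are built: the loop concatenates "https://in.udacity.com" with each href, which assumes every href is a site-relative path and that no link is listed twice. A minimal sketch of a more defensive version, using urllib.parse.urljoin, is shown below; absolute_course_urls and the sample hrefs list are hypothetical and only illustrate the idea.

```python
from urllib.parse import urljoin

BASE_URL = "https://in.udacity.com/courses/all"


def absolute_course_urls(hrefs, base=BASE_URL):
    """Resolve hrefs against the listing page and drop duplicates, keeping order."""
    seen = set()
    urls = []
    for href in hrefs:
        if not href:  # <a> tags without an href yield None
            continue
        url = urljoin(base, href)  # leaves absolute URLs unchanged
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


# Hypothetical href values, as link.get("href") might return them
hrefs = [
    "/course/advanced-android-app-development--ud855",
    "https://in.udacity.com/course/engagement-monetization-mobile-games--ud407",
    "/course/advanced-android-app-development--ud855",
    None,
]

for url in absolute_course_urls(hrefs):
    print(url)
```

With this helper, the loop would iterate over absolute_course_urls(link.get("href") for link in courses) and fetch each URL once.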