Python - Web scraping problem

Time: 2016-05-02 22:03:19

Tags: python web screen-scraping

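I'm having trouble with my web scraper when trying to get a channel's title. I'm not sure how to fix it, but from some testing with the channel function, video links seem to work with it, when only channel links should work with the YoutubeChannel function.

Any ideas on how to fix it?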

#Required Modules
import urllib
import re
import sys #Needed for sys.exit() in the menu loop

#Defining the YouTube Video function
def YoutubeVideo():
    #Making videoLink equal to whatever the user enters as their video link
    videoLink = input ('\nWhat is your video link? (In quotations, with http included)\n')

    #Goes to the video URL, opens it and reads the HTML file
    htmlfile = urllib.urlopen(videoLink) #Searches for this URL
    htmltext = htmlfile.read() #Reads the HTML file and sets it to htmltext

    #Setup for the view counter
    regexView = "<div class=\"watch-view-count\">(.+?)</div>" #Searches for the view count number and sets it to regexView
    pattern = re.compile(regexView)
    viewCount = re.findall(pattern, htmltext) 

    #Setup for the video title
    regexTitle = "<title>(.+?)</title>" #Searches for the title of the video
    patternTitle = re.compile(regexTitle)
    videoTitle = re.findall(patternTitle, htmltext)

    #Setup for the video upload date
    regexUpload = "<strong class=\"watch-time-text\">(.+?)</strong>"
    patternUpload = re.compile(regexUpload)
    videoUpload = re.findall(patternUpload, htmltext)

    print ("\n%s" % (videoLink)) #Prints the video link, primarily for testing
    print ("\nThe title of your video is %s and has %s views.\nIt was %s." % (videoTitle, viewCount, videoUpload)) #Prints the information about the video


#Defining the YouTube Channel function
def YoutubeChannel():
    #Making channelLink equal to whatever the user enters as their channel link
    channelLink = input ('\nWhat is your channel link? (In quotations, with http included)\n')

    #Goes to the channel URL, opens it and reads the HTML file
    htmlfile = urllib.urlopen(channelLink) #Searches for this URL
    htmltext = htmlfile.read() #Reads the HTML file and sets it to htmltext

    #Setup for the channel name
    regexChannelTitle = "<title>(.+?)</title>" #Searches for the page title of the channel
    patternChannelTitle = re.compile(regexChannelTitle)
    channelTitle = re.findall(patternChannelTitle, htmltext)

    print (channelTitle)



ans  = True
while ans: 
    print ("\n[1] Get information regarding a YouTube video.")
    print ("\n[2] Get information regarding a YouTube channel.")
    print ("\n[Q] Quit the application.")

    ans = raw_input("\nWhat would you like to do now? ")
    if ans == "1":
        YoutubeVideo()
    elif ans == "2":
        YoutubeChannel()
    elif ans.lower() == "q": #Accepts both "q" and "Q", matching the [Q] shown in the menu
        sys.exit(0)
    elif ans != "":
        print ("Not a valid choice, try again.")

1 answer:

Answer 0: (score: 0)

I'm not familiar with the approach you're using to parse the HTML content, but you could use BeautifulSoup, which is easier.

import requests
from bs4 import BeautifulSoup

# channel url = https://www.youtube.com/channel/XXXXXX

url = "your channel link"
page = requests.get(url)
plain_text = page.text
soup = BeautifulSoup(plain_text,"html.parser")
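# The span holding the channel name (looked up here, but not used below)
span = soup.find('span', {'class': 'qualified-channel-title-text'})
# The header link that wraps the channel title; its title attribute holds the name
title = soup.find('a', {'class': 'spf-link branded-page-header-title-link yt-uix-sessionlink'})
title = title.get('title')
print(title)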

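As you can see, the HTML tag I'm using wraps the whole title, including the link and the small picture, and it also carries the title as text. It uses the class "spf-link branded-page-header-title-link yt-uix-sessionlink", and I then read the title attribute from it :)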
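To make that idea concrete without any network access, here is a minimal sketch of the same "find the link by class, then read its title attribute" step. It uses Python 3's standard-library html.parser instead of BeautifulSoup, purely so it runs with nothing installed. The names TitleAttrFinder, find_channel_title and SAMPLE_HTML are made up for illustration, and the sample markup is a hypothetical stand-in, not a copy of a real YouTube page.

```python
from html.parser import HTMLParser


class TitleAttrFinder(HTMLParser):
    """Record the title attribute of the first <a> tag carrying a given class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.title = None

    def handle_starttag(self, tag, attrs):
        # Only look at <a> tags, and stop after the first match.
        if tag != "a" or self.title is not None:
            return
        attr_map = dict(attrs)
        # A class attribute may list several classes separated by whitespace.
        classes = (attr_map.get("class") or "").split()
        if self.wanted_class in classes:
            self.title = attr_map.get("title")


def find_channel_title(html, wanted_class="branded-page-header-title-link"):
    """Return the title attribute of the matching link, or None if absent."""
    finder = TitleAttrFinder(wanted_class)
    finder.feed(html)
    return finder.title


# Hypothetical markup, loosely modelled on the class names mentioned above.
SAMPLE_HTML = (
    '<h1><a class="spf-link branded-page-header-title-link yt-uix-sessionlink" '
    'href="/channel/XXXXXX" title="Example Channel">Example Channel</a></h1>'
)

print(find_channel_title(SAMPLE_HTML))
```

Matching on a single distinctive class, rather than the full class string, means the lookup would still succeed if the other classes on the tag changed. When nothing matches, the function returns None, so the caller can tell that the page is probably not a channel page.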

Hope this is useful.
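On the original problem of video links being accepted by the channel function: every page has a <title> tag, so a check on the URL itself may be the simplest guard. The following is only a sketch for Python 3, assuming channel links use the https://www.youtube.com/channel/XXXXXX form shown in the comment above. The helper name looks_like_channel_url is made up for illustration, and other channel URL forms would need extra cases.

```python
from urllib.parse import urlparse


def looks_like_channel_url(url):
    """Return True if the URL is on a youtube.com host and its path starts with /channel/."""
    parts = urlparse(url)
    host = (parts.hostname or "").lower()
    on_youtube = host == "youtube.com" or host.endswith(".youtube.com")
    return on_youtube and parts.path.startswith("/channel/")


# Placeholder links in the same style as the comment in the snippet above.
print(looks_like_channel_url("https://www.youtube.com/channel/XXXXXX"))
print(looks_like_channel_url("https://www.youtube.com/watch?v=XXXXXX"))
```

Calling a check like this before fetching the page would let the channel function reject a video link early instead of printing that page's title.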

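Note that if you want to run it, you will have to install beautifulsoup and requests; both can be easily installed with pip (the packages are named beautifulsoup4 and requests).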
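If you are unsure whether both packages are already installed, a quick check along these lines may help. It is a small Python 3 sketch using only the standard library; the helper name missing_modules is made up for illustration, and the names it checks are the import names used in the snippet above (requests and bs4).

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# requests and bs4 are the import names used in the snippet above.
missing = missing_modules(["requests", "bs4"])
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are available.")
```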