Getting a download link from a web page with Python 2.7

Date: 2016-06-28 09:04:21

Tags: python filter web-scraping beautifulsoup

So I'm writing this program to make a repetitive task less annoying. It's supposed to take a link, filter for the "Download STV Demo" button, scrape the URL from that button, and use it to download the file. Downloading a file from a URL works fine; I just can't open the right URL. It downloads fine from stackoverflow, but not from the site I actually want, where I get a 403 Forbidden error. Does anyone have an idea how to make it work on http://sizzlingstats.com/stats/479453 and filter for the Download STV Demo button?

import random, sys, urllib2, httplib2, win32clipboard, requests, urlparse
from copy import deepcopy
from bs4 import SoupStrainer
from bs4 import BeautifulSoup
from urllib2 import Request
from urllib2 import urlopen
#When I wrote this, only God and I knew what I was writing
#Now only God knows

page = raw_input("Please copy the .ss link and hit enter... ")
win32clipboard.OpenClipboard()
page = win32clipboard.GetClipboardData()
win32clipboard.CloseClipboard()
s = page
try:
    page = s.replace("http://","http://www.")
    print page + " Found..."
except:
    page = s.replace("www.","http://www.")
    print page

req = urllib2.Request(page, '', headers = { 'User-Agent' : 'Mozilla/5.0' })
req.headers['User-agent'] = 'Mozilla/5.0'
req.add_header('User-agent', 'Mozilla/5.0')
print req
soup = BeautifulSoup(page, 'html.parser')
print soup.prettify()
links = soup.find_all("Download STV Demo")
for tag in links:
    link = links.get('href',None)
    if "Download STV Demo" in link:
        print link

file_name = page.split('/')[-1]
u = urllib2.urlopen(page)
f = open(file_name, 'wb')
meta = u.info()
file_size = int(meta.getheaders("Content-Length")[0])
print "Downloading: %s Bytes: %s" % (file_name, file_size)

file_size_dl = 0
block_sz = 8192
while True:
    buffer = u.read(block_sz)
    if not buffer:
        break
    file_size_dl += len(buffer)
    f.write(buffer)
    status = r"%10d  [%3.2f%%]" % (file_size_dl, file_size_dl * 100. / file_size)
    status = status + chr(8)*(len(status)+1)
    print status,
f.close()

2 Answers

Answer 0 (score: 1)

Let's look at your code. First, you import a lot of modules that you never use (maybe this isn't your whole code), and others that you do use but don't actually need. In fact, the only import you really need is:

from urllib2 import urlopen

(you'll see why later), and possibly win32clipboard for the input. Your input works, so I'll keep that part of the code:

import win32clipboard
page = raw_input("Please copy the .ss link and hit enter... ")  # raw_input in Python 2
win32clipboard.OpenClipboard()
page = win32clipboard.GetClipboardData()
win32clipboard.CloseClipboard()

But I don't really see the purpose of that clipboard step; isn't it easier to use something like this:

page = raw_input("Please enter the .ss link: ")

Then, this part of the code really isn't necessary:

s = page
try:                                            
    page = s.replace("http://","http://www.")   
    print page + " Found..."                   
except:                                             
    page = s.replace("www.","http://www.")      
    print page   

so I would delete it. The next part should look like this:

from urllib2 import Request, urlopen
from bs4 import BeautifulSoup
req = Request(page, headers = { 'User-Agent' : 'Mozilla/5.0' })
#req.headers['User-agent'] = 'Mozilla/5.0'      # you don't need this
#req.add_header('User-agent', 'Mozilla/5.0')    # you don't need this
print req
html = urlopen(req)        #you need to open page with urlopen before using BeautifulSoup
# it is to fix this error:
##      UserWarning: "b'http://www.sizzlingstats.com/stats/479453'" looks like a URL.
##      Beautiful Soup is not an HTTP client. You should probably use an HTTP client
##      to get the document behind the URL, and feed that document to Beautiful Soup.
soup = BeautifulSoup(html, 'html.parser')   # variable page changed to html
# print soup.prettify()         # I commented this because you don't need to print html
                                # but if you want to see that it's work just uncomment it

I'm not actually going to use this code, and I'll explain why, but if you ever need to scrape another page with BeautifulSoup you can reuse it.

That's because you don't need this part:

links = soup.find_all("Download STV Demo")

The problem is that there is no "Download STV Demo" in the HTML code, at least not in the static HTML, because that part of the page is generated by JavaScript. So there is no link to find; if you print(links) you will see that links == []. That also means you don't need this:

for tag in links:
    link = links.get('href', None)      # like I said, there is no use for this
    if "Download STV Demo" in link:     # because the variable links is an empty list
        print link
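
(For reference, if the button were present in the static HTML, you would search by tag name and link text rather than passing the button text to find_all. This is only a sketch and assumes the button is rendered as an <a> tag with that exact text:)

links = soup.find_all("a", text="Download STV Demo")   # <a> tags whose text matches
for tag in links:
    print tag.get("href")                              # the URL behind the button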

So, as I said, the part of the page with the link we need is generated by JavaScript. You could scrape the script to find it, but that is much harder. However, look at the URL we are trying to find; it looks like this:

http://sizzlingstv.s3.amazonaws.com/stv/479453.zip

and now look at your URL, which looks like this:

http://sizzlingstats.com/stats/479453

To build the link http://sizzlingstv.s3.amazonaws.com/stv/479453.zip you only need its last part, in this case 479453, and you already have it: it is also the last part of the link you have (http://sizzlingstats.com/stats/479453). You can even use that number as the file_name. Here is the code that does exactly that:

file_name = page.split('/')[-1]
download_link = 'http://sizzlingstv.s3.amazonaws.com/stv/' + file_name  + '.zip'

After that, I'll copy some of your code:

u = urlopen(download_link)
meta = u.info()    
file_size = int(meta.getheaders("Content-Length")[0])
print "Downloading: %s Bytes: %s" % (file_name, file_size)

The following part works:

f = open(file_name + '.zip', 'wb')    # I added '.zip'
file_size_dl = 0
block_sz = 8192
while True:
    buffer = u.read(block_sz)
    if not buffer:
        break
    file_size_dl += len(buffer)
    f.write(buffer)
    status = r"%10d  [%3.2f%%]" % (file_size_dl, file_size_dl * 100. / file_size)
    status = status + chr(8)*(len(status)+1)
    print status
f.close()

Maybe you want to see the download progress messages, but I think it's easier to use this:

f = open(file_name + '.zip', 'wb') 
f.write(u.read())
print "Downloaded" 
f.close()

And here is just the code:

from urllib2 import urlopen

import win32clipboard
page = raw_input("Please copy the .ss link and hit enter... ")  # raw_input in Python 2
win32clipboard.OpenClipboard()
page = win32clipboard.GetClipboardData()
win32clipboard.CloseClipboard()

# or use:
# page = raw_input("Please enter the .ss link: ")

file_name = page.split('/')[-1]
download_link = 'http://sizzlingstv.s3.amazonaws.com/stv/' + file_name  + '.zip'
u = urlopen(download_link)
meta = u.info()    
file_size = int(meta.getheaders("Content-Length")[0])
print "Downloading: %s Bytes: %s" % (file_name, file_size)

f = open(file_name + '.zip', 'wb')    # I added '.zip'
file_size_dl = 0
block_sz = 8192
while True:
    buffer = u.read(block_sz)
    if not buffer:
        break
    file_size_dl += len(buffer)
    f.write(buffer)
    status = r"%10d  [%3.2f%%]" % (file_size_dl, file_size_dl * 100. / file_size)
    status = status + chr(8)*(len(status)+1)
    print(status)
f.close()

# or use:
##f = open(file_name + '.zip', 'wb') 
##f.write(u.read())
##print "Downloaded" 
##f.close()

Answer 1 (score: 0)

The content of that page is generated dynamically through their API.

>>> import requests
>>>
>>> requests.get('http://sizzlingstats.com/api/stats/479453').json()['stats']['stvUrl']
u'http://sizzlingstv.s3.amazonaws.com/stv/479453.zip'

You are getting the 403 because they block the default user agent.

You did create a req object with a User-Agent header, but then you called urllib2.urlopen(page) instead of using it.
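
In other words, the Request object carrying the header is what has to be opened; a minimal sketch of that fix:

req = urllib2.Request(page, headers={'User-Agent': 'Mozilla/5.0'})
u = urllib2.urlopen(req)    # open the Request object, not the bare URL string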

You are also passing page (a URL string) to BeautifulSoup, which is a mistake:

soup = BeautifulSoup(page, 'html.parser')
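
Putting this together, here is a minimal sketch using requests; it assumes the /api/stats/<id> endpoint returns JSON containing stats.stvUrl, as shown in the interactive session above:

import requests

page = 'http://sizzlingstats.com/stats/479453'
stats_id = page.split('/')[-1]                     # e.g. 479453
api_url = 'http://sizzlingstats.com/api/stats/' + stats_id
stv_url = requests.get(api_url).json()['stats']['stvUrl']

# download the demo, sending a browser-like User-Agent since the default one is blocked
r = requests.get(stv_url, headers={'User-Agent': 'Mozilla/5.0'}, stream=True)
with open(stv_url.split('/')[-1], 'wb') as f:
    for chunk in r.iter_content(8192):
        f.write(chunk)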