I have a problem that I think is quite interesting. I have collected a large number of links while web scraping, and I only want to download content from ordinary pages, so during the crawling phase I skipped every link that carried an extension such as .PDF, .avi, .jpeg and so on.
So I now have a list of scraped links without extensions, but when I start downloading the content, some of them turn out to be PDFs, music files, images, or MS Word documents. How can I skip these, i.e. detect the "hidden" extension of a link before downloading its content? (I sketch the kind of check I have in mind right after the examples below.)
Examples:
PDF:http://www.komunala-radovljica.si/library/includes/file.asp?FileId=168
PDF:http://www.hyundai.si/files/9861/HY-Mursak15_204x280-Motorevija_TISK.pdf?download (here I could simply look for the string ".PDF" in the link)
MS Word:http://www.plinarna-maribor.si/bin?bin.svc=obj&bin.id=2D7F844C-C294-34B6-CECC-A65C2ADCF92A
MP4:http://www.hyundai.si/files/9865/Hyundai_Hokej_Mursak_Zvok_17sek_MP4.mp4?download (here I could simply look for the string "MP4" in the link)
CSS:http://global.careers.ppg.com/CMSPages/GetResource.ashx?stylesheetname=CPJobsLayout
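What I have in mind is something along the lines of the sketch below: first strip the query string with urlparse and check the real extension of the path (that would catch the ".pdf?download" and ".mp4?download" cases), and for links with no extension at all ask the server with a HEAD request and look at the Content-Type header instead of downloading the body. The function names are just placeholders I made up for this example, and I have not tested it against my whole link list:

import mimetypes
import requests
from urlparse import urlparse   # Python 2; in Python 3 this is urllib.parse

def guess_type_from_path(url):
    # Drop the query string ('?download') and guess from the path alone.
    path = urlparse(url).path
    return mimetypes.guess_type(path)[0]          # e.g. 'application/pdf' or None

def type_from_headers(url):
    # Ask the server only for the headers; the body itself is not downloaded.
    try:
        r = requests.head(url, allow_redirects=True, timeout=10)
        return r.headers.get('Content-Type', '').split(';')[0].strip()
    except requests.RequestException:
        return None

def looks_like_html(url):
    # If both checks give nothing, treat the link as HTML so it is not dropped silently.
    ctype = guess_type_from_path(url) or type_from_headers(url)
    return ctype in (None, 'text/html')

print looks_like_html("http://www.hyundai.si/files/9861/HY-Mursak15_204x280-Motorevija_TISK.pdf?download")   # False
print looks_like_html("http://www.komunala-radovljica.si/library/includes/file.asp?FileId=168")              # depends on the server's Content-Type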
My code:
#!/usr/bin/python
# -*- coding: utf-8 -*-
# encoding=UTF-8
#
# DOWNLOADER
# To grab the text content of webpages and save it to TinyDB database.
import re, time, urllib, requests, bs4
from bs4 import BeautifulSoup
start_time = time.time()
# Open file with urls.
with open("Q:/SIIT/JV_Marko_Boro/Detector/test_podjetja_2015/podjetja_0_100_url_test.txt") as f:
urls = f.readlines()
# Open file to write content to.
with open("Q:/SIIT/JV_Marko_Boro/Detector/test_podjetja_2015/podjetja_0_100_vsebina_test.txt", 'wb') as v:
# Read the urls one by one
for url in urls[0:len(urls)]:
# HTTP
if str(url)[0:7] == "http://":
print "URL " + str(url)
# Read the HTML of url
soup = BeautifulSoup(urllib.urlopen(url).read(), "html.parser")
# EXTRACT TEXT
# kill all script and style elements
for script in soup(["script", "style"]):
script.extract() # rip it out
# get text
text = soup.get_text().encode('utf-8')
# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)
# manually insert Slavic characters
text = text.replace('ÄŤ', 'č')
text = text.replace('ÄŤ', 'č')
text = text.replace('ÄŚ', 'Č')
text = text.replace('Ĺľ', 'ž')
text = text.replace('Ĺľ', 'ž')
text = text.replace('Ĺ˝', 'Ž')
text = text.replace('Ĺ˝', 'Ž')
text = text.replace('š', 'š')
text = text.replace('š', 'š')
text = text.replace('Ĺ ', 'Š')
text = text.replace('Â', '')
text = text.replace('–', '')
# Write url to file.
v.write(url)
# Write delimiter between url and text
v.write("__delimiter_*_between_*_url_*_and_*_text__")
v.write(text)
# Delimiter to separate contents. Stupid way of writing content to file but due to problems with čšž characters ...
v.write("__delimiter_*_between_*_two_*_webpages__")
# HTTPS
elif str(url)[0:8] == "https://":
print "URL " + str(url)
r = requests.get(url, verify=True)
html = r.text.encode('utf-8')
#soup = BeautifulSoup(html, "lxml")
soup = BeautifulSoup(html, "html.parser")
# EXTRACT TEXT
# kill all script and style elements
for script in soup(["script", "style"]):
script.extract() # rip it out
# get text
text = soup.get_text().encode('utf-8')
# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)
# manually insert Slavic characters
text = text.replace('ž', 'ž')
text = text.replace('Ž', 'Ž')
text = text.replace('Å¡', 'š')
text = text.replace('Å ', 'Š')
text = text.replace('Ä', 'č')
#text = text.replace('•', '')
# Write url to file.
v.write(url)
# Write delimiter between url and text
v.write("__delimiter_*_between_*_url_*_and_*_text__")
v.write(text)
# Delimiter to separate contents. Stupid way of writing content to file but due to problems with čšž characters ...
v.write("__delimiter_*_between_*_two_*_webpages__")
else:
print "URL ERROR"
print "--- %s seconds ---" % round((time.time() - start_time),2)