I have a problem that I think is quite interesting. I have collected a large number of links while web scraping, and I only want to download content from ordinary pages, so during the crawling phase I skipped every link that carried an extension such as .PDF, .avi, .jpeg and so on.
So I now have a list of scraped links without extensions, but when I start downloading the content, some of them turn out to be PDFs, music files, images, or MS Word documents. How can I skip these, i.e. detect the "hidden" extension of a link before downloading its content? (I sketch the kind of check I have in mind right after the examples below.)
Examples:
PDF:http://www.komunala-radovljica.si/library/includes/file.asp?FileId=168
PDF:http://www.hyundai.si/files/9861/HY-Mursak15_204x280-Motorevija_TISK.pdf?download (here I could simply look for the string ".PDF" in the link)
MS Word:http://www.plinarna-maribor.si/bin?bin.svc=obj&bin.id=2D7F844C-C294-34B6-CECC-A65C2ADCF92A
MP4:http://www.hyundai.si/files/9865/Hyundai_Hokej_Mursak_Zvok_17sek_MP4.mp4?download (here I could simply look for the string "MP4" in the link)
CSS:http://global.careers.ppg.com/CMSPages/GetResource.ashx?stylesheetname=CPJobsLayout
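What I have in mind is something along the lines of the sketch below: first strip the query string with urlparse and check the real extension of the path (that would catch the ".pdf?download" and ".mp4?download" cases), and for links with no extension at all ask the server with a HEAD request and look at the Content-Type header instead of downloading the body. The function names are just placeholders I made up for this example, and I have not tested it against my whole link list:

import mimetypes
import requests
from urlparse import urlparse   # Python 2; in Python 3 this is urllib.parse

def guess_type_from_path(url):
    # Drop the query string ('?download') and guess from the path alone.
    path = urlparse(url).path
    return mimetypes.guess_type(path)[0]          # e.g. 'application/pdf' or None

def type_from_headers(url):
    # Ask the server only for the headers; the body itself is not downloaded.
    try:
        r = requests.head(url, allow_redirects=True, timeout=10)
        return r.headers.get('Content-Type', '').split(';')[0].strip()
    except requests.RequestException:
        return None

def looks_like_html(url):
    # If both checks give nothing, treat the link as HTML so it is not dropped silently.
    ctype = guess_type_from_path(url) or type_from_headers(url)
    return ctype in (None, 'text/html')

print looks_like_html("http://www.hyundai.si/files/9861/HY-Mursak15_204x280-Motorevija_TISK.pdf?download")   # False
print looks_like_html("http://www.komunala-radovljica.si/library/includes/file.asp?FileId=168")              # depends on the server's Content-Type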
My code:
#!/usr/bin/python
# -*- coding: utf-8 -*-
# encoding=UTF-8
#
# DOWNLOADER
# To grab the text content of webpages and save it to TinyDB database.
import re, time, urllib, requests, bs4
from bs4 import BeautifulSoup
start_time = time.time()
# Open file with urls.
with open("Q:/SIIT/JV_Marko_Boro/Detector/test_podjetja_2015/podjetja_0_100_url_test.txt") as f:
urls = f.readlines()
# Open file to write content to.
with open("Q:/SIIT/JV_Marko_Boro/Detector/test_podjetja_2015/podjetja_0_100_vsebina_test.txt", 'wb') as v:
# Read the urls one by one
for url in urls[0:len(urls)]:
# HTTP
if str(url)[0:7] == "http://":
print "URL " + str(url)
# Read the HTML of url
soup = BeautifulSoup(urllib.urlopen(url).read(), "html.parser")
# EXTRACT TEXT
# kill all script and style elements
for script in soup(["script", "style"]):
script.extract() # rip it out
# get text
text = soup.get_text().encode('utf-8')
# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)
# manually insert Slavic characters
text = text.replace('ÄŤ', 'č')
text = text.replace('ÄŤ', 'č')
text = text.replace('ÄŚ', 'Č')
text = text.replace('Ĺľ', 'ž')
text = text.replace('Ĺľ', 'ž')
text = text.replace('Ĺ˝', 'Ž')
text = text.replace('Ĺ˝', 'Ž')
text = text.replace('š', 'š')
text = text.replace('š', 'š')
text = text.replace('Ĺ ', 'Š')
text = text.replace('Â', '')
text = text.replace('–', '')
# Write url to file.
v.write(url)
# Write delimiter between url and text
v.write("__delimiter_*_between_*_url_*_and_*_text__")
v.write(text)
# Delimiter to separate contents. Stupid way of writing content to file but due to problems with čšž characters ...
v.write("__delimiter_*_between_*_two_*_webpages__")
# HTTPS
elif str(url)[0:8] == "https://":
print "URL " + str(url)
r = requests.get(url, verify=True)
html = r.text.encode('utf-8')
#soup = BeautifulSoup(html, "lxml")
soup = BeautifulSoup(html, "html.parser")
# EXTRACT TEXT
# kill all script and style elements
for script in soup(["script", "style"]):
script.extract() # rip it out
# get text
text = soup.get_text().encode('utf-8')
# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)
# manually insert Slavic characters
text = text.replace('ž', 'ž')
text = text.replace('Ž', 'Ž')
text = text.replace('Å¡', 'š')
text = text.replace('Å ', 'Š')
text = text.replace('Ä', 'č')
#text = text.replace('•', '')
# Write url to file.
v.write(url)
# Write delimiter between url and text
v.write("__delimiter_*_between_*_url_*_and_*_text__")
v.write(text)
# Delimiter to separate contents. Stupid way of writing content to file but due to problems with čšž characters ...
v.write("__delimiter_*_between_*_two_*_webpages__")
else:
print "URL ERROR"
print "--- %s seconds ---" % round((time.time() - start_time),2)