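I'm writing a script that downloads some mp3 podcasts from a website and writes them to a particular location. I'm almost done: the files are being downloaded and created. However, I've run into a problem where the binary data can't be fully decoded and the mp3 files won't play.
Here is my code: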
import re
import os
import urllib2
from bs4 import BeautifulSoup
import time

def getHTMLstring(url):
    html = urllib2.urlopen(url)
    soup = BeautifulSoup(html)
    soupString = soup.encode('utf-8')
    return soupString

def getList(html_string):
    urlList = re.findall('(http://podcast\.travelsinamathematicalworld\.co\.uk\/mp3/.*\.mp3)', html_string)
    firstUrl = urlList[0]
    finalList = [firstUrl]
    for url in urlList:
        if url != finalList[0]:
            finalList.insert(0,url)
    return finalList

def getBinary(netLocation):
    req = urllib2.urlopen(netLocation)
    reqSoup = BeautifulSoup(req)
    reqString = reqSoup.encode('utf-8')
    return reqString

def getFilename(string):
    splitTerms = string.split('/')
    fileName = splitTerms[-1]
    return fileName

def writeFile(sourceBinary, fileName):
    with open(fileName, 'wb') as fp:
        fp.write(sourceBinary)

def main():
    htmlString = getHTMLstring('http://www.travelsinamathematicalworld.co.uk')
    urlList = getList(htmlString)
    fileFolder = 'D:\\Dropbox\\Mathematics\\Travels in a Mathematical World\\Podcasts'
    os.chdir(fileFolder)
    for url in urlList:
        name = getFilename(url)
        binary = getBinary(url)
        writeFile(binary, name)
        time.sleep(2)

if __name__ == '__main__':
    main()
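When I run the code, I get the following warning in the console:
WARNING:root:Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
I think it has something to do with the fact that the data I'm working with is encoded as UTF-8, and maybe the write method expects a different encoding? I'm new to Python (and to programming in general, really), and I'm stuck.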
Answer 0 (score: 2)
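Suppose you want to download some mp3 files from a set of URLs. You can use BeautifulSoup to retrieve those URLs from the page, but you don't need BeautifulSoup to parse what the URLs point to. You just need to save the response directly.
For example,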
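import urllib2

url = 'http://acl.ldc.upenn.edu/P/P96/P96-1004.pdf'
fileName = url.split('/')[-1]  # same idea as getFilename() in the question
res = urllib2.urlopen(url)
# Write the raw response bytes unchanged: no parsing, no re-encoding.
with open(fileName, 'wb') as fp:
    fp.write(res.read())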
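For larger files such as podcast episodes, reading the whole response into memory with `read()` may be more than you need. Below is a minimal sketch of the same idea that copies the response to disk in chunks. It assumes Python 3, where `urllib2` was folded into `urllib.request`; `download_binary` and `chunk_size` are illustrative names, not part of the original script.

```python
import shutil
import urllib.request


def download_binary(url, file_name, chunk_size=64 * 1024):
    """Copy the raw bytes served at `url` into `file_name` without decoding them.

    Hypothetical helper: the name and the default chunk size are assumptions.
    """
    # Open the output in binary mode ('wb') so no text encoding is applied.
    with urllib.request.urlopen(url) as response, open(file_name, 'wb') as fp:
        # Stream the body chunk by chunk instead of loading it all at once.
        shutil.copyfileobj(response, fp, chunk_size)
```

In the question's `main()`, a call such as `download_binary(url, getFilename(url))` could stand in for the `getBinary` and `writeFile` pair.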
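If I instead pass that PDF URL to BeautifulSoup:
reqSoup = BeautifulSoup('http://acl.ldc.upenn.edu/P/P96/P96-1004.pdf')
then reqSoup is not the PDF file but an HTML document. BeautifulSoup treats its argument as markup, so the URL string is simply wrapped in tags. In fact, it is:
<html><body><p>http://acl.ldc.upenn.edu/P/P96/P96-1004.pdf</p></body></html>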
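To see why the mp3 files in the question end up unplayable, here is a small sketch of what happens when arbitrary bytes are forced through a text decode and then re-encoded. It is a simplified stand-in for the BeautifulSoup plus `encode('utf-8')` round trip in `getBinary`, not BeautifulSoup's actual encoding detection. The sample bytes are only an illustrative MP3-like header, and the code assumes Python 3.

```python
def roundtrip_through_text(raw):
    """Decode bytes as UTF-8 text (replacing invalid sequences), then re-encode.

    Hypothetical helper that mimics treating binary data as if it were text.
    """
    text = raw.decode('utf-8', errors='replace')  # invalid bytes become U+FFFD
    return text.encode('utf-8')


# Illustrative bytes resembling the start of an MPEG audio frame (assumption).
sample = b'\xff\xfb\x90\x00'
damaged = roundtrip_through_text(sample)

print(damaged == sample)          # False: the original bytes are gone
print(len(sample), len(damaged))  # 4 10: each bad byte became 3 bytes
```

Every byte that is not valid UTF-8 is replaced by the REPLACEMENT CHARACTER mentioned in the warning, so the written file no longer matches what the server sent. Writing `res.read()` straight to a file opened with `'wb'` avoids the round trip entirely.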