I have this code that iterates through a txt file of URLs and searches for files to download:
# Imports inferred from the calls below (the original post omits them);
# downloadtools and the Percentage progress callback are assumed to be defined elsewhere.
import csv
import os
import urlparse
from re import compile
from urllib import urlopen, urlretrieve
from bs4 import BeautifulSoup as bs

URLS = open("urlfile.txt").readlines()

def downloader():
    with open('data.csv', 'w') as csvfile:
        writer = csv.writer(csvfile)
        for url in downloadtools.URLS:
            try:
                html_data = urlopen(url)
            except:
                print 'Error opening URL: ' + url
                pass
            #Creates a BS object out of the open URL.
            soup = bs(html_data)
            #Parsing the URL for later use
            urlinfo = urlparse.urlparse(url)
            domain = urlparse.urlunparse((urlinfo.scheme, urlinfo.netloc, '', '', '', ''))
            path = urlinfo.path.rsplit('/', 1)[0]
            FILETYPE = ['\.pdf$', '\.ppt$', '\.pptx$', '\.doc$', '\.docx$', '\.xls$', '\.xlsx$', '\.wmv$', '\.mp4$', '\.mp3$']
            #Loop iterates through list of file types for open URL.
            for types in FILETYPE:
                for link in soup.findAll(href=compile(types)):
                    urlfile = link.get('href')
                    filename = urlfile.split('/')[-1]
                    #Bumps a _N suffix until the filename is unused locally.
                    while os.path.exists(filename):
                        try:
                            fileprefix = filename.split('_')[0]
                            filetype = filename.split('.')[-1]
                            num = int(filename.split('.')[0].split('_')[1])
                            filename = fileprefix + '_' + str(num + 1) + '.' + filetype
                        except:
                            filetype = filename.split('.')[1]
                            fileprefix = filename.split('.')[0] + '_' + str(1)
                            filename = fileprefix + '.' + filetype
                    #Creates a full URL if needed.
                    if '://' not in urlfile and not urlfile.startswith('//'):
                        if not urlfile.startswith('/'):
                            urlfile = urlparse.urljoin(path, urlfile)
                        urlfile = urlparse.urljoin(domain, urlfile)
                    #Downloads the urlfile or returns error for manual inspection
                    try:
                        urlretrieve(urlfile, filename, Percentage)
                        writer.writerow(['SUCCESS', url, urlfile, filename])
                        print " SUCCESS"
                    except:
                        print " ERROR"
                        writer.writerow(['ERROR', url, urlfile, filename])
Everything works fine except that no data is written to the CSV. No directories are changed (as far as I know, at least...).

The script runs through the external list of URLs, finds the files, downloads them correctly, and prints "SUCCESS" or "ERROR" without a problem. The only thing it doesn't do is write the data to the CSV file. It runs all the way through without writing any CSV data.

I tried running it in a virtualenv to make sure there weren't any weird package issues.

Could my nested loops be preventing the CSV data from being written?
Answer 0 (score: 2)
Try with open('data.csv', 'wb') as csvfile: instead.

http://docs.python.org/2/tutorial/inputoutput.html#reading-and-writing-files
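Applied to the code in the question, only the mode passed to open() changes; a minimal sketch (the row contents here are just placeholders to show the writer is used the same way):

import csv

# The file mode changes from 'w' to 'wb', which is what the Python 2 csv docs
# recommend; the rest of downloader() would stay exactly as in the question.
with open('data.csv', 'wb') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['SUCCESS', 'http://example.com', 'http://example.com/a.pdf', 'a.pdf'])  # placeholder row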
Alternatively, build an iterable of rows and use writerows instead of writerow. If you run the script in interactive mode, you can inspect the contents of that iterable of rows (i.e. [['SUCCESS', ...], ['SUCCESS', ...], ...]).
import csv

with open('some.csv', 'wb') as f:
    writer = csv.writer(f)
    writer.writerows(someiterable)
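For instance, a small stand-alone sketch (the rows are made-up placeholders for whatever the download loop would actually collect):

import csv

# Collect one row per download attempt in a list, then write them all at once.
rows = [
    ['SUCCESS', 'http://example.com/page', 'http://example.com/file.pdf', 'file.pdf'],
    ['ERROR', 'http://example.com/other', 'http://example.com/file.ppt', 'file.ppt'],
]

with open('some.csv', 'wb') as f:
    writer = csv.writer(f)
    writer.writerows(rows)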
Answer 1 (score: 0)
So I let the script run all the way through, and for some reason the data started being written to the CSV after it had been running for a while. I don't know how to explain that. Was the data sitting in memory somewhere and then written out at some arbitrary point? I don't know, but the data is accurate when compared against the log printed in the terminal.

Weird.
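For what it's worth, that behaviour matches ordinary file buffering: writerow() only places the row in Python's I/O buffer, and it reaches the file once the buffer fills up or the file is closed. A toy sketch of forcing rows out immediately after each write (flush() and os.fsync() are standard library calls; the row contents are placeholders):

import csv
import os

# Flushing after each writerow() makes rows show up in the file right away,
# instead of only when the buffer fills or the file is closed at the end.
with open('data.csv', 'wb') as csvfile:
    writer = csv.writer(csvfile)
    for i in range(3):
        writer.writerow(['SUCCESS', 'placeholder row %d' % i])
        csvfile.flush()              # empty Python's internal buffer
        os.fsync(csvfile.fileno())   # optionally ask the OS to commit it to disk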