I'm trying to write a script that uses praw to download images from Reddit, save them to a folder of my choice, and export the results to a .csv file.
I think I have the image-download part coded correctly, but when I try to run the script I just get an "arrays must all be same length" error.
I think it may have something to do with the "path" field in the dictionary, but the loops look like they append the information correctly, so I'm not sure. I'm missing 2 entries from "path" and I can't figure out where they went.
My code is as follows:
#! python3
import os
import praw
import pandas as pd
import requests

path = 'C:\\Scripts\\IMG\\'

# Reddit API tokens
reddit = praw.Reddit(client_id='x',
                     client_secret='x',
                     user_agent='x',
                     username='x',
                     password='x')

x_dict = {"id": [],
          "title": [],
          "url": [],
          "path": []}

submissions = reddit.subreddit('x').hot(limit=100)

for submission in submissions:
    x_dict["id"].append(submission.id)
    x_dict["title"].append(submission.title)
    x_dict["url"].append(submission.url)
    if submission.url.endswith(".gifv"):
        submission.url = submission.url.replace('.com/', '.com/download/')
        submission.url = submission.url + ".mp4"
        r = requests.get(submission.url, allow_redirects=True)
        if "gif" in r.headers['Content-Type']:
            dir2 = os.path.join(path, submission.id + ".gif")
            submission.url = submission.url + ".gif"
            open(dir2, 'wb').write(r.content)
            print("downloading " + submission.id + " to " + dir2)
            x_dict["path"].append(dir2)
        else:
            dir2 = os.path.join(path, submission.id + ".mp4")
            open(dir2, 'wb').write(r.content)
            print("downloading " + submission.id + " to " + dir2)
            x_dict["path"].append(dir2)
    elif "gfycat" in submission.url:
        if "https://" in submission.url:
            dir2 = os.path.join(path, submission.id + ".mp4")
            submission.url = submission.url.replace('https://', 'https://giant.')
            submission.url = submission.url + ".mp4"
            r = requests.get(submission.url, allow_redirects=True)
            open(dir2, 'wb').write(r.content)
            print("downloading " + submission.id + " to " + dir2)
            x_dict["path"].append(dir2)
        else:
            dir2 = os.path.join(path, submission.id + ".mp4")
            submission.url = submission.url.replace('http://', 'http://giant.')
            submission.url = submission.url + ".mp4"
            r = requests.get(submission.url, allow_redirects=True)
            open(dir2, 'wb').write(r.content)
            print("downloading " + submission.id + " to " + dir2)
            x_dict["path"].append(dir2)
    elif "i.redd" in submission.url:
        if submission.url.endswith(".jpg"):
            dir2 = os.path.join(path, submission.id + ".jpg")
            r = requests.get(submission.url, allow_redirects=True)
            open(dir2, 'wb').write(r.content)
            print("downloading " + submission.id + " to " + dir2)
            x_dict["path"].append(dir2)
        elif submission.url.endswith(".jpeg"):
            dir2 = os.path.join(path, submission.id + ".jpeg")
            r = requests.get(submission.url, allow_redirects=True)
            open(dir2, 'wb').write(r.content)
            print("downloading " + submission.id + " to " + dir2)
            x_dict["path"].append(dir2)
        elif submission.url.endswith(".png"):
            dir2 = os.path.join(path, submission.id + ".png")
            r = requests.get(submission.url, allow_redirects=True)
            open(dir2, 'wb').write(r.content)
            print("downloading " + submission.id + " to " + dir2)
            x_dict["path"].append(dir2)
    elif "v.redd" in submission.url:
        dir2 = os.path.join(path, submission.id + ".mp4")
        r = requests.get(submission.media['reddit_video']['fallback_url'], allow_redirects=True)
        open(dir2, 'wb').write(r.content)
        print("downloading " + submission.id + " to " + dir2)
        x_dict["path"].append(dir2)
    elif submission.url is None:
        print("\\ " + submission.id + " url is none")
        x_dict["path"].append('')
    else:
        print("\\" + submission.id + " not supported")
        x_dict["path"].append('')
        continue

print(len(x_dict["id"]))
print(len(x_dict["title"]))
print(len(x_dict["url"]))
print(len(x_dict["path"]))

x_data = pd.DataFrame(x_dict)
x_data.to_csv(os.path.join(path, 'xscrape.csv'))
The output is as follows:
downloading 99rdbf to C:\Scripts\IMG\99rdbf.jpg
100
100
100
98
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-434-0d78dff7cb84> in <module>()
89 print (len(x_dict["url"]))
90 print (len(x_dict["path"]))
---> 91 x_data = pd.DataFrame(x_dict)
92 x_data.to_csv(os.path.join(path,'xscrape.csv'))
d:\Users\localuser\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py in __init__(self, data, index, columns, dtype, copy)
346 dtype=dtype, copy=copy)
347 elif isinstance(data, dict):
--> 348 mgr = self._init_dict(data, index, columns, dtype=dtype)
349 elif isinstance(data, ma.MaskedArray):
350 import numpy.ma.mrecords as mrecords
d:\Users\localuser\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py in _init_dict(self, data, index, columns, dtype)
457 arrays = [data[k] for k in keys]
458
--> 459 return _arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
460
461 def _init_ndarray(self, values, index, columns, dtype=None, copy=False):
d:\Users\localuser\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py in _arrays_to_mgr(arrays, arr_names, index, columns, dtype)
7313 # figure out the index, if necessary
7314 if index is None:
-> 7315 index = extract_index(arrays)
7316
7317 # don't force copy because getting jammed in an ndarray anyway
d:\Users\localuser\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py in extract_index(data)
7359 lengths = list(set(raw_lengths))
7360 if len(lengths) > 1:
-> 7361 raise ValueError('arrays must all be same length')
7362
7363 if have_dicts:
ValueError: arrays must all be same length
Answer (score: 0)
The core problem here is the design of your data structure: it makes it easy to fall into programming errors, rather than helping to prevent them.
In this answer I'm going to use a standard programming trick: I won't even try to track down the problem in your current code; instead I'll reorganize things so that the problem can no longer occur.
In a CSV file, each row is a sequence of closely related items, and the file as a whole is in turn a sequence of those rows. You want the most closely related items to stay together in your data structure. So, of the two levels of sequence, the "inner" data structure should be the sequence of fields in one row, and the "outer" data structure should be the sequence of rows. This is the opposite of what you did.
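To make the contrast concrete, here is a minimal sketch (with invented values) of the two layouts: the dict-of-lists you have now, versus the rows-of-fields structure described above.

```python
# Dict of lists ("fields on the outside"): each column grows independently,
# so one missed append silently misaligns every later row.
by_field = {"id": ["a1", "b2"], "title": ["first", "second"], "path": ["a1.jpg"]}

# List of rows ("fields on the inside"): a row is either complete and in
# the list, or absent entirely; its fields cannot drift out of sync.
by_row = [
    ("a1", "first", "a1.jpg"),
    ("b2", "second", "b2.jpg"),
]

# The misalignment is easy to detect in the first layout:
lengths = {len(v) for v in by_field.values()}
print(lengths)  # more than one distinct length means the columns are out of sync
```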
In Python there are two very common sequence data structures: list, which you already know and are using here, and tuple, which is similar to a list but immutable.
For this program it's worth learning and understanding the namedtuple data structure: a tuple extended with field names and a constructor that ensures you always pass the same number of arguments. That constructor is another data-structure design decision that will help you avoid programming errors.
Define the data structure for a CSV row as follows:
from collections import namedtuple
Download = namedtuple('Download', 'id title url path')
(It's worth typing this directly into a Python interpreter (python -i or ipython) and playing around until you're comfortable creating and displaying namedtuples.)
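For example, such a session might look like the following sketch; all the values here are invented for illustration.

```python
from collections import namedtuple

Download = namedtuple('Download', 'id title url path')

# The constructor enforces the arity: exactly four arguments, no more, no fewer.
d = Download('99rdbf', 'A title', 'https://i.redd.it/x.jpg', 'C:\\Scripts\\IMG\\99rdbf.jpg')

print(d.id)   # fields are accessible by name...
print(d[1])   # ...and by position, like any tuple
print(d)      # the repr shows the field names alongside the values

# Download('too', 'few')  # would raise TypeError: missing required arguments
```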
You can then build up a list of these as you download. Because tuples are immutable, each one has to be built with a single constructor call, so you can create it only once you have all the information it needs; then you append it to the list.
def download(id, url):
    # All the stuff you need to do an individual download goes here.
    return path

downloads = []
for s in submissions:
    path = download(s.id, s.url)
    dl = Download(s.id, s.title, s.url, path)
    downloads.append(dl)
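As a sketch of how the body of download() might absorb one branch of the question's original logic (the i.redd direct-image case; the other branches would follow the same pattern). IMG_DIR and the exact branching are assumptions for illustration, not a definitive implementation; the key point is that every call returns something, so no row can end up shorter than the others.

```python
import os
import requests

IMG_DIR = 'C:\\Scripts\\IMG'  # assumed download folder, as in the question

def download(id, url):
    """Download one submission; return the saved path, or '' if unsupported."""
    if url and "i.redd" in url and url.endswith((".jpg", ".jpeg", ".png")):
        ext = os.path.splitext(url)[1]
        file_path = os.path.join(IMG_DIR, id + ext)
        r = requests.get(url, allow_redirects=True)
        with open(file_path, 'wb') as f:
            f.write(r.content)
        return file_path
    return ''  # unsupported URL: still return a value, never skip a row
```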
You don't need Pandas to write a CSV file; the csv module in the standard library does the job nicely. Working from the examples in its documentation:
import csv

with open(os.path.join(path, 'xscrape.csv'), 'w', newline='') as out:
    writer = csv.writer(out)
    writer.writerows(downloads)
(This produces a CSV file without a header row; adding one is left as an exercise for the reader.)