Question

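Edit: solution at the bottom.

I am working on a project where I have to store thousands of images in a Hadoop cluster every week for later analysis. I want to store them in HBase, and I found this nice pipeline for doing so. Before writing to HBase, I wrote a program that converts the images to bytes and stores them in a dataframe. The problem is that when I retrieve an image from the dataframe, its size is larger than the original file, and I can't figure out why.

The images I am working with are about 50 kB each and are saved as JPG. Here is the code that converts the data and stores it in the dataframe: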

import base64
import bz2

import cv2
import pandas as pd

# list_files contains the paths of all the image files
list_bytes = []  # images as base64-encoded bytes
for path in list_files:
    image_original = cv2.imread(path)  # decode the file into a (height, width, 3) uint8 array
    flatten = image_original.flatten()  # flatten the array for compression
    compress = bz2.compress(flatten)  # bzip2-compress the raw pixel bytes
    image_64bytes = base64.b64encode(compress)  # base64-encode the compressed bytes
    list_bytes.append(image_64bytes)
df = pd.DataFrame({'file': list_files, 'bytes': list_bytes})  # store the images in a dataframe along with their metadata

And here is the code that retrieves an image from df:

import numpy as np

decode = base64.b64decode(df['bytes'].iloc[0])  # base64 -> compressed bytes (select the 'bytes' column by name, not by position)
unzip = bz2.decompress(decode)  # compressed bytes -> raw pixel bytes
conversion = np.frombuffer(unzip, dtype=np.uint8)  # raw bytes -> 1-D uint8 array
image_final = np.reshape(conversion, (650, 700, 3))  # restore the original image shape

To verify that image_final is identical to image_original, the following code should return empty arrays:

print((np.where((image_original == image_final ) == False)))

(array([], dtype=int64), array([], dtype=int64), array([], dtype=int64))
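
The same check can be written more compactly with np.array_equal, which returns a single boolean instead of index arrays. The following is only a sketch under assumptions: the real image is replaced by a synthetic array (so image_original here is a hypothetical stand-in, not a file read with cv2.imread), and it goes through the same flatten / bz2 / base64 round trip.

```python
import base64
import bz2

import numpy as np

# Synthetic stand-in for an image loaded with cv2.imread
# (assumption: 650 x 700 pixels, 3 channels, uint8).
image_original = (np.arange(650 * 700 * 3) % 256).astype(np.uint8).reshape(650, 700, 3)

# Same round trip as above: raw pixel bytes -> bz2 -> base64 -> back again.
encoded = base64.b64encode(bz2.compress(image_original.tobytes()))
restored = np.frombuffer(bz2.decompress(base64.b64decode(encoded)), dtype=np.uint8)
image_final = restored.reshape(image_original.shape)

# One boolean for the whole comparison, plus the number of differing values.
identical = np.array_equal(image_original, image_final)
n_mismatches = int(np.count_nonzero(image_original != image_final))
print(identical, n_mismatches)
```

Since bz2 and base64 are both lossless, the pixel arrays should match exactly. That only says something about the decoded pixels, not about the size of any file on disk.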

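I then checked the size of the image as bytes stored in the dataframe, and it appears to be larger than the original (50 kB). I suppose that is to be expected, but the difference is still large.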

import sys

sys.getsizeof(df['bytes'].iloc[0])

382129
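
One plausible way to account for this number: cv2.imread decodes the JPEG into raw pixels, which for a 650 x 700 image with 3 channels is 650 * 700 * 3 = 1,365,000 bytes. bz2 is a lossless general-purpose compressor, so it cannot shrink those pixels as far as the lossy JPEG encoder did, and base64 then turns every 3 bytes into 4 characters. The sketch below walks through that arithmetic with made-up pixel data (raw_pixels is a hypothetical stand-in, so the printed sizes will not match the 382129 above).

```python
import base64
import bz2
import sys

# Hypothetical stand-in for the raw pixels of a 650 x 700 BGR image:
# one byte per channel, so height * width * channels bytes in total.
height, width, channels = 650, 700, 3
raw_size = height * width * channels  # 1_365_000 bytes of uncompressed pixels
raw_pixels = bytes(i % 251 for i in range(raw_size))

compressed = bz2.compress(raw_pixels)   # lossless compression of the raw pixels
encoded = base64.b64encode(compressed)  # text-safe encoding of the compressed bytes

# base64 maps every 3 input bytes to 4 output characters (padding included).
expected_b64_len = 4 * ((len(compressed) + 2) // 3)

print("raw pixels:   ", raw_size)
print("after bz2:    ", len(compressed))
print("after base64: ", len(encoded), "(expected", expected_b64_len, ")")
print("sys.getsizeof:", sys.getsizeof(encoded))  # payload plus Python object overhead
```

Note that sys.getsizeof reports the in-memory size of the Python object, which includes a small per-object overhead on top of the payload, so len() is the better measure of what would actually be stored.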

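Similarly, if I save image_final to disk with cv2.imwrite(file_path, image_final), the file is 80 kB as JPG and 550 kB as PNG. If image_original and image_final are identical, why do they have different sizes on disk? This will certainly cause problems later, when all the images are loaded for analysis.

Thanks in advance for your help.

Note: I also tried using cv2.imencode('.png', image_original)[1] / cv2.imdecode(conversion, cv2.IMREAD_COLOR) instead of image_original.flatten() / np.reshape(conversion, (650, 700, 3)), but the results were very similar.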

Edit: instead of loading the image and converting it to bytes, the file can be read directly as bytes and saved into the dataframe:

import base64
import bz2

import pandas as pd

# list_files contains the paths of all the image files
list_bytes = []
for path in list_files:
    with open(path, "rb") as in_file:  # open for [r]eading as [b]inary
        data = in_file.read()  # raw file bytes, exactly as stored on disk
    compress = bz2.compress(data)  # compress the file bytes
    to_64bytes = base64.b64encode(compress)  # base64-encode the compressed bytes
    to_str = to_64bytes.decode()  # convert to str for storage
    list_bytes.append(to_str)
df = pd.DataFrame({'file': list_files, 'bytes': list_bytes})  # store in a dataframe along with the metadata

Then, to read an image back:

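import base64
import bz2

import cv2
import matplotlib.pyplot as plt
import numpy as np

s = df['bytes'].iloc[0]  # cell containing the stored string to retrieve

decode = base64.b64decode(s)  # base64 string -> compressed bytes
unzip = bz2.decompress(decode)  # compressed bytes -> original file bytes
conversion = np.frombuffer(unzip, dtype=np.uint8)  # wrap the file bytes in a 1-D uint8 array
img = cv2.imdecode(conversion, cv2.IMREAD_COLOR)  # decode the image file into a BGR pixel array

plt.imshow(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))  # OpenCV uses BGR; matplotlib expects RGB
plt.show()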
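
To confirm that this approach returns the file unchanged, the restored bytes can be compared with the bytes on disk. The sketch below is a self-contained illustration under assumptions: it writes a temporary file filled with pseudo-random bytes as a hypothetical stand-in for a roughly 50 kB JPEG (JPEG data is already compressed, so it behaves much like random data), and the file name example.jpg is made up.

```python
import base64
import bz2
import hashlib
import os
import random
import tempfile

# Hypothetical stand-in for a ~50 kB JPEG: high-entropy bytes.
rng = random.Random(0)
payload = bytes(rng.getrandbits(8) for _ in range(50_000))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.jpg")  # hypothetical file name
    with open(path, "wb") as out_file:
        out_file.write(payload)

    # Store: file bytes -> bz2 -> base64 -> str
    with open(path, "rb") as in_file:
        data = in_file.read()
    stored = base64.b64encode(bz2.compress(data)).decode()

    # Load: str -> base64 decode -> bz2 decompress -> original file bytes
    restored = bz2.decompress(base64.b64decode(stored))

# Compare digests of the restored bytes and the bytes originally written.
same_digest = hashlib.sha256(restored).hexdigest() == hashlib.sha256(payload).hexdigest()
bz2_size = len(bz2.compress(payload))
print("byte-identical:", same_digest)
print("file bytes:", len(payload), "after bz2:", bz2_size, "stored chars:", len(stored))
```

With already-compressed input, bz2 tends to gain little or nothing, so the stored string ends up at roughly the file size plus the one-third overhead of base64. Whether the bz2 step is worth keeping for JPEG files is something to measure on the real data.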

Image compression/decompression changes file size

0 answers: