Question

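I'm practicing pandas and working on the following task:

Create a list whose elements are the number of columns in each .csv file

The .csv files are stored in a dictionary, directory, keyed by year

I use a dictionary comprehension, dataframes (again keyed by year), to store the .csv files as pandas DataFrames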

import pandas

directory = {2009: 'path_to_file/data_2009.csv', ... , 2018: 'path_to_file/data_2018.csv'}

dataframes = {year: pandas.read_csv(file) for year, file in directory.items()}

# My Approach 1 
columns = [df.shape[1] for year, df in dataframes.items()]

# My Approach 2
columns = [dataframes[year].shape[1] for year in dataframes]

Which way is more "Pythonic"? Or is there a better way to approach this problem?

Answer 1

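Your approach will get the job done... but I don't like reading the whole file and creating a DataFrame just to count columns. You can do the same thing by reading the first line of each file and counting the commas. Note that I add 1 because there is always one fewer comma than there are columns.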

columns = [open(f).readline().count(',') + 1 for _, f in directory.items()]
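Counting commas assumes that no header field contains a quoted comma, and the one-liner leaves closing the file to the garbage collector. A minimal sketch of the same idea using the standard csv module, which handles quoted fields and closes each file explicitly, might look like this (count_columns is a hypothetical helper name, not part of the original answer):

```python
import csv

def count_columns(path):
    """Return the number of fields in the first row of a CSV file."""
    # newline="" is what the csv module recommends when opening files for it.
    with open(path, newline="") as handle:
        # Read only the first row; an empty file yields an empty header.
        header = next(csv.reader(handle), [])
    return len(header)

# With the `directory` mapping from the question, usage would be:
# columns = [count_columns(path) for path in directory.values()]
```

Only the first row of each file is read, so the cost stays independent of file size.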

Answer 2

Your approach 2:

columns = [dataframes[year].shape[1] for year in dataframes]

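is more Pythonic and concise, and it keeps the DataFrames available for later merging, plotting, manipulation, and so on. The key is implicit in the comprehension, and shape gives the number of columns.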

Answer 3

You can use:

columns = [len(dataframe.columns) for dataframe in dataframes.values()]

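As @piRSquared mentioned, if your only goal is to get the number of columns in the DataFrames, you shouldn't read the whole csv file. Use the nrows keyword argument of the read_csv function instead.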
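As a rough sketch of the nrows suggestion, assuming each file has a header row: passing nrows=0 makes read_csv parse the header only and return an empty DataFrame that still carries the column labels. The helper names column_count and column_counts are made up for illustration.

```python
import pandas

def column_count(path):
    """Number of columns in a CSV file, parsing only its header row."""
    # nrows=0 requests zero data rows, so only the header line is parsed.
    return pandas.read_csv(path, nrows=0).shape[1]

def column_counts(directory):
    """Column counts for a {year: path} mapping, in the mapping's own order."""
    return [column_count(path) for path in directory.values()]
```

With the directory dictionary from the question, columns = column_counts(directory) would give the same list as the two original approaches without loading any data rows.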

Answer 4

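import os
# Use this to find files under a certain directory; filter the listing if there are other files.
target_dir = 'path_to_file/'
target_files = os.listdir(target_dir)
columns = list()
for filename in target_files:
    # In your scenario @piRSquared's answer would be more efficient:
    # read only the header line and count the commas.
    with open(os.path.join(target_dir, filename)) as f:
        columns.append(f.readline().count(',') + 1)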

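If you want keys by year taken from the filename, you can filter the filename and update the dictionary like this:

import re
# str.replace does not interpret regular expressions, so use re.sub to strip the non-digits.
year = int(re.sub(r'[^0-9]', '', filename))  # 'data_2009.csv' -> 2009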
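Putting the directory listing and the year extraction together, a minimal sketch that rebuilds a year-keyed dictionary of paths could look like the following. It assumes every relevant filename ends in .csv and contains exactly one four-digit year, as in data_2009.csv; files_by_year is a hypothetical helper name.

```python
import os
import re

def files_by_year(folder):
    """Map year -> path for each .csv file in `folder` whose name contains a four-digit year."""
    result = {}
    for filename in sorted(os.listdir(folder)):
        match = re.search(r"\d{4}", filename)
        # Skip other file types and names without a recognizable year.
        if filename.endswith(".csv") and match:
            result[int(match.group())] = os.path.join(folder, filename)
    return result
```

The result has the same shape as the directory dictionary in the question, so either of the comprehensions above can consume it unchanged.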

Pythonic way to loop over a dictionary