Question

I am currently working on a project where I have to load a fairly large .csv file (2.5 million rows) and run some error handling on every row.

So far, I load my .csv file into the "dataFrame" variable:

import numpy as np
import pandas as pd

#Load the datafile into DataFrame
dataFrame = pd.read_csv(filename,header=None,
    names=["year", "month", "day", "hour", "minute", "second", "zone1", "zone2", "zone3", "zone4"])

Then I iterate over every row in dataFrame and perform my error handling, for example:

#Check rows for corrupted measurements
for i in range(len(dataFrame)+1):

    #Define the row
    try:    
        row = np.array(dataFrame.iloc[i,:], dtype=object)
    except IndexError:
        continue

    #If condition to check if there are corrupted measurements
    if not -1 in row:
        continue

    #Check fmode, ignore upper- or lowercase
    #forward fill
    if fmode.lower() in fmodeStr[0]:
        (Error handling)

    elif fmode.lower() in fmodeStr[1]:
        (Error handling)

    elif fmode.lower() in fmodeStr[2]:
        (Error handling)

Here fmode is just a string that specifies which kind of error handling the user wants to perform.

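As of now, the code works fine with a modest number of rows (1,000-5,000). But when the .csv file has 2.5 million rows, it takes a very long time to finish. That is hardly surprising, since I am looping over every row of a 2.5-million-row file.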

I would like to know which solution is best suited to loading a csv file of this size while also performing some operations on each row.

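So far I have looked into:

- Writing a generator function that loads one row of the .csv file at a time, processes it, and stores it in a numpy matrix
- Loading the .csv file with the chunksize option and concatenating the chunks at the end
- Vectorized computation (however, the error handling involves replacing a corrupted row with the valid row before or after it)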
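For what it is worth, replacing a corrupted row with the valid row before or after it does not rule out vectorization. Below is a minimal sketch, not the project's actual implementation. The function name `handle_corrupted` and the mode strings "forward fill", "backward fill" and "drop" are assumptions (the contents of fmodeStr are not shown), as is the choice to drop rows that have no valid neighbour in the fill direction. It also copies the whole neighbouring row, timestamp columns included, following the wording above.

```python
import numpy as np
import pandas as pd


def handle_corrupted(df, fmode):
    """Replace or drop rows containing -1 without a Python-level loop.

    The mode names are assumptions made for this sketch.
    """
    # True for every row that holds at least one corrupted measurement (-1)
    corrupted = (df == -1).any(axis=1).to_numpy()
    n = len(df)
    pos = np.arange(n)
    mode = fmode.lower()

    if mode == "forward fill":
        # Position of the latest valid row at or before each row (-1 if none)
        src = np.maximum.accumulate(np.where(corrupted, -1, pos))
        keep = src >= 0
    elif mode == "backward fill":
        # Position of the next valid row at or after each row (n if none)
        src = np.minimum.accumulate(np.where(corrupted, n, pos)[::-1])[::-1]
        keep = src < n
    elif mode == "drop":
        src, keep = pos, ~corrupted
    else:
        raise ValueError(f"unknown fmode: {fmode!r}")

    # Assumption: rows with no valid neighbour in the fill direction are dropped
    return df.iloc[src[keep]].reset_index(drop=True)
```

Because rows are selected by position instead of being overwritten with NaN, the integer columns keep their dtype.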

Maybe a combination of the above is possible? In any case, thank you for your time :)
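One possible combination is to read the file in chunks and apply a vectorized forward fill to each chunk, carrying the last valid row over the chunk boundary. The sketch below rests on several assumptions: the function name `load_forward_filled` and the default chunk size are hypothetical, only forward fill is shown, and corrupted rows at the very start of the file (with no earlier valid row) are dropped.

```python
import io

import numpy as np
import pandas as pd

COLUMNS = ["year", "month", "day", "hour", "minute", "second",
           "zone1", "zone2", "zone3", "zone4"]


def load_forward_filled(source, chunksize=100_000):
    """Read a csv in chunks and forward-fill rows that contain -1."""
    cleaned = []
    last_valid = None  # one-row DataFrame holding the latest valid row seen

    for chunk in pd.read_csv(source, header=None, names=COLUMNS,
                             chunksize=chunksize):
        carried = last_valid is not None
        if carried:
            # Prepend the carried row so the fill can cross the chunk boundary
            chunk = pd.concat([last_valid, chunk], ignore_index=True)

        corrupted = (chunk == -1).any(axis=1).to_numpy()
        pos = np.arange(len(chunk))
        # Position of the latest valid row at or before each row (-1 if none)
        src = np.maximum.accumulate(np.where(corrupted, -1, pos))
        filled = chunk.iloc[src[src >= 0]]

        if carried:
            filled = filled.iloc[1:]  # remove the prepended carry row again
        if len(filled) > 0:
            last_valid = filled.iloc[[-1]]  # always a copy of a valid row
        cleaned.append(filled)

    return pd.concat(cleaned, ignore_index=True)


# Usage with a small in-memory file; a filename works the same way
sample = "2020,1,1,0,0,0,1,2,3,4\n2020,1,1,1,0,0,-1,2,3,4\n"
result = load_forward_filled(io.StringIO(sample), chunksize=1)
```

Since the chunks are concatenated at the end, this does not lower the final memory footprint. If memory is the bottleneck, each cleaned chunk could be written out instead of collected.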

For those who are interested or need more detail, here is the full code: https://github.com/danmark2312/Project-Electricity/blob/test/functions/dataLoad.py

Python - Processing many rows in a csv file

0 answers: