Question

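While experimenting with pandas, I noticed some odd behavior in pandas.read_csv and was wondering whether someone with more experience could explain what might be causing it.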

To start, here is the basic class definition I use to create a new pandas.DataFrame from a .csv file:

import pandas as pd

class dataMatrix:
    def __init__(self, filepath):
        self.path = filepath  # File path to the target .csv file.
        self.csvfile = open(filepath)  # Open file.
        self.csvdataframe = pd.read_csv(self.csvfile)

Now, this works well, and calling the class from my __main__.py successfully creates a pandas DataFrame:

from dataMatrix import dataMatrix

testObject = dataMatrix('/path/to/csv/file')

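However, I noticed that this process automatically sets the first row of the .csv as the pandas.DataFrame.columns index. Instead, I decided to number the columns. Since I didn't want to assume I knew the number of columns beforehand, I took the approach of opening the file, loading it into a dataframe, counting the columns, and then reloading the dataframe with the proper number of columns using range().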

import pandas as pd

class dataMatrix:
    def __init__(self, filepath):
        self.path = filepath
        self.csvfile = open(filepath)

        # Load the .csv file to count the columns.
        self.csvdataframe = pd.read_csv(self.csvfile)
        # Count the columns.
        self.numcolumns = len(self.csvdataframe.columns)
        # Re-load the .csv file, manually setting the column names to their 
        # number.
        self.csvdataframe = pd.read_csv(self.csvfile, 
                                        names=range(self.numcolumns))

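Keeping my handling in __main__.py the same, I got back a dataframe with the correct number of columns (500 in this case) and the proper names (0 ... 499), but it was otherwise empty (no row data).

Scratching my head, I decided to close self.csvfile and re-open it before the second read: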

import pandas as pd

class dataMatrix:
    def __init__(self, filepath):
        self.path = filepath
        self.csvfile = open(filepath)

        # Load the .csv file to count the columns.
        self.csvdataframe = pd.read_csv(self.csvfile)
        # Count the columns.
        self.numcolumns = len(self.csvdataframe.columns)

        # Close the .csv file.         #<---- +++++++
        self.csvfile.close()           #<----  Added
        # Re-open file.                #<----  Block
        self.csvfile = open(filepath)  #<---- +++++++

        # Re-load the .csv file, manually setting the column names to their
        # number.
        self.csvdataframe = pd.read_csv(self.csvfile, 
                                        names=range(self.numcolumns))

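Closing the file and re-opening it returned the correct pandas.DataFrame, with columns numbered 0 ... 499 and all 255 subsequent rows of data.

My question is: why does closing the file and re-opening it make a difference?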

Answer 1

When you open a file with

open(filepath)

a file handle iterator is returned. An iterator is good for one pass through its contents. So

self.csvdataframe = pd.read_csv(self.csvfile)

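reads the contents and exhausts the iterator. Subsequent calls to pd.read_csv on the same handle find nothing left to read, so they treat the iterator as empty.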
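The one-pass behavior can be seen without pandas at all. The following is a minimal sketch using only the standard library; the temporary file and its contents are made up for illustration and stand in for the real .csv file.

```python
import os
import tempfile

# Write a small throwaway CSV (hypothetical contents) to a temporary file.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("a,b\n1,2\n3,4\n")
    path = tmp.name

with open(path) as handle:
    first_pass = handle.read()   # reads everything; the position is now at end of file
    second_pass = handle.read()  # nothing is left to read from this position

os.remove(path)

print(repr(first_pass))   # the full file contents
print(repr(second_pass))  # an empty string
```

A second pd.read_csv call on an already-consumed handle is in the same situation as the second read() here: the handle is still open, but its position is at the end of the file.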

Note that you could avoid this problem by just passing the file path to pd.read_csv:

class dataMatrix:
    def __init__(self, filepath):
        self.path = filepath

        # Load the .csv file to count the columns.
        self.csvdataframe = pd.read_csv(filepath)
        # Count the columns.
        self.numcolumns = len(self.csvdataframe.columns)


        # Re-load the .csv file, manually setting the column names to their
        # number.
        self.csvdataframe = pd.read_csv(filepath, 
                                        names=range(self.numcolumns))

Then pd.read_csv will open (and close) the file for you.

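PS. Another option is to reset the file handle to the beginning of the file by calling self.csvfile.seek(0), but using pd.read_csv(filepath, ...) is easier still.

Even better, instead of calling pd.read_csv twice (which is inefficient), you could rename the columns: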
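A rough sketch of the seek(0) option follows. It uses an in-memory io.StringIO object with made-up contents in place of the real file handle, and the variable names are illustrative only.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for open(filepath).
csvfile = io.StringIO("a,b,c\n1,2,3\n4,5,6\n")

# First pass: read once just to count the columns. This consumes the handle.
first = pd.read_csv(csvfile)
numcolumns = len(first.columns)

# Rewind to the start so the second read sees the data again.
csvfile.seek(0)

# Second pass: number the columns. Because names is given and header is not,
# pandas treats the original header line as an ordinary data row.
second = pd.read_csv(csvfile, names=range(numcolumns))

print(second.shape)
print(list(second.columns))
```

Without the seek(0) call, the second read would start from the end of the handle and find no rows.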

class dataMatrix:
    def __init__(self, filepath):
        self.path = filepath

        # Load the .csv file to count the columns.
        self.csvdataframe = pd.read_csv(filepath)
        self.numcolumns = len(self.csvdataframe.columns)
        self.csvdataframe.columns = range(self.numcolumns)
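One detail worth checking before choosing between the approaches: renaming the columns after a default read discards the header line, whereas a read that supplies no header keeps that line as the first data row. The sketch below compares the rename approach with header=None, which is an alternative not mentioned in the answer; the sample data is hypothetical.

```python
import io

import pandas as pd

csv_text = "a,b\n1,2\n3,4\n"  # hypothetical sample with a header line

# Approach from the answer: read once, then overwrite the column labels.
renamed = pd.read_csv(io.StringIO(csv_text))
renamed.columns = range(len(renamed.columns))

# Alternative (an assumption about what is wanted): header=None makes pandas
# number the columns itself and keep the first line as data.
numbered = pd.read_csv(io.StringIO(csv_text), header=None)

print(renamed.shape)   # header line is not counted as a row
print(numbered.shape)  # header line is counted as a row
```

Which result is appropriate depends on whether the first line of the file is a real header or already data.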
