Question

I have two files. Both have the following content:

file1.csv:

label,text,is_valid
negative,"hi there",False
negative,"hello hi",False


file2.csv:

label,text,is_valid
negative,"hi there",False
negative,"hello hi",False
... 1000 such rows

When I run pd.read_csv('filex.csv') on them to create df1 and df2 from file1 and file2 respectively, dfx.info() gives the following:

df1.info():

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 3 columns):
label       2 non-null int64
text        2 non-null object
is_valid    2 non-null bool
dtypes: bool(1), int64(1), object(1)
memory usage: 114.0+ bytes

df2.info():

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 450 to 647
Data columns (total 3 columns):
label       1000 non-null object
text        1000 non-null object
is_valid    1000 non-null bool
dtypes: bool(1), object(2)
memory usage: 24.4+ KB

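I created file 1 myself and got file 2 from someone else. Their contents look similar, but after running pd.read_csv on each, df.info() differs between them. I need to pass the file to a library that will call pd.read_csv('file.csv', header='infer') on it. In other words, I cannot explicitly specify dtype, etc. How can I generate file 1 so that the df produced from it has the same format as df2?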
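To narrow down where the two frames diverge, a column-by-column dtype comparison may help. The sketch below is only an illustration: the two in-memory CSV strings are hypothetical stand-ins for the real files, and it assumes (based on `label` showing as int64 in the df1.info() output above) that file 1 on disk may actually hold numeric labels rather than the string `negative`. The helper name `dtype_mismatches` is made up for this example.

```python
import io

import pandas as pd

# Hypothetical stand-ins for the two files (not the asker's real data).
# Assumption: one file stores numeric labels, the other stores string labels.
csv_numeric_labels = 'label,text,is_valid\n0,"hi there",False\n1,"hello hi",False\n'
csv_string_labels = (
    'label,text,is_valid\nnegative,"hi there",False\nnegative,"hello hi",False\n'
)

# Read both exactly the way the downstream library would: no dtype hints.
df_a = pd.read_csv(io.StringIO(csv_numeric_labels), header='infer')
df_b = pd.read_csv(io.StringIO(csv_string_labels), header='infer')


def dtype_mismatches(left, right):
    """Return {column: (left dtype, right dtype)} for shared columns whose dtypes differ."""
    return {
        col: (str(left[col].dtype), str(right[col].dtype))
        for col in left.columns.intersection(right.columns)
        if left[col].dtype != right[col].dtype
    }


mismatches = dtype_mismatches(df_a, df_b)
print(mismatches)  # only the columns that were inferred differently
```

Since read_csv infers each column's dtype from the values it finds, a column containing only integers comes back as int64, while a column containing any non-numeric text comes back as a string-like dtype. If that is what happened here, writing the string labels into file 1 (for example with `DataFrame.to_csv('file1.csv', index=False)`) would be one way to get the same inferred dtypes without passing `dtype` to the reader.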

Pandas read_csv cannot infer the same metadata for similar files

0 answers: