Question

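I am working with a modified version of the Adult dataset and want to save it as a CSV. However, after saving it and reloading it for reuse, the data does not round-trip correctly: the headers are not preserved and some columns now appear merged. I have searched this site and elsewhere and tried what I found, but nothing worked. I load the data with the following code: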

import numpy as np  # Import the necessary packages
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import *

# Read the data from a freely and easily available source on the internet
url2 = "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
# skipinitialspace=True removes the extra spaces in the columns
Adult = pd.read_csv(url2, header=None, skipinitialspace=True)
# Assign reasonable column names to the DataFrame
Adult.columns = ["age", "workclass", "fnlwgt", "education", "educationnum", "maritalstatus", "occupation",
                 "relationship", "race", "sex", "capitalgain", "capitalloss", "hoursperweek", "nativecountry",
                 "less50kmoreeq50kn"]

After imputing the missing values and modifying the DataFrame as needed, I tried

df = Adult

df.to_csv('file_name.csv',header = True)

df.to_csv('file_name.csv')

and a few other variations. How can I save the file as a CSV in the correct format, so that it can be used the next time the file is read?

When reloading the data, I use the following code:

import pandas as pd
df = pd.read_csv('file_name.csv')

When I run df.head, the output is:

<bound method NDFrame.head of        Unnamed: 0  Unnamed: 0.1  age  ... Black  Asian-Pac-Islander Other
0               0             0   39  ...     0                   0     0
1               1             1   50  ...     0                   0     0
2               2             2   38  ...     0                   0     0
3               3             3   53  ...     1                   0     0

and the output of print(df.loc[:, "age"].value_counts()) is:

which should not contain 2 columns.

Answer 1

If you pickle it like this:

Adult.to_pickle('adult.pickle')

you will then be able to read it back in with read_pickle, as follows:

original_adult = pd.read_pickle('adult.pickle')

Hope that helps.
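
As a rough illustration of why pickling sidesteps the problem, the sketch below round-trips a tiny made-up frame. The frame `adult`, its values, and the temporary path are stand-ins, not the real Adult data. Pickle is a Python-specific format and should only be loaded from sources you trust.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the Adult DataFrame (values are made up).
adult = pd.DataFrame(
    {"age": [39, 50, 38], "workclass": ["State-gov", "Private", "Private"]},
    index=[10, 20, 30],
)
adult["workclass"] = adult["workclass"].astype("category")

# Write to a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "adult.pickle")
    adult.to_pickle(path)
    restored = pd.read_pickle(path)

# Index, column names and dtypes all survive the round trip.
print(restored.equals(adult))
print(restored.dtypes)
```

Unlike a CSV, nothing has to be re-inferred on load, so there is no extra index column to deal with.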

Answer 2

If you want to preserve the order of the output columns, you can specify the columns directly when saving the DataFrame:

import pandas as pd

url2 = "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data" 
df = pd.read_csv(url2, header=None, skipinitialspace=True)

my_columns = ["age", "workclass", "fnlwgt", "education", "educationnum", "maritalstatus", "occupation",
             "relationship","race","sex","capitalgain","capitalloss","hoursperweek","nativecountry",
             "less50kmoreeq50kn"]
df.columns = my_columns

# do the computation ...

df[my_columns].to_csv('file_name.csv')

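If you are not interested in saving the DataFrame's row index, pass index=False to to_csv, as in to_csv('file_name.csv', index=False). Otherwise, you need to specify the index_col parameter when reading the CSV file back in.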
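
The two options can be sketched on a small made-up frame. The data below is hypothetical, and the CSV text is kept in memory with io.StringIO instead of a file only so that the example is self-contained.

```python
import io

import pandas as pd

# Hypothetical miniature of the Adult data.
df = pd.DataFrame({"age": [39, 50, 38], "sex": ["Male", "Male", "Female"]})

# Default behaviour: the row index is written as an unnamed first column.
csv_with_index = df.to_csv()
naive = pd.read_csv(io.StringIO(csv_with_index))
print(list(naive.columns))  # the old index shows up as 'Unnamed: 0'

# Option 1: do not write the index at all.
no_index = pd.read_csv(io.StringIO(df.to_csv(index=False)))
print(list(no_index.columns))

# Option 2: write the index, then tell read_csv which column holds it.
restored = pd.read_csv(io.StringIO(csv_with_index), index_col=0)
print(list(restored.columns))
```

Either option gives back exactly the original columns; which one fits depends on whether the index carries information worth keeping.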

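According to the documentation, value_counts() returns a Series object. You see two columns because the first one is the index, which holds the age values (36, 31, ...), and the second one holds the counts (898, 888, etc.).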
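
To make the index/values split concrete, here is a small sketch with invented ages (not the real Adult counts). The column name "count" used at the end is my own choice.

```python
import pandas as pd

# Made-up ages; the real column has many more rows.
ages = pd.Series([36, 31, 36, 36, 31, 50], name="age")
counts = ages.value_counts()

print(type(counts).__name__)   # a Series, not a two-column DataFrame
print(counts.index.tolist())   # the distinct ages, most frequent first
print(counts.tolist())         # how often each age occurs

# If two real columns are wanted, move the index into a column.
table = counts.rename_axis("age").reset_index(name="count")
print(table)
```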

Answer 3

I copied your code and it works for me. The order of the columns is preserved.

Let me show what I tried. I ran this batch of code:

import numpy as np  # Import the necessary packages
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import *

# Read the data from a freely and easily available source on the internet
url2 = "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"

# skipinitialspace=True removes the extra spaces in the columns
Adult = pd.read_csv(url2, header=None, skipinitialspace=True)

# Assign reasonable column names to the DataFrame
Adult.columns = ["age", "workclass", "fnlwgt", "education", "educationnum", "maritalstatus", "occupation",
                 "relationship", "race", "sex", "capitalgain", "capitalloss", "hoursperweek", "nativecountry",
                 "less50kmoreeq50kn"]

This works fine. Then

df = Adult

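This works as well. I then saved this DataFrame to a CSV file. Even if you save the file in the same folder as this script, make sure to provide the absolute path to the file.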

df.to_csv('full_path_to_the_file.csv', header=True)
# so something like
# df.to_csv('Users/user_name/Desktop/folder/NameFile.csv', header=True)

Load this CSV file into new_df. Reading it produces a new column that holds the saved index. It is not needed, and you can drop it as follows:

new_df = pd.read_csv('Users/user_name/Desktop/folder/NameFile.csv', index_col = None)
new_df= new_df.drop('Unnamed: 0', axis =1)

When I compare the columns of new_df with those of the original df using this line of code

new_df.columns == df.columns

I get

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
    True,  True,  True,  True,  True,  True])
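
As a side note, the element-wise == check assumes both frames have the same number of columns; with a leftover index column the lengths differ and, as far as I know, the comparison raises instead of returning an array. A sketch of a length-independent check, using made-up frames:

```python
import pandas as pd

# Hypothetical frames: 'same' matches df's columns, 'extra' has a stray index column.
df = pd.DataFrame({"age": [39, 50], "sex": ["Male", "Male"]})
same = pd.DataFrame({"age": [1], "sex": ["Female"]})
extra = pd.DataFrame({"Unnamed: 0": [0], "age": [1], "sex": ["Female"]})

# Index.equals returns a single boolean and tolerates different lengths.
print(df.columns.equals(same.columns))
print(df.columns.equals(extra.columns))

# Index.difference shows which labels were gained on reload.
print(extra.columns.difference(df.columns).tolist())
```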

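You may not have provided the absolute path to the file, or you may be saving the file twice, as below. You only need to save it once.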

df.to_csv('file_name.csv',header = True)

df.to_csv('file_name.csv')
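
The two "Unnamed" columns in the question's output are consistent with the frame having gone through two save/reload cycles, each writing the index and reading it back as an ordinary column. That is my reading of the output rather than something the question states. A minimal sketch with a made-up frame and a hypothetical helper `round_trip`:

```python
import io

import pandas as pd

df = pd.DataFrame({"age": [39, 50, 38]})  # made-up data


def round_trip(frame):
    """Write with the default index, then read back without index_col."""
    return pd.read_csv(io.StringIO(frame.to_csv()))


once = round_trip(df)
twice = round_trip(once)

# Each cycle adds one more leftover index column.
print(list(once.columns))
print(list(twice.columns))
```

Saving with index=False, or reading with index_col=0, stops the columns from piling up.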

Answer 4

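When a DataFrame is saved, the first column is usually the index, so load the index when reading the DataFrame back. Also, whenever you assign a DataFrame to a new variable, make sure you copy it: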

df = Adult.copy()
df.to_csv('file_name.csv',header = True)

To read it back:

df = pd.read_csv('file_name.csv', index_col=0)
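
The reason for the copy can be shown with a small sketch: plain assignment only creates a second name for the same object, so changes made through one name show up under the other. The frame and the column name "age_squared" are invented for illustration.

```python
import pandas as pd

adult = pd.DataFrame({"age": [39, 50, 38]})  # made-up data

alias = adult            # a second name for the same object
snapshot = adult.copy()  # an independent copy

# Adding a column through the alias also changes 'adult'.
alias["age_squared"] = alias["age"] ** 2

print(alias is adult)
print("age_squared" in adult.columns)
print("age_squared" in snapshot.columns)
```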

The first column in the output of print(df.loc[:, "age"].value_counts()) is the index, which is shown whenever you query the DataFrame. To get the counts as a list, use the to_list method:

print(df.loc[:,"age"].value_counts().to_list())

Converting a DataFrame to a CSV file
