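I am working with a modified version of the Adult dataset and want to save it as a CSV. However, after saving it as a CSV and reloading the data for further use, the data does not come back correctly: the headers are not preserved and some columns now appear merged. I have searched this site and elsewhere online and tried what I found, but nothing worked. I load the data with the following code: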
import numpy as np  # Import necessary packages
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import *
url2="http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data" #Reading in Data from a freely and easily available source on the internet
Adult = pd.read_csv(url2, header=None, skipinitialspace=True)  # Strip extra spaces in columns with skipinitialspace=True
##Assigning reasonable column names to the dataframe
Adult.columns = ["age","workclass","fnlwgt","education","educationnum","maritalstatus","occupation",
"relationship","race","sex","capitalgain","capitalloss","hoursperweek","nativecountry",
"less50kmoreeq50kn"]
After imputing the missing values and modifying the DataFrame as needed, I tried:
df = Adult
df.to_csv('file_name.csv',header = True)
df.to_csv('file_name.csv')
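and a few other variations. How can I save the file as a CSV so that the format is correct the next time I read it?
When reloading the data, I use the following code: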
import pandas as pd
df = pd.read_csv('file_name.csv')
When I run df.head, the output is:
<bound method NDFrame.head of Unnamed: 0 Unnamed: 0.1 age ... Black Asian-Pac-Islander Other
0 0 0 39 ... 0 0 0
1 1 1 50 ... 0 0 0
2 2 2 38 ... 0 0 0
3 3 3 53 ... 1 0 0
and the output of print(df.loc[:,"age"].value_counts()) is:
36 898
31 888
34 886
23 877
35 876
which should not contain 2 columns.
Answer 0 (score: 2)
If you pickle it like so:
Adult.to_pickle('adult.pickle')
you will then be able to read it back in with read_pickle, as follows:
original_adult = pd.read_pickle('adult.pickle')
Hope that helps.
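As a rough illustration of why pickling sidesteps the problem, here is a minimal sketch. The small frame, its values, and the temporary path are made up for the example and are not the real Adult data. It suggests that the index and column dtypes come back unchanged, since nothing is converted to text along the way.

```python
import os
import tempfile

import pandas as pd

# Small stand-in frame (hypothetical values, not the real Adult dataset).
adult = pd.DataFrame(
    {
        "age": [39, 50, 38],
        "workclass": pd.Categorical(["State-gov", "Private", "Private"]),
        "capitalgain": [2174.0, 0.0, 0.0],
    },
    index=[10, 20, 30],
)

# Write the pickle into a temporary directory and read it straight back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "adult.pickle")
    adult.to_pickle(path)
    restored = pd.read_pickle(path)

# The restored frame keeps the custom index and the categorical dtype.
print(restored.index.tolist())
print(restored.dtypes)
```

Keep in mind that pickle files are Python-specific and should only be loaded from sources you trust, so CSV may still be preferable when the file has to be shared.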
Answer 1 (score: 1)
If you want to preserve the column order in the output, you can specify the columns explicitly when saving the DataFrame:
import pandas as pd
url2 = "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
df = pd.read_csv(url2, header=None, skipinitialspace=True)
my_columns = ["age", "workclass", "fnlwgt", "education", "educationnum", "maritalstatus", "occupation",
"relationship","race","sex","capitalgain","capitalloss","hoursperweek","nativecountry",
"less50kmoreeq50kn"]
df.columns = my_columns
# do the computation ...
df[my_columns].to_csv('file_name.csv')
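If you are not interested in saving the DataFrame row index, you can pass index=False, as in to_csv('file_name.csv', index=False). Otherwise, you need to specify the index_col parameter when reading the csv file back in.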
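The two options above can be sketched as follows. This is a small illustration under assumptions: the frame is a made-up stand-in for the Adult data, and an in-memory io.StringIO buffer replaces 'file_name.csv' so that nothing is written to disk.

```python
import io

import pandas as pd

# Made-up stand-in for the Adult DataFrame.
df = pd.DataFrame({"age": [39, 50, 38], "sex": ["Male", "Male", "Female"]})

# Option 1: do not write the row index at all.
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
no_index = pd.read_csv(buf)

# Option 2: write the index, then declare column 0 as the index on read.
buf = io.StringIO()
df.to_csv(buf)
buf.seek(0)
with_index = pd.read_csv(buf, index_col=0)

# For comparison: write the index but do not declare it on read.
buf = io.StringIO()
df.to_csv(buf)
buf.seek(0)
stray = pd.read_csv(buf)

print(list(no_index.columns))    # ['age', 'sex']
print(list(with_index.columns))  # ['age', 'sex']
print(list(stray.columns))       # ['Unnamed: 0', 'age', 'sex']
```

The last case is the one that produces the extra 'Unnamed: 0' column: the index was written without a header name, so read_csv treats it as an ordinary, unnamed data column.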
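According to the documentation, value_counts() returns a Series object. You see two columns because the first is the index, holding the age values (36, 31, ...), and the second holds the counts (898, 888, etc.).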
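To make the index/values split concrete, here is a tiny sketch with made-up ages (not the real counts from the question). The printed "two columns" are just the Series index next to its values.

```python
import pandas as pd

# Made-up ages; the real column has many more rows.
ages = pd.Series([36, 36, 36, 31, 31, 34], name="age")
counts = ages.value_counts()

print(type(counts).__name__)  # Series
print(counts.index.tolist())  # [36, 31, 34]  (the distinct ages)
print(counts.tolist())        # [3, 2, 1]     (how often each age occurs)
```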
Answer 2 (score: 1)
I copied your code and it works for me. The column order is preserved.
Let me show what I tried. I ran this batch of code:
import numpy as np  # Import necessary packages
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import *
# Read the data from a freely and easily available source on the internet
url2 = "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
# Strip extra spaces in columns with skipinitialspace=True
Adult = pd.read_csv(url2, header=None, skipinitialspace=True)
##Assigning reasonable column names to the dataframe
Adult.columns =["age","workclass","fnlwgt","education","educationnum","maritalstatus","occupation",
"relationship","race","sex","capitalgain","capitalloss","hoursperweek","nativecountry",
"less50kmoreeq50kn"]
This works fine. Then
df = Adult
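This also works. I then saved this DataFrame to a csv file. Even if you save the file in the same folder as this script, make sure you provide the absolute path to the file.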
df.to_csv('full_path_to_the_file.csv',header = True)
# so something like
#df.to_csv('Users/user_name/Desktop/folder/NameFile.csv',header = True)
Load this csv file into new_df. Reading it generates a new column that tracks the index. It is unnecessary, and you can drop it as follows:
new_df = pd.read_csv('Users/user_name/Desktop/folder/NameFile.csv', index_col = None)
new_df= new_df.drop('Unnamed: 0', axis =1)
When I compare the columns of new_df with those of the original df using this line of code
new_df.columns == df.columns
I get
array([ True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True])
You may not have provided the absolute path to the file, or you may be saving the file twice. You only need to save it once.
df.to_csv('file_name.csv',header = True)
df.to_csv('file_name.csv')
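One plausible way to end up with both 'Unnamed: 0' and 'Unnamed: 0.1', as in the question's output, is to reload a saved file without index_col and then save and reload that frame again. The sketch below is only an illustration of that guess: round_trip is a hypothetical helper, the frame is made up, and an in-memory buffer stands in for the csv file.

```python
import io

import pandas as pd


def round_trip(frame):
    """Write `frame` to CSV with its index, then read it back without index_col."""
    buf = io.StringIO()
    frame.to_csv(buf)
    buf.seek(0)
    return pd.read_csv(buf)


# Made-up stand-in for the Adult DataFrame.
original = pd.DataFrame({"age": [39, 50, 38]})

once = round_trip(original)  # gains one stray index column
twice = round_trip(once)     # gains a second one

print(list(once.columns))
print(list(twice.columns))
```

Each save-and-reload cycle adds one more unnamed column, so the count of such columns hints at how many times the file went through this loop. Passing index=False when saving, or index_col=0 when reading, avoids the build-up.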
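Answer 3 (score: 0)
Usually, when you save a DataFrame, the first column written is the index, so load the index when you read the DataFrame back. Also, whenever you assign a DataFrame to a new variable, make sure you copy it: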
df = Adult.copy()
df.to_csv('file_name.csv',header = True)
Reading:
df = pd.read_csv('file_name.csv', index_col=0)
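The first column in the output of print(df.loc[:,"age"].value_counts()) is the index column, which is displayed whenever you query the DataFrame. To get the counts as a list, use the to_list method: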
print(df.loc[:,"age"].value_counts().to_list())
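The earlier advice in this answer about copying can be illustrated with a short sketch. The frame and the added column names are made up for the example. Plain assignment only creates a second name for the same object, while copy() produces an independent DataFrame.

```python
import pandas as pd

# Made-up stand-in for the Adult DataFrame.
adult = pd.DataFrame({"age": [39, 50, 38]})

alias = adult               # same object, no data is copied
independent = adult.copy()  # separate object with its own data

# Adding a column through the alias also changes `adult`.
alias["age_plus_one"] = alias["age"] + 1
# Adding a column to the copy leaves `adult` untouched.
independent["age_squared"] = independent["age"] ** 2

print(alias is adult)             # True
print(list(adult.columns))        # ['age', 'age_plus_one']
print(list(independent.columns))  # ['age', 'age_squared']
```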