Hello, I have a dataset like this:
array([['1;"Female";133;132;124;"118";"64.5";816932'],
['2;"Male";140;150;124;".";"72.5";1001121'],
['3;"Male";139;123;150;"143";"73.3";1038437'],
['4;"Male";133;129;128;"172";"68.8";965353'],
['5;"Female";137;132;134;"147";"65.0";951545'],
['6;"Female";99;90;110;"146";"69.0";928799'],
['7;"Female";138;136;131;"138";"64.5";991305']], dtype=object)
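I want to convert it into a DataFrame with these columns:
columns = ["Gender", "FSIQ", "VIQ", "PIQ", "Weight", "Height", "MRI_Count"]
NB: in the array, the values within each row are separated by semicolons (;). Please help me organize this into a DataFrame that has the column names above and the row values from the array.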
Answer 0 (score: 2)
Build a DataFrame from the array, then call Series.str.split with expand=True on its single column to create the new columns:
import numpy as np
import pandas as pd

a = np.array([['1;"Female";133;132;124;"118";"64.5";816932'],
['2;"Male";140;150;124;".";"72.5";1001121'],
['3;"Male";139;123;150;"143";"73.3";1038437'],
['4;"Male";133;129;128;"172";"68.8";965353'],
['5;"Female";137;132;134;"147";"65.0";951545'],
['6;"Female";99;90;110;"146";"69.0";928799'],
['7;"Female";138;136;131;"138";"64.5";991305']], dtype=object)
df = pd.DataFrame(a)[0].str.split(';', expand=True)
df.columns = ['ID',"Gender","FSIQ","VIQ","PIQ","Weight","Height","MRI_Count"]
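To make the role of expand=True concrete, here is a minimal sketch on a hypothetical two-row sample (the Series `s` and the names `as_lists` and `as_frame` are made up for illustration and are not part of the answer). Without expand=True the split returns a Series of lists; with it, each field gets its own column. Note that the quotes are still present at this stage.

```python
import pandas as pd

# Hypothetical two-row sample in the same semicolon-separated layout
s = pd.Series(['1;"Female";133', '2;"Male";140'])

# Default: a Series whose values are Python lists
as_lists = s.str.split(';')

# expand=True: a DataFrame with one column per field
as_frame = s.str.split(';', expand=True)

print(type(as_lists).__name__, as_lists.iloc[0])
print(type(as_frame).__name__, as_frame.shape)
```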
Finally, clean the data: remove the surrounding double quotes with Series.str.strip, and convert the remaining columns to numeric with DataFrame.apply and to_numeric:
df['Gender'] = df['Gender'].str.strip('"')
c = ["ID", "FSIQ","VIQ","PIQ","Weight","Height","MRI_Count"]
df[c] = df[c].apply(lambda x: pd.to_numeric(x.str.strip('"'), errors='coerce'))
print(df)
   ID  Gender  FSIQ  VIQ  PIQ  Weight  Height  MRI_Count
0   1  Female   133  132  124   118.0    64.5     816932
1   2    Male   140  150  124     NaN    72.5    1001121
2   3    Male   139  123  150   143.0    73.3    1038437
3   4    Male   133  129  128   172.0    68.8     965353
4   5  Female   137  132  134   147.0    65.0     951545
5   6  Female    99   90  110   146.0    69.0     928799
6   7  Female   138  136  131   138.0    64.5     991305
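The NaN in the Weight column comes from errors='coerce', which turns anything that cannot be parsed as a number (here the "." placeholder) into a missing value. A small sketch on a hypothetical three-value Series (`raw`, `stripped` and `weights` are illustrative names only):

```python
import pandas as pd

# Hypothetical sample of quoted Weight values, including the "." placeholder
raw = pd.Series(['"118"', '"."', '"143"'])

# Remove the surrounding double quotes first
stripped = raw.str.strip('"')

# Unparseable entries become NaN instead of raising an error
weights = pd.to_numeric(stripped, errors='coerce')
print(weights)
```

With the default errors='raise', the same call would raise a ValueError on the "." entry, so coercing is the convenient choice when such placeholders are expected.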
Answer 1 (score: 2)
Another possible solution is to use io.StringIO and pandas.read_csv. Just join the elements of the array with a \n character:
from io import StringIO

import numpy as np
import pandas as pd
# Setup
a = np.array([['1;"Female";133;132;124;"118";"64.5";816932'],
['2;"Male";140;150;124;".";"72.5";1001121'],
['3;"Male";139;123;150;"143";"73.3";1038437'],
['4;"Male";133;129;128;"172";"68.8";965353'],
['5;"Female";137;132;134;"147";"65.0";951545'],
['6;"Female";99;90;110;"146";"69.0";928799'],
['7;"Female";138;136;131;"138";"64.5";991305']])
columns = ["Gender", "FSIQ", "VIQ", "PIQ", "Weight", "Height", "MRI_Count"]
# Seven names for eight fields: the leading field (the ID) ends up as the index
df = pd.read_csv(StringIO('\n'.join(a.ravel())), header=None,
                 sep=';', names=columns, na_values=['.'])
[out]
   Gender  FSIQ  VIQ  PIQ  Weight  Height  MRI_Count
1  Female   133  132  124   118.0    64.5     816932
2    Male   140  150  124     NaN    72.5    1001121
3    Male   139  123  150   143.0    73.3    1038437
4    Male   133  129  128   172.0    68.8     965353
5  Female   137  132  134   147.0    65.0     951545
6  Female    99   90  110   146.0    69.0     928799
7  Female   138  136  131   138.0    64.5     991305
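In the output above the first field of each row became the index even though it was never named. If you prefer to spell that out, one possible variant (an assumption on my part, not what the answer does) is to name the first field and pass it as index_col. The two-row `rows` list and the "ID" label below are illustrative:

```python
from io import StringIO

import pandas as pd

# Two rows from the question, used here as a small hypothetical sample
rows = ['1;"Female";133;132;124;"118";"64.5";816932',
        '2;"Male";140;150;124;".";"72.5";1001121']

# Name all eight fields; "ID" is an illustrative label for the first one
names = ["ID", "Gender", "FSIQ", "VIQ", "PIQ", "Weight", "Height", "MRI_Count"]

df = pd.read_csv(StringIO('\n'.join(rows)), sep=';', header=None,
                 names=names, index_col="ID", na_values=['.'])
print(df)
```

This keeps the same seven data columns while making it explicit in the code where the index comes from.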
pandas should infer the dtypes correctly:
df.info()  # info() prints its report itself and returns None
<class 'pandas.core.frame.DataFrame'>
Int64Index: 7 entries, 1 to 7
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   Gender     7 non-null      object
 1   FSIQ       7 non-null      int64
 2   VIQ        7 non-null      int64
 3   PIQ        7 non-null      int64
 4   Weight     6 non-null      float64
 5   Height     7 non-null      float64
 6   MRI_Count  7 non-null      int64
dtypes: float64(2), int64(4), object(1)
memory usage: 448.0+ bytes