Question

Hello, I have a dataset like this:

array([['1;"Female";133;132;124;"118";"64.5";816932'],
       ['2;"Male";140;150;124;".";"72.5";1001121'],
       ['3;"Male";139;123;150;"143";"73.3";1038437'],
       ['4;"Male";133;129;128;"172";"68.8";965353'],
       ['5;"Female";137;132;134;"147";"65.0";951545'],
       ['6;"Female";99;90;110;"146";"69.0";928799'],
       ['7;"Female";138;136;131;"138";"64.5";991305']], dtype=object)

I want to convert it into a DataFrame with these columns:

columns = ["Gender", "FSIQ", "VIQ", "PIQ", "Weight", "Height", "MRI_Count"]

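NB: in the array, the values within each row are separated by semicolons (;). Please help me organize it into a DataFrame that has the column names above and the row values from the array.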

Answer 1

Build a DataFrame from the array, then use Series.str.split with expand=True to split each string into new columns:

import numpy as np
import pandas as pd

a = np.array([['1;"Female";133;132;124;"118";"64.5";816932'],
       ['2;"Male";140;150;124;".";"72.5";1001121'],
       ['3;"Male";139;123;150;"143";"73.3";1038437'],
       ['4;"Male";133;129;128;"172";"68.8";965353'],
       ['5;"Female";137;132;134;"147";"65.0";951545'],
       ['6;"Female";99;90;110;"146";"69.0";928799'],
       ['7;"Female";138;136;131;"138";"64.5";991305']], dtype=object)

df = pd.DataFrame(a)[0].str.split(';', expand=True)
df.columns = ['ID',"Gender","FSIQ","VIQ","PIQ","Weight","Height","MRI_Count"]
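
To illustrate what the split step produces on its own, here is a minimal sketch on two made-up rows (the Series `s` and its values are hypothetical, shortened from the question's layout). It shows that each field lands in its own column and that the double quotes are still part of the strings at this point:

```python
import pandas as pd

# Two hypothetical rows in the same semicolon-delimited layout as the question's data
s = pd.Series(['1;"Female";133', '2;"Male";140'])

# expand=True returns a DataFrame with one column per field
parts = s.str.split(';', expand=True)

print(parts.shape)
print(parts[1].tolist())  # the quotes have not been removed yet
```

This is why the cleanup step that follows is still needed after splitting.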

Finally, clean up the data: remove the surrounding double quotes with Series.str.strip, and convert the columns to numeric by applying to_numeric through DataFrame.apply:

df['Gender'] = df['Gender'].str.strip('"')
c = ["ID", "FSIQ","VIQ","PIQ","Weight","Height","MRI_Count"]
df[c] = df[c].apply(lambda x: pd.to_numeric(x.str.strip('"'), errors='coerce'))
print(df)
  ID  Gender  FSIQ  VIQ  PIQ  Weight  Height  MRI_Count
0  1  Female   133  132  124   118.0    64.5     816932
1  2    Male   140  150  124     NaN    72.5    1001121
2  3    Male   139  123  150   143.0    73.3    1038437
3  4    Male   133  129  128   172.0    68.8     965353
4  5  Female   137  132  134   147.0    65.0     951545
5  6  Female    99   90  110   146.0    69.0     928799
6  7  Female   138  136  131   138.0    64.5     991305
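
The NaN in the Weight column comes from errors='coerce': the "." placeholder cannot be parsed as a number, so it is replaced with a missing value instead of raising. A small sketch of that behavior on a made-up Series (`raw` is a hypothetical name, not part of the answer):

```python
import pandas as pd

# Hypothetical quoted values, including the "." placeholder from the question's data
raw = pd.Series(['"118"', '"."', '"143"'])

# Strip the quotes first, then convert; unparseable entries become NaN
cleaned = pd.to_numeric(raw.str.strip('"'), errors='coerce')

print(cleaned.tolist())
print(cleaned.isna().tolist())
```

Without errors='coerce', to_numeric would raise on the "." entry.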

Answer 2

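Another possible solution is to use io.StringIO and pandas.read_csv. Simply join the elements of the array with a \n character so that each one becomes a line of CSV text: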

from io import StringIO

import numpy as np
import pandas as pd

# Setup
a = np.array([['1;"Female";133;132;124;"118";"64.5";816932'],
       ['2;"Male";140;150;124;".";"72.5";1001121'],
       ['3;"Male";139;123;150;"143";"73.3";1038437'],
       ['4;"Male";133;129;128;"172";"68.8";965353'],
       ['5;"Female";137;132;134;"147";"65.0";951545'],
       ['6;"Female";99;90;110;"146";"69.0";928799'],
       ['7;"Female";138;136;131;"138";"64.5";991305']])

columns = ["Gender", "FSIQ", "VIQ", "PIQ", "Weight", "Height", "MRI_Count"]

df = pd.read_csv(StringIO('\n'.join(a.ravel())), header=None,
                 sep=';', names=columns, na_values=['.'])

[out]

   Gender  FSIQ  VIQ  PIQ  Weight  Height  MRI_Count
1  Female   133  132  124   118.0    64.5     816932
2    Male   140  150  124     NaN    72.5    1001121
3    Male   139  123  150   143.0    73.3    1038437
4    Male   133  129  128   172.0    68.8     965353
5  Female   137  132  134   147.0    65.0     951545
6  Female    99   90  110   146.0    69.0     928799
7  Female   138  136  131   138.0    64.5     991305
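
Note that the rows contain eight fields while only seven names are passed, so the leading ID field ends up as the index, as the output above shows. If you prefer to make that explicit, one option is to name all eight fields and set index_col yourself. This is a sketch on two of the rows; the "ID" name is taken from Answer 1, and `rows` and `names` are illustrative variables:

```python
from io import StringIO

import pandas as pd

# Two rows from the question's data, as plain strings
rows = ['1;"Female";133;132;124;"118";"64.5";816932',
        '2;"Male";140;150;124;".";"72.5";1001121']

# Name every field, including the leading ID, and use it as the index explicitly
names = ["ID", "Gender", "FSIQ", "VIQ", "PIQ", "Weight", "Height", "MRI_Count"]

df = pd.read_csv(StringIO('\n'.join(rows)), header=None, sep=';',
                 names=names, index_col="ID", na_values=['.'])
print(df)
```

Dropping index_col="ID" from this call would keep ID as an ordinary column instead.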

pandas should infer the dtypes correctly:

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7 entries, 1 to 7
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Gender     7 non-null      object 
 1   FSIQ       7 non-null      int64  
 2   VIQ        7 non-null      int64  
 3   PIQ        7 non-null      int64  
 4   Weight     6 non-null      float64
 5   Height     7 non-null      float64
 6   MRI_Count  7 non-null      int64  
dtypes: float64(2), int64(4), object(1)
memory usage: 448.0+ bytes
