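I have a data frame (df) whose column col1 has many rows. Some of those rows contain a common string (Collection of numbers are) that ends in a different number (001, 002, 005). I want to extract the rows between two such strings (from Collection of numbers are 002 to Collection of numbers are 003) and assign them to a new column named after the first of them (Collection of numbers are 002).
In other words, each header row should become a column that holds the rows following it.
Note: no number is repeated.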
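The asker's sample frames did not survive in this copy of the thread. A minimal reconstruction of the input might look like the sketch below; the values are an assumption, taken from the output printed in the first answer rather than from the original question.

```python
import pandas as pd

# Reconstructed sample input (assumption: values come from the outputs
# shown in the answers, not from the asker's original frame).
df = pd.DataFrame({
    "col1": [
        "Collection of numbers are 002", "53", "20", "56",
        "Collection of numbers are 003", "236", "325",
        "Collection of numbers are 005", "96", "23", "63",
    ]
})

# Header rows are the ones that contain the common string.
is_header = df["col1"].str.contains("Collection of numbers are")
print(df[is_header])
```

Every row that is not a header belongs to the nearest header above it, which is the relationship the answers rely on.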
Answer 0 (score: 4)
We can try ffill, together with str.split and some basic reshaping:
df['headers'] = df['col1'].str.extract('(Collection.*)').ffill()
df1 = df[~df['col1'].str.contains('Collection')].copy()
df1.groupby('headers').agg(','.join)['col1'].str.split(',',expand=True).T.rename_axis('',axis='columns')
Out:
  Collection of numbers are 002 Collection of numbers are 003  \
0                            53                           236
1                            20                           325
2                            56                          None

  Collection of numbers are 005
0                            96
1                            23
2                            63
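To see why this works, it may help to split the one-liner into its steps. The following is a sketch only, run against a reconstructed input (an assumption based on the output above); the variable names headers, values and wide are illustrative and not part of the answer.

```python
import pandas as pd

# Reconstructed input (assumption, based on the output shown above).
df = pd.DataFrame({"col1": [
    "Collection of numbers are 002", "53", "20", "56",
    "Collection of numbers are 003", "236", "325",
    "Collection of numbers are 005", "96", "23", "63",
]})

# Step 1: header text where present, NaN elsewhere, then forward-filled
# so that every row carries the header it belongs to.
headers = df["col1"].str.extract("(Collection.*)", expand=False).ffill()

# Step 2: keep only the value rows.
values = df.loc[~df["col1"].str.contains("Collection"), "col1"]

# Step 3: one comma-joined string per header, split back into columns,
# then transposed so that each header becomes a column.
wide = (
    values.groupby(headers.loc[values.index])
    .agg(",".join)
    .str.split(",", expand=True)
    .T
)
print(wide)
```

Note that the join/split round trip only works while the values are strings and contain no commas themselves.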
Answer 1 (score: 1)
You can use set_index and unstack. I borrowed @Datanovice's idea for extracting the names of the future columns, and used groupby.cumcount to get the future index numbers:
arrCollection = df['col1'].str.extract('(Collection.*)').ffill()[0].to_numpy()
df_f = df.set_index([df.groupby(arrCollection)['col1'].cumcount()-1,
                     arrCollection])['col1']\
         .unstack().iloc[1:,:]
print(df_f)
  Collection 002 Collection 003 Collection 005
0             53            236             96
1             20            325             23
2             56            NaN             63
Note: the column names will look like those in your example; I did not use exactly the same input.
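The role of cumcount is easier to see in isolation. The sketch below uses a reconstructed input (an assumption) and swaps set_index/unstack for pivot, which is a different but closely related call; the names headers, pos and wide are illustrative.

```python
import pandas as pd

# Reconstructed input (assumption, based on the outputs in this thread).
df = pd.DataFrame({"col1": [
    "Collection of numbers are 002", "53", "20", "56",
    "Collection of numbers are 003", "236", "325",
    "Collection of numbers are 005", "96", "23", "63",
]})

headers = df["col1"].str.extract("(Collection.*)", expand=False).ffill()

# Position of each row within its header group. The header row itself
# is position 0, so subtracting 1 marks it with -1 and numbers the
# value rows from 0.
pos = df.groupby(headers)["col1"].cumcount() - 1

wide = (
    df.assign(header=headers, pos=pos)
    .query("pos >= 0")  # drop the header rows
    .pivot(index="pos", columns="header", values="col1")
)
print(wide)
```

Filtering on pos >= 0 plays the same part as .iloc[1:, :] in the answer: both discard the row that holds the header strings.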
Answer 2 (score: 0)
In:
                    col1
0   c of numbers are 002
1                      1
2                      2
3                      3
4   c of numbers are 003
5                     55
6                     66
7   c of numbers are 005
8                     45
9                     23
10                    12
11                   456
12                    56
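import pandas as pd

for_concat = []  # one single-column frame per header
col = []         # values collected for the current header
for i, r in df.iterrows():
    if "numbers" in str(r["col1"]):
        if col:
            # A new header closes the previous group.
            for_concat.append(pd.DataFrame(col, columns=[col_name]))
            col_name = r["col1"]
            col = []
        else:
            col_name = r["col1"]
    else:
        col.append(r["col1"])
# Flush the last group, which has no header after it.
for_concat.append(pd.DataFrame(col, columns=[col_name]))
out = pd.concat(for_concat, axis=1)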
Out:
  c of numbers are 002 c of numbers are 003 c of numbers are 005
0                  1.0                 55.0                   45
1                  2.0                 66.0                   23
2                  3.0                  NaN                   12
3                  NaN                  NaN                  456
4                  NaN                  NaN                   56
Answer 3 (score: 0)
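The answer provided by Datanovice seems fine. Another solution is to use the following function:
import pandas as pd

def extract_columns(df, column, common_string):
    df_list = df[column].tolist()
    columns = {}
    row_indices = []
    cols = []
    # Record the position and text of every header row.
    for ind, elem in enumerate(df_list):
        if common_string in str(elem):
            row_indices.append(ind)
            cols.append(elem)
    # Sentinel so that the last group runs to the end of the list.
    row_indices.append(len(df_list))
    for ind, col in enumerate(cols):
        # Slice out the rows between this header and the next one.
        columns[col] = pd.Series(df_list[row_indices[ind]+1:row_indices[ind+1]])
    # Building the frame in one step pads shorter groups with NaN; assigning
    # columns one by one to an empty frame would truncate later, longer groups.
    return pd.DataFrame(columns)
So, with the example data frame, you would call the function like this:
extract_columns(df, 'col1', 'Collection of numbers are')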
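Groups of unequal length are worth checking with this kind of slicing, because assigning a Series to a column of an existing frame aligns it to the frame's current index. The sketch below sidesteps that by using pd.concat; the helper name split_by_marker is hypothetical, and the data mirrors the input shown in Answer 2.

```python
import pandas as pd

def split_by_marker(values, marker):
    """Hypothetical helper: one column per item that contains `marker`.

    Assumes at least one item contains the marker.
    """
    starts = [i for i, v in enumerate(values) if marker in str(v)]
    bounds = starts + [len(values)]
    columns = [
        pd.Series(values[lo + 1:hi], name=values[lo], dtype=object)
        for lo, hi in zip(bounds[:-1], bounds[1:])
    ]
    # concat along axis=1 takes the union of the indexes, so shorter
    # groups are padded with NaN instead of longer ones being cut off.
    return pd.concat(columns, axis=1)

data = ["c of numbers are 002", 1, 2, 3,
        "c of numbers are 003", 55, 66,
        "c of numbers are 005", 45, 23, 12, 456, 56]
out = split_by_marker(data, "numbers are")
print(out)
```

Here the last group has five values while the first has three, so the result has five rows and the shorter columns end in NaN.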