I've attached a screenshot to help explain. I pulled a dataframe from the Cleveland heart dataset that takes 76 columns and puts them into 7 columns, wrapping the remaining columns onto the next row. I'm trying to figure out how to get this dataframe into a readable format, like the dataframe on the right.
The variable xyz will always be the same, but the other letter variables I listed will differ. I thought I could start with data.loc[:, :'xyz'], but I'm not sure where to go from here:
data = pd.read_csv("../resources/cleveland.data")
data.loc[:, :'xyz']
From there I'll have to assign column names to those variables. Once I figure this out, the training, testing, and validation part should be much easier. Thanks in advance for your help. (I'm a noob.)
Answer 0 (score: 2)
Input data
1 a b c
d xyz 2 e
f g h xyz
3 i j k
Code
import pandas as pd
import numpy as np
# The initial data doesn't contain a header, so set header=None
df = pd.read_csv("../resources/cleveland.data", header=None)
cols = df.columns.tolist()
# Reset the index to get the row number in the dirty file
df = df.reset_index()
# After melting the df, every value ends up in a single column.
# Sorting by row index and original column keeps them in the right reading order
df = pd.melt(df, id_vars=['index'], value_vars=cols)
df = df.sort_values(by=['index', 'variable'])
# Then you can set the line number
df['line'] = np.where(df.value == 'xyz', 1, np.nan)
df.line = df.line.cumsum()
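# After the cumsum, each 'xyz' marker holds its running count (1, 2, 3, ...)
# and every other row is still NaN, so the bfill below propagates each
# marker's number backwards over the values that belong to its line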
df.line = df.line.bfill()
# If the file doesn't end with 'xyz', we have to set the line number to df.line.max() + 1
df.loc[df.line.isna(), 'line'] = df.line.max() + 1
df.line = df.line.ffill()
# We can set the column names as integers with a groupby cumsum
df['one'] = 1
df['col_name'] = df.groupby(['line'])['one'].cumsum()
df['col_name'] = "col_" + df['col_name'].astype('str')
# Then we can pivot the table
df = df[['value', 'line', 'col_name']]
df = df.pivot(index='line', columns='col_name', values='value')
print(df)
Output data
col_name col_1 col_2 col_3 col_4 col_5 col_6
line
1.0 1 a b c d xyz
2.0 2 e f g h xyz
3.0 3 i j k NaN NaN
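As a follow-up for the asker's next step, here is a minimal sketch of assigning real column names after the pivot. The names below are hypothetical placeholders, since the question doesn't list the actual Cleveland attribute names:

# Hypothetical names -- substitute the real Cleveland attributes here
df = df.drop(columns=['col_6'])  # in this example, 'col_6' only holds the 'xyz' marker
df.columns = ['id', 'feat_1', 'feat_2', 'feat_3', 'feat_4']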
Answer 1 (score: 1)
Use numpy after combining all the values into one big array: the combination of np.array_split + np.where splits at the indices just after each xyz:
test.csv
1,a,b,c,d,e,f,g
h,i,j,k,xyz,2,a,b
c,d,e,f,g,h,i,j
k,xyz
import numpy as np
import pandas as pd
arr = pd.read_csv('test.csv', header=None).values.ravel()
pd.DataFrame(np.array_split(arr, np.where(arr == 'xyz')[0]+1)).dropna(how='all')
0 1 2 3 4 5 6 7 8 9 10 11 12
0 1 a b c d e f g h i j k xyz
1 2 a b c d e f g h i j k xyz
With the data from @CharlesR:
0 1 2 3 4 5
0 1 a b c d xyz
1 2 e f g h xyz
2 3 i j k None None
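To make the split mechanics concrete, a small sketch on a made-up miniature array (not the question's data):

import numpy as np

arr = np.array(['1', 'a', 'xyz', '2', 'b', 'xyz'])
np.where(arr == 'xyz')[0] + 1
# -> array([3, 6]): the positions just after each 'xyz'
np.array_split(arr, np.where(arr == 'xyz')[0] + 1)
# -> [array(['1', 'a', 'xyz']), array(['2', 'b', 'xyz']), array([], dtype='<U3')]
# The trailing empty chunk becomes an all-NaN row in the DataFrame,
# which is what dropna(how='all') removes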