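I need to take a CSV file, split each row, and cascade the pieces. The input CSV can have a varying number of columns (always an even number), but it is always split the same way. I decided to use pandas because for some files the output will be 500,000 rows, and I thought it would speed things up.
Input: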
h1 h2 h3 h4 h5 h6
A1 A2 A3 A4 A5 A6
B1 B2 B3 B4 B5 B6
Expected output:
h1 h2 h3 h4 h5 h6
A1 A2
A1 A2 A3 A4
A1 A2 A3 A4 A5 A6
B1 B2
B1 B2 B3 B4
B1 B2 B3 B4 B5 B6
I tried the code below (pieced together from some searching and my own edits). As you can see, it is close, but not what I need.
import pandas as pd

# read_csv already returns a DataFrame, so no extra wrapping is needed
df = pd.read_csv('file.csv')
index_cols = ['h1']
cols = [c for c in df if c not in index_cols]
df2 = df.set_index(index_cols).stack().reset_index(level=1, drop=True).to_frame('Value')
df2 = pd.concat([pd.Series([v if i % len(cols) == n else ''
                            for i, v in enumerate(df2.Value)], name=col)
                 for n, col in enumerate(cols)], axis=1).set_index(df2.index)
df2.to_csv('output.csv')
This gives the following:
h1 h2 h3 h4 h5 h6
A1 A2
A1 A3
A1 A4
A1 A5
A1 A6
Answer 0 (score: 3)
import numpy as np
import pandas as pd

# take number of columns and divide by 2
# this is the number of pairs
pairs = df.shape[1] // 2
# np.repeat takes the number of rows and returns an object to slice
# the dataframe array df.values... then slice... result should be
# of length pairs * len(df)
a = df.values[np.repeat(np.arange(df.shape[0]), pairs)]
# row values to condition with as column vector
dim0 = (np.arange(a.shape[0]) % pairs)[:, None]
# column values to condition with as row vector
dim1 = np.repeat(np.arange(pairs), 2)
# boolean mask to use in np.where generated
# via the magic of numpy broadcasting
mask = dim0 >= dim1
# QED
pd.DataFrame(np.where(mask, a, ''), columns=df.columns)
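To make the broadcasting step above concrete, here is a minimal sketch that builds only the mask for the sample input (2 rows, 3 column pairs). The helper name `cascade_mask` and its parameters are made up for illustration and are not part of the answer's code.

```python
import numpy as np

def cascade_mask(n_rows, pairs):
    """Boolean mask for n_rows input rows, each repeated `pairs` times."""
    # position of each repeated row within its group, as a column vector
    dim0 = (np.arange(n_rows * pairs) % pairs)[:, None]
    # pair index of each column, as a row vector: 0, 0, 1, 1, 2, 2, ...
    dim1 = np.repeat(np.arange(pairs), 2)
    # broadcasting (n_rows * pairs, 1) against (2 * pairs,) gives a 2-D mask
    return dim0 >= dim1

mask = cascade_mask(n_rows=2, pairs=3)
print(mask.astype(int))
# [[1 1 0 0 0 0]
#  [1 1 1 1 0 0]
#  [1 1 1 1 1 1]
#  [1 1 0 0 0 0]
#  [1 1 1 1 0 0]
#  [1 1 1 1 1 1]]
```

Each 1 marks a cell that keeps its value and each 0 a cell that np.where replaces with an empty string, which is the staircase shape in the expected output.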
Answer 1 (score: 3)
Try this:
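import pandas as pd

# DataFrame.append was removed in pandas 2.0, so collect the slices and concatenate once
parts = []
ct = 1
while ct <= df.shape[1] // 2:
    # first 2*ct columns of every row
    parts.append(df[df.columns[:2 * ct]])
    ct += 1
dfNew = pd.concat(parts)
# a stable sort keeps each row's slices in order of increasing width
dfNew = dfNew.sort_values(['h1'], kind='mergesort').reset_index(drop=True).fillna("")
print(dfNew)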
h1 h2 h3 h4 h5 h6
0 A1 A2
1 A1 A2 A3 A4
2 A1 A2 A3 A4 A5 A6
3 B1 B2
4 B1 B2 B3 B4
5 B1 B2 B3 B4 B5 B6
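Sorting on h1 assumes the h1 values are unique and already in the desired order. As a hedged variation on the loop above, the sketch below groups the slices by the original row index instead, so the input order is kept even when h1 repeats or is unsorted. The function name `cascade_concat` is hypothetical, and the sample frame stands in for the CSV from the question.

```python
import pandas as pd

def cascade_concat(df):
    """Stack progressively wider column slices, grouped by original row."""
    pairs = df.shape[1] // 2
    # one slice per prefix width: first 2 columns, first 4, first 6, ...
    parts = [df.iloc[:, :2 * k] for k in range(1, pairs + 1)]
    out = pd.concat(parts)
    # mergesort is stable, so slices of the same row stay in increasing width
    out = out.sort_index(kind='mergesort')
    # restore the original column order and blank out the missing cells
    return out.reindex(columns=df.columns).fillna('').reset_index(drop=True)

df = pd.DataFrame(
    [['A1', 'A2', 'A3', 'A4', 'A5', 'A6'],
     ['B1', 'B2', 'B3', 'B4', 'B5', 'B6']],
    columns=['h1', 'h2', 'h3', 'h4', 'h5', 'h6'],
)
result = cascade_concat(df)
print(result)
```

For the real file, replacing the sample frame with pd.read_csv('file.csv') and writing result.to_csv('output.csv', index=False) should follow the same pattern; those file names are taken from the question.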