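I have gotten as far as reading multiple text files from a folder, loading them into a single DataFrame, and producing the output below, but I am still struggling with how to reshape it into the desired output (shown further down):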
Name Col2 Col3 Freq File_Path
b h e 43 xyz/fgghh/something_1.txt
g j k 432 xyz/fgghh/something_1.txt
n q e 6 xyz/fgghh/something_1.txt
p p t 3 xyz/fgghh/something_1.txt
uu l x 1 xyz/fgghh/something_1.txt
x r u 23 xyz/fgghh/something_1.txt
b h e 43 xyz/fgghh/something_2.txt
ll e e 1 xyz/fgghh/something_2.txt
n e e 6 xyz/fgghh/something_2.txt
p e e 3 xyz/fgghh/something_2.txt
x y z 23 xyz/fgghh/something_2.txt
zz j k 432 xyz/fgghh/something_2.txt
b h e 43 xyz/fgghh/something.txt
g j k 432 xyz/fgghh/something.txt
n e e 6 xyz/fgghh/something.txt
p e e 3 xyz/fgghh/something.txt
u e e 1 xyz/fgghh/something.txt
yyyy y z 23 xyz/fgghh/something.txt
import pandas as pd
import os
import glob
dirpath= "......"
filenames = glob.glob("...../*.tsv")
list_of_dfs = [pd.read_csv(filename,sep='\t') for filename in filenames]
for dataframe, filename in zip(list_of_dfs, filenames):
    dataframe['File_Path'] = filename
combined_df = pd.concat(list_of_dfs, ignore_index=True,sort=False)
out_df=combined_df.pivot_table(index='Name', columns='File_Path')
out_df.to_csv(os.path.join(dirpath,'myMerged_file_2.txt'), sep='\t', encoding='utf-8',quoting=0,index=False,index_label=None)
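This still doesn't work. I only want the Name column and the Freq values in the output.
I'm not sure how to use merge or concat on this data to make the output look like this (desired output):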
Name  something.txt  something_1.txt  something_2.txt
yyyy  23
b     43             43               43
g     432            432
p     3              3                3
u     1
n     6              6                6
x                    23               23
uu                   1
zz                                    432
ll                                    1
Answer 0 (score: 1)
First, use os.path.basename to extract the file name from each file path. Then you can use groupby, first, and unstack:
import os
(df.groupby([df.Name, df.File_Path.map(os.path.basename)], sort=False)
   .Freq.first()
   .unstack(1, fill_value=''))
File_Path  something_1.txt  something_2.txt  something.txt
Name
b          43               43               43
g          432                               432
n          6                6                6
p          3                3                3
uu         1
x          23               23
ll                          1
zz                          432
u                                            1
yyyy                                         23
Here, the mapped file names are:
df.File_Path.map(os.path.basename)
0 something_1.txt
1 something_1.txt
2 something_1.txt
3 something_1.txt
4 something_1.txt
5 something_1.txt
6 something_2.txt
7 something_2.txt
8 something_2.txt
9 something_2.txt
10 something_2.txt
11 something_2.txt
12 something.txt
13 something.txt
14 something.txt
15 something.txt
16 something.txt
17 something.txt
Name: File_Path, dtype: object
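To try the groupby/first/unstack approach without the original files, here is a minimal sketch on a hypothetical three-name subset of the question's data. The frame `df` and the names `base_names` and `wide` are made up for illustration, and missing combinations are left as NaN here instead of being filled with ''.

```python
import os

import pandas as pd

# Hypothetical miniature of the combined frame from the question.
df = pd.DataFrame({
    "Name": ["b", "g", "b", "ll", "b", "g"],
    "Freq": [43, 432, 43, 1, 43, 432],
    "File_Path": [
        "xyz/fgghh/something_1.txt", "xyz/fgghh/something_1.txt",
        "xyz/fgghh/something_2.txt", "xyz/fgghh/something_2.txt",
        "xyz/fgghh/something.txt", "xyz/fgghh/something.txt",
    ],
})

# Strip the directory part so the columns are plain file names.
base_names = df["File_Path"].map(os.path.basename)

# One row per Name, one column per file; first() keeps the first Freq
# seen for each (Name, file) pair.
wide = (
    df.groupby([df["Name"], base_names], sort=False)["Freq"]
      .first()
      .unstack("File_Path")
)

print(wide)
```

Names that do not occur in a given file show up as NaN in that file's column, which also turns the values into floats; fill or cast them afterwards if you need blanks or integers.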
Another option is to use crosstab:
(pd.crosstab(index=df.Name,
             columns=df.File_Path.map(os.path.basename),
             values=df.Freq,
             aggfunc='sum')
   .fillna(''))
File_Path  something.txt  something_1.txt  something_2.txt
Name
b          43             43               43
g          432            432
ll                                         1
n          6              6                6
p          3              3                3
u          1
uu                        1
x                         23               23
yyyy       23
zz                                         432
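For completeness, the pivot_table attempt from the question can probably be made to work along the same lines. Without `values=`, pivot_table aggregates every numeric column, and writing with `index=False` drops the Name index from the file. The sketch below is one possible fix under those assumptions, not the only one. The sample frame, the `File` column, and the in-memory buffer `buf` are hypothetical stand-ins for the real data and output file.

```python
import io
import os

import pandas as pd

# Hypothetical miniature of combined_df from the question.
combined_df = pd.DataFrame({
    "Name": ["b", "g", "b", "ll", "b", "g"],
    "Col2": ["h", "j", "h", "e", "h", "j"],
    "Col3": ["e", "k", "e", "e", "e", "k"],
    "Freq": [43, 432, 43, 1, 43, 432],
    "File_Path": [
        "xyz/fgghh/something_1.txt", "xyz/fgghh/something_1.txt",
        "xyz/fgghh/something_2.txt", "xyz/fgghh/something_2.txt",
        "xyz/fgghh/something.txt", "xyz/fgghh/something.txt",
    ],
})

# Use the bare file name (assumed column name "File") as the column key.
combined_df["File"] = combined_df["File_Path"].map(os.path.basename)

# Restrict the pivot to Freq so the other columns are ignored.
out_df = combined_df.pivot_table(
    index="Name", columns="File", values="Freq", aggfunc="first"
)

# Nullable integers keep 43 as 43 (not 43.0) and leave gaps as <NA>.
out_df = out_df.astype("Int64")

# Write tab-separated text; keeping the index preserves the Name column.
buf = io.StringIO()
out_df.to_csv(buf, sep="\t")
tsv_text = buf.getvalue()
print(tsv_text)
```

To write to disk instead, pass a file path to `to_csv` in place of `buf`; missing values are written as empty fields by default.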