I have a bunch of files in a folder, and I am reading each one (the first column is a word, the second column is a number). They look like this -
file1      file2
a 2        a 3
b 3        b 1
c 1
so the output would be -
   freq  file_freq
a     5          2
b     4          2
c     1          1
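To explain the output: the second column for a is 2 because it appears in both files, while for c it is 1 because it appears only in file1. The first column is the total number of times each system call (a, b, c) appears across the files.
Part of the code -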
while line:
    words=line.split(" ")
    if words[0] in df.index:
        df.(words[0],'frequency')=int(words[1])+df.(words[0],'frequency')
        df.(words[0],'file_frequency')=df.(words[0],'file_frequency')+1
    else:
        df.loc[-1] = [words[0],words[1],1]
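So if the system call is already in the dataframe, I want to update its frequency (it should be a += operation). I am looking for the pandas equivalent of this.
Edit - I tried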
df[words[0]]['frequency'] += words[1]
df[words[0]]['file_frequency'] += 1
but I got KeyError: 'clock_gettime'
Answer 0 (score: 1)
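Since you are using pandas, you can perform this task in two steps:
1. Use pd.concat to combine the data from your input files into one dataframe.
2. Perform a single groupby operation with 2 calculations.
Here is a demo.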
import pandas as pd

# read dataframes; in your code, you can use pd.read_csv
df1 = pd.DataFrame([['a', 2], ['b', 3], ['c', 1]])
df2 = pd.DataFrame([['a', 3], ['b', 1]])

# concatenate dataframes
df = pd.concat([df1, df2], ignore_index=True)

# perform groupby with 2 calculations; named aggregation (pandas 0.25+)
# replaces the dict-renaming form, which pandas 1.0 removed
res = df.groupby(0)[1].agg(freq='sum', file_freq='size')
print(res)
   freq  file_freq
0
a     5          2
b     4          2
c     1          1
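As for the KeyError in the question's edit: df[words[0]] looks up a column named 'clock_gettime', not a row, so it fails when that name exists only in the index. Below is a minimal sketch of the row-wise += update using df.loc, assuming the system call names form the index and the columns are named frequency and file_frequency as in the question. The add_line helper and the seeded starting frame are illustrative and not part of the original code.

```python
import pandas as pd

# Running totals indexed by system call name,
# seeded with the first two lines of file1 from the question.
df = pd.DataFrame(
    {"frequency": [2, 3], "file_frequency": [1, 1]},
    index=pd.Index(["a", "b"], name="system_call"),
)


def add_line(df, line):
    """Fold one 'name count' line into df in place (hypothetical helper)."""
    name, count = line.split()
    if name in df.index:
        # df.loc[row, column] addresses a single cell, so += works on it
        df.loc[name, "frequency"] += int(count)
        df.loc[name, "file_frequency"] += 1
    else:
        # an unseen name is appended as a new row (setting with enlargement)
        df.loc[name] = [int(count), 1]


# remaining sample lines: the last line of file1, then both lines of file2
for line in ["c 1", "a 3", "b 1"]:
    add_line(df, line)

print(df)
```

Updating one cell at a time like this is usually much slower than a single concat and groupby, so it is probably only worth using when the data cannot be loaded in one go.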
Answer 1 (score: 0)
You can use:
from collections import Counter
import glob

import pandas as pd

# add /*.* to match the files in the folder
currentdir = 'path/*.*'

# create 2 counters
c1 = Counter()  # number of lines on which each word appears
c2 = Counter()  # sum of the values for each word

# loop over the files
for file in glob.glob(currentdir):
    print(file)
    with open(file) as f:
        for line in f:
            # split once from the right, at the last space
            k, v = line.rsplit(' ', 1)
            # remove trailing whitespace
            k, v = k.strip(), v.strip()
            # get counts
            c1[k] += 1
            # get sums
            c2[k] += int(v)

# create the final DataFrame only once, from the counters
df = (pd.DataFrame({'frequency': c2, 'file_frequency': c1})
        .rename_axis('system_call')
        .reset_index())
print(df)
  system_call  frequency  file_frequency
0           a          5               2
1           b          4               2
2           c          1               1
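Note that the loop above increments c1 once per line, so file_frequency equals the number of files only when each system call appears at most once per file. If that does not hold, one possible variant is to collect the names seen in each file in a set and count each set once. The sketch below assumes whitespace-separated "name count" lines; count_system_calls is a hypothetical helper, and the in-memory lists stand in for open file handles.

```python
from collections import Counter


def count_system_calls(files):
    """Return (totals, file_counts) for an iterable of line iterables.

    Each element of `files` can be an open file handle or a list of
    'name count' lines (hypothetical helper, not from the answer above).
    """
    totals = Counter()       # sum of the counts per name
    file_counts = Counter()  # number of distinct files containing the name
    for lines in files:
        seen = set()
        for line in lines:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            # split once from the right on any whitespace
            name, value = line.rsplit(None, 1)
            totals[name] += int(value)
            seen.add(name)
        # every name found in this file counts once, however often it occurred
        file_counts.update(seen)
    return totals, file_counts


# sample data from the question
file1 = ["a 2\n", "b 3\n", "c 1\n"]
file2 = ["a 3\n", "b 1\n"]
totals, file_counts = count_system_calls([file1, file2])
print(totals, file_counts)
```

The two counters can then be passed to pd.DataFrame in the same way as c2 and c1 above.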
Another, slower solution is:
import glob

import pandas as pd

# add /*.* to match the files in the folder
currentdir = 'path/*.*'
n = ['system_call', 'val']

# read every file into a DataFrame and concatenate them
df = pd.concat([pd.read_csv(f, sep=r'\s+', header=None, names=n)
                for f in glob.glob(currentdir)])
print(df)
  system_call  val
0           a    2
1           b    3
2           c    1
0           a    3
1           b    1
# aggregate sum and count
df = (df.groupby('system_call')['val']
        .agg([('freq', 'sum'), ('file_freq', 'size')])
        .reset_index())
print(df)
  system_call  freq  file_freq
0           a     5          2
1           b     4          2
2           c     1          1
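The size aggregation above counts rows, which matches the number of files only if a system call never repeats within one file. A possible alternative, sketched here under that caveat, tags each row with its source file and counts distinct files with nunique. The frames dict and the file column are illustrative names, and the in-memory frames stand in for what pd.read_csv would return.

```python
import pandas as pd

# Stand-ins for the per-file frames that pd.read_csv would produce
# (sample data from the question; the dict keys play the role of file names).
frames = {
    "file1": pd.DataFrame({"system_call": ["a", "b", "c"], "val": [2, 3, 1]}),
    "file2": pd.DataFrame({"system_call": ["a", "b"], "val": [3, 1]}),
}

# tag every row with the file it came from, then stack the frames
df = pd.concat(
    [frame.assign(file=name) for name, frame in frames.items()],
    ignore_index=True,
)

# sum the values and count the distinct files per system call
res = (df.groupby("system_call")
         .agg(freq=("val", "sum"), file_freq=("file", "nunique"))
         .reset_index())
print(res)
```

For the sample data this gives the same totals as the size-based version; the two differ only when a name occurs more than once in a single file.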