Question

I have a bunch of files in a folder, and I am reading each one (the first column is a word, the second column is a number). They look like this -

    file1  file2
    a  2    a 3
    b  3    b 1 
    c  1     

    so the output would be -
       freq    file_freq
    a   5          2
    b   4          2
    c   1          1

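To explain the output: the second column for a is 2 because it occurs in both files, while for c it is 1 because it only appears in file1. The first column is the total number of times each system call (a, b, c) appears across the files.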

Part of the code -

 while line:
            words=line.split(" ")
            if words[0] in df.index:
                df.(words[0],'frequency')=int(words[1])+df.(words[0],'frequency')
                df.(words[0],'file_frequency')=df.(words[0],'file_frequency')+1

            else:
                df.loc[-1] = [words[0],words[1],1]

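So if the system_call is already in the DataFrame, I want to update its frequency in place (it should be a +=). I am looking for the pandas equivalent of that.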

Edit - I tried

df[words[0]]['frequency'] += words[1]
df[words[0]]['file_frequency'] += 1

but I got KeyError: 'clock_gettime'

Answer 1

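Since you are using pandas, you can perform this task in two steps:

Use pd.concat to combine the data from your input files into a single DataFrame.
Perform a single groupby operation with the 2 calculations you need.

Here is a demo.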

import pandas as pd

# read dataframes; in your code, you can use pd.read_csv
df1 = pd.DataFrame([['a', 2], ['b', 3], ['c', 1]])
df2 = pd.DataFrame([['a', 3], ['b', 1]])

# concatenate dataframes
df = pd.concat([df1, df2], ignore_index=True)

# perform groupby with 2 calculations, using named aggregation
# (the dict-renaming form of agg was removed in pandas 1.0)
res = df.groupby(0)[1].agg(freq='sum', file_freq='size')

print(res)

   freq  file_freq
0                 
a     5          2
b     4          2
c     1          1
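
One caveat: counting rows per group only equals the number of files if each word appears at most once per file. Below is a minimal sketch that tags each row with its source and counts distinct sources instead. The column names `system_call`, `val` and `file`, the labels `file1`/`file2`, and the sample rows (where `a` occurs twice in the first file) are assumptions made for illustration, not part of the original data.

```python
import pandas as pd

# Stand-ins for two input files; 'a' appears twice in the first one.
df1 = pd.DataFrame({'system_call': ['a', 'a', 'b', 'c'], 'val': [1, 1, 3, 1]})
df2 = pd.DataFrame({'system_call': ['a', 'b'], 'val': [3, 1]})

# Tag every row with the file it came from, then stack the frames.
df = pd.concat(
    [df1.assign(file='file1'), df2.assign(file='file2')],
    ignore_index=True,
)

res = df.groupby('system_call').agg(
    freq=('val', 'sum'),            # total count across all files
    file_freq=('file', 'nunique'),  # number of distinct files per call
)
print(res)
```

With a plain row count, `a` would be reported as appearing in 3 files here; `nunique` on the file label keeps it at 2.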

Answer 2

You can use:

from collections import Counter
import glob

import pandas as pd

# glob pattern matching all files in the folder
currentdir = 'path/*.*'

# create 2 counters: one for file counts, one for sums
c1 = Counter()
c2 = Counter()

# loop over the files
for file in glob.glob(currentdir):
    print(file)

    with open(file) as f:
        for line in f:
            # skip blank lines
            if not line.strip():
                continue
            # strip surrounding whitespace, then split on the last run of whitespace
            k, v = line.strip().rsplit(None, 1)
            # count lines per key (equals files if a key appears once per file)
            c1[k] += 1
            # sum the values
            c2[k] += int(v)

# create the final DataFrame only once, from the counters
df = (pd.DataFrame({'frequency': c2, 'file_frequency': c1})
        .rename_axis('system_call')
        .reset_index())
print(df)
  system_call  frequency  file_frequency
0           a          5               2
1           b          4               2
2           c          1               1

Another, slower solution is:

import glob

import pandas as pd

# glob pattern matching all files in the folder
currentdir = 'path/*.*'

n = ['system_call', 'val']
# read every file into a DataFrame and concatenate them
df = pd.concat([pd.read_csv(f, sep=r'\s+', header=None, names=n) for f in glob.glob(currentdir)])
print(df)
  system_call  val
0           a    2
1           b    3
2           c    1
0           a    3
1           b    1

#aggregate sum and count
df = (df.groupby('system_call')['val']
        .agg([('freq', 'sum'), ('file_freq', 'size')])
        .reset_index())
print (df)
  system_call  freq  file_freq
0           a     5          2
1           b     4          2
2           c     1          1
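
As for the KeyError in the question: `df[words[0]]` looks up a column, not a row, so a system call name such as `clock_gettime` is not found. The literal in-place `+=` update can be sketched with `.loc`, assuming the DataFrame is indexed by system call. The `rows` list below is a stand-in for the lines read from the files, and this row-by-row pattern is usually much slower than the aggregations above.

```python
import pandas as pd

# Empty frame indexed by system call (the index labels are added as rows arrive).
df = pd.DataFrame(columns=['frequency', 'file_frequency'])

# Stand-in for (word, count) pairs parsed from file1 and then file2.
rows = [('a', 2), ('b', 3), ('c', 1), ('a', 3), ('b', 1)]

for call, count in rows:
    if call in df.index:
        # .loc[row_label, column_label] addresses a single cell
        df.loc[call, 'frequency'] += count
        df.loc[call, 'file_frequency'] += 1
    else:
        # assigning to a new label appends a row
        df.loc[call] = [count, 1]

print(df)
```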
