Question

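I have a list of files, each with two columns. The first column contains words and the second contains numbers.

I want to extract all the unique words from the files and sum the numbers associated with each one. I was able to do this...

The second task is to count the number of files in which each word was found. I'm having trouble with this part... I am using a dictionary.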

Here is my code:

import os

currentdir = " "  # CHANGE INPUT PATH
resultdir = " "  # CHANGE OUTPUT ACCORDINGLY

if not os.path.exists(resultdir):
    os.makedirs(resultdir)

# Maps each word to the sum of its numbers across all files
systemcallcount = {}
for root, dirs, files in os.walk(currentdir):
    for name in files:
        with open(os.path.join(root, name), 'r') as outfile2:
            for line in outfile2:
                # split() with no argument tolerates repeated spaces
                words = line.split()
                if words[0] not in systemcallcount:
                    systemcallcount[words[0]] = int(words[1])
                else:
                    systemcallcount[words[0]] += int(words[1])

for keys, values in systemcallcount.items():
    print(keys)
    print(values)

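For example, suppose I have two files, file1 and file2. In the second column of the output (the file count), a should be 2 because it occurs in both files, whereas c should be 1 because it appears only in file1.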

Answer 1

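One approach is to use collections.defaultdict. You can build a set of the words in each file, then increment a dictionary counter once per word for each file it appears in.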

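import os
from collections import defaultdict

# Maps each word to the number of files it appears in
d = defaultdict(int)

for root, dirs, files in os.walk(currentdir):
    for name in files:
        with open(os.path.join(root, name), 'r') as outfile2:
            # Unique first-column words in this file; blank lines are skipped
            words = {line.split()[0] for line in outfile2 if line.strip()}
            for word in words:
                d[word] += 1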
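
To cover both tasks in a single pass, the per-file set idea can be combined with a running sum. The following is a minimal sketch, assuming whitespace-separated two-column files; the function name `summarize` and the `(total, file_count)` result layout are illustrative choices, not part of the original answer.

```python
import os
from collections import defaultdict


def summarize(directory):
    """Return {word: (sum_of_numbers, number_of_files_containing_word)}."""
    totals = defaultdict(int)       # word -> sum of second-column values
    file_counts = defaultdict(int)  # word -> number of files containing it

    for root, _dirs, files in os.walk(directory):
        for name in files:
            seen = set()  # words found in the current file
            with open(os.path.join(root, name)) as f:
                for line in f:
                    parts = line.split()
                    if len(parts) < 2:
                        continue  # skip blank or malformed lines
                    totals[parts[0]] += int(parts[1])
                    seen.add(parts[0])
            # Each file contributes at most 1 to a word's file count
            for word in seen:
                file_counts[word] += 1

    return {word: (totals[word], file_counts[word]) for word in totals}
```

Calling `summarize(currentdir)` on the directory from the question would then give, for each word, its total and the number of files it was found in.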

Answer 2

I hope this helps.

This code takes a string and checks which files in a folder contain it.

# https://www.opentechguides.com/how-to/article/python/59/files-containing-text.html
import os

search_string = "python"
search_path = r"C:\Users\You\Desktop\Project\Files"
extension = "txt"  # file extension to search

# Number of files that contain the search string
files_no = 0

# Loop through the files in the specified path
for fname in os.listdir(search_path):
    if fname.endswith(extension):
        found = False
        # Open the file for reading; it is closed automatically
        with open(os.path.join(search_path, fname)) as fo:
            # Walk the lines, counting line numbers from 1
            for line_no, line in enumerate(fo, start=1):
                # Search for the string in the line
                index = line.find(search_string)
                if index != -1:
                    # Print the occurrence
                    print(fname, "[", line_no, ",", index, "] ", line, sep="")
                    found = True
        # Count each matching file once
        if found:
            files_no += 1
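
If only the number of matching files is needed, the loop above can be condensed into a function. This is a rough sketch, not the linked article's code; the name `count_files_containing` and its parameters are hypothetical. Note that it matches substrings, so searching for `a` would also match a line containing `abc`.

```python
import os


def count_files_containing(search_path, search_string, extension="txt"):
    """Count files in search_path (non-recursive) with at least one match."""
    count = 0
    for fname in os.listdir(search_path):
        if not fname.endswith(extension):
            continue
        with open(os.path.join(search_path, fname)) as f:
            # any() stops reading at the first matching line
            if any(search_string in line for line in f):
                count += 1
    return count
```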

Answer 3

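Another approach is to use Pandas to handle both of your tasks.

Read the files into tables.
Note the source file in a separate column.
Apply functions to get the unique words, sum the numbers, and count the source files for each word.

Here is the code: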

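import os
import sys

import pandas as pd

# currentdir is the input directory from the question
files = os.listdir(currentdir)

dfs = []
for f in files:
    # Columns are whitespace-separated here; use sep='\t' for tab-separated files
    df = pd.read_csv(os.path.join(currentdir, f), sep=r'\s+', header=None,
                     names=['word', 'number'])
    df['source_file'] = f
    dfs.append(df)

# For each word: sum the numbers and count the distinct source files
df = (pd.concat(dfs, ignore_index=True)
        .groupby('word')
        .agg(total=('number', 'sum'), n_files=('source_file', 'nunique'))
        .reset_index())

# Print result to standard output
df.to_csv(sys.stdout, sep='\t', header=False, index=False)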

For reference, see: Pandas groupby: How to get a union of strings

Answer 4

It seems you want to parse the files into a dictionary of lists, so that for the input you provided:

file1    file2
a 2      a 3
b 3      b 1
c 1

...parsing yields the following data structure:

{'a': [2, 3], 'b': [3, 1], 'c': [1]}

From this, you can easily get everything you need.

Parsing the files this way should be fairly simple using a defaultdict:

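from collections import defaultdict

# Maps each name to the list of numbers found for it, one per occurrence
parsed_data = defaultdict(list)

# list_of_filenames: the paths of the input files
for filename in list_of_filenames:
    with open(filename) as f:
        for line in f:
            name, number = line.split()
            parsed_data[name].append(int(number))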

After that, printing the data you are interested in should be trivial:

for name, values in parsed_data.items():
    print('{} {} {}'.format(name, sum(values), len(values)))

This solution assumes that the same name does not appear twice in the same file. The question does not say what should happen in that case.
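
If that assumption does not hold, one possible variation is to key the numbers by file name so that repeats within a file do not inflate the file count. This is only a sketch under that assumption: the `parse` helper and the in-memory `files` mapping are hypothetical stand-ins for reading real files.

```python
from collections import defaultdict


def parse(files):
    """files: mapping of file name -> iterable of 'word number' lines."""
    # word -> {file name -> sum of that word's numbers within the file}
    parsed = defaultdict(lambda: defaultdict(int))
    for filename, lines in files.items():
        for line in lines:
            name, number = line.split()
            parsed[name][filename] += int(number)
    return parsed


# Hypothetical input: 'a' appears twice in file1
files = {
    "file1": ["a 2", "a 4", "c 1"],
    "file2": ["a 3"],
}

for name, per_file in parse(files).items():
    # Total across all files, then the number of distinct files
    print(name, sum(per_file.values()), len(per_file))
```

With this input, `a` is reported with a total of 9 and a file count of 2, even though it occurs three times overall.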

TL;DR: The solution to your problem is defaultdict.

Adding multiple values to a dictionary key
