Question

I have two files:

File 1:

key.1   10    6 
key.2    5    6
key.3.   5    8
key.4.   5    10
key.5    4    12

File 2:

key.1   10    6 
key.2    6    6
key.4    5    10
key.5    2    8

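I have a fairly complicated problem. I want to take the average between the two files for each loc. ID. But if an ID is unique to either file, I just want to keep that value as-is in the output file. So the output file would look like this: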

key.1   10    6 
key.2   5.5   6
key.3.   5    8
key.4.   5    10
key.5    3    10

This is just an example. In reality I have 100 columns that I want to average.

Answer 1

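The following solution uses Pandas and assumes your data is stored in the plain text files 'file1.txt' and 'file2.txt'. If this assumption is incorrect, let me know; adapting it to a different file type would probably be a minimal edit. If I have misunderstood what you mean by "files" and your data is already in DataFrames, you can ignore the first step.

First, read the data into DataFrames: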

import pandas as pd
df1 = pd.read_table('file1.txt', sep=r'\s+', header=None)
df2 = pd.read_table('file2.txt', sep=r'\s+', header=None)

This gives us:

In [9]: df1
Out[9]: 
       0   1   2
0  key.1  10   6
1  key.2   5   6
2  key.3   5   8
3  key.4   5  10
4  key.5   4  12

In [10]: df2
Out[10]: 
       0   1   2
0  key.1  10   6
1  key.2   6   6
2  key.4   5  10
3  key.5   2   8

Then join the two datasets on column 0:

combined = pd.merge(df1, df2, 'outer', on=0)

which gives:

       0  1_x  2_x   1_y   2_y
0  key.1   10    6  10.0   6.0
1  key.2    5    6   6.0   6.0
2  key.3    5    8   NaN   NaN
3  key.4    5   10   5.0  10.0
4  key.5    4   12   2.0   8.0

This is a bit messy, but we can select just the columns we want after doing the calculation:

combined[1] = combined[['1_x', '1_y']].mean(axis=1)
combined[2] = combined[['2_x', '2_y']].mean(axis=1)
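
The reason key.3 keeps its value here is that `mean` skips NaN entries by default (`skipna=True`), so a row with only one real value averages to that value. A small sketch of that behaviour, using a made-up two-row frame rather than the real `combined` data:

```python
import numpy as np
import pandas as pd

# Hypothetical frame standing in for the '1_x' / '1_y' columns of `combined`
pair = pd.DataFrame({"1_x": [5, 4], "1_y": [np.nan, 2.0]})

# Default behaviour: NaN is ignored, so the first row averages to its only value
row_means = pair.mean(axis=1)

# With skipna=False the NaN propagates instead
strict = pair.mean(axis=1, skipna=False)

print(row_means.tolist())
print(strict.tolist())
```

If you ever wanted unmatched keys to be dropped instead of kept, `skipna=False` followed by `dropna()` would be one way to do that.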

Select only the useful columns:

results = combined[[0, 1, 2]]

Which gives us:

       0     1     2
0  key.1  10.0   6.0
1  key.2   5.5   6.0
2  key.3   5.0   8.0
3  key.4   5.0  10.0
4  key.5   3.0  10.0

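Which I believe is what you are looking for.

You did not say what file format you want for the output, but the following will give you a tab-separated text file. If you need something different, let me know and I can edit.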

results.to_csv('output.txt', sep='\t', header=None, index=False)

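I should add that it is better to give your columns meaningful labels rather than using numbers as I did in this example. I only used the default integer labels here because I know nothing about your dataset.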
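
With the 100 columns mentioned in the question, writing one line per column would get tedious. One possible alternative, sketched here under the assumption that both files are headerless and whitespace-separated as above, is to stack the two frames and group by the key column, which averages every remaining column at once. The inline strings below are stand-ins for 'file1.txt' and 'file2.txt'.

```python
import io
import pandas as pd

# Inline stand-ins for file1.txt and file2.txt (assumed: whitespace-separated, no header)
file1 = io.StringIO("key.1 10 6\nkey.2 5 6\nkey.3 5 8\nkey.4 5 10\nkey.5 4 12\n")
file2 = io.StringIO("key.1 10 6\nkey.2 6 6\nkey.4 5 10\nkey.5 2 8\n")

df1 = pd.read_csv(file1, sep=r"\s+", header=None)
df2 = pd.read_csv(file2, sep=r"\s+", header=None)

# Stack the rows, then average all value columns per key (column 0).
# A key present in only one file forms a group of one row,
# so its values pass through unchanged.
averaged = pd.concat([df1, df2]).groupby(0).mean().reset_index()

print(averaged)
```

This does not depend on the number of value columns, so it should behave the same for 2 columns or 100.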

Answer 2

You can use itertools.groupby:

import itertools
import re
# Parse each file: drop a trailing '.' from the key and convert the remaining fields to int
file_1 = [[re.sub(r'\.$', '', a), *list(map(int, filter(None, b)))] for a, *b in [re.split(r'\s+', i.strip('\n')) for i in open('filename.txt')]]
file_2 = [[re.sub(r'\.$', '', a), *list(map(int, filter(None, b)))] for a, *b in [re.split(r'\s+', i.strip('\n')) for i in open('filename1.txt')]]
# Remember which keys were written with a trailing '.' so the output can restore it
special_keys = {a for a, *_ in [re.split(r'\s+', i.strip('\n')) for i in open('filename.txt')]+[re.split(r'\s+', i.strip('\n')) for i in open('filename1.txt')] if a.endswith('.')}
# Sort by key so that groupby sees equal keys next to each other, then collect the value rows per key
new_results = [[a, [c for _, *c in b]] for a, b in itertools.groupby(sorted(file_1+file_2, key=lambda x:x[0]), key=lambda x:x[0])]
# Average column-wise within each key and format the line
last_results = [(" "*4).join(["{}"]*3).format(a+'.' if a+'.' in special_keys else a, *[sum(i)/float(len(i)) for i in zip(*b)]) for a, b in new_results]

Output:

['key.1    10.0    6.0', 'key.2    5.5    6.0', 'key.3.    5.0    8.0', 'key.4.    5.0    10.0', 'key.5    3.0    10.0']
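
The rows are sorted before grouping because `itertools.groupby` only merges items that are adjacent; equal keys that are separated by another key end up in separate groups. A tiny illustration with made-up rows:

```python
import itertools

# Hypothetical rows: the two 'key.2' entries are not adjacent
rows = [["key.2", 5], ["key.1", 10], ["key.2", 6]]

def first(row):
    return row[0]

# Without sorting, 'key.2' shows up as two separate groups
unsorted_groups = [k for k, _ in itertools.groupby(rows, key=first)]

# After sorting by the same key, each key appears exactly once
sorted_groups = [k for k, _ in itertools.groupby(sorted(rows, key=first), key=first)]

print(unsorted_groups)
print(sorted_groups)
```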

Answer 3

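Here is a pandas solution. The idea is to define an index for each dataframe and take the symmetric difference of the two indices (`^` in set terms, `Index.symmetric_difference` in pandas) to find the keys that appear in only one file.

Handle each case separately via two pd.concat calls, perform groupby.mean on the common keys, and append the unmatched keys at the end.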

import pandas as pd

# read files into dataframes (the files have no header row)
df1 = pd.read_csv('file1.csv', header=None)
df2 = pd.read_csv('file2.csv', header=None)

# set first column as index
df1 = df1.set_index(0)
df2 = df2.set_index(0)

# calculate symmetric difference of indices
# (the ^ operator no longer performs this set operation on an Index in pandas 2.0+)
x = df1.index.symmetric_difference(df2.index)
# Index(['key.3'], dtype='object', name=0)

# aggregate common and unique indices
df_common = pd.concat((df1[~df1.index.isin(x)], df2[~df2.index.isin(x)]))
df_unique = pd.concat((df1[df1.index.isin(x)], df2[df2.index.isin(x)]))

# calculate mean on common indices; append unique indices
# (pd.concat is used because DataFrame.append was removed in pandas 2.0)
mean = pd.concat((df_common.groupby(df_common.index).mean(), df_unique))\
         .sort_index()\
         .reset_index()

# output to csv
mean.to_csv('out.csv', index=False)

Result

       0     1     2
0  key.1  10.0   6.0
1  key.2   5.5   6.0
2  key.3   5.0   8.0
3  key.4   5.0  10.0
4  key.5   3.0  10.0
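
To see the index set operations in isolation, here is a small sketch using the keys from the question (with the trailing dots removed, which is an assumption about how the keys should be normalised):

```python
import pandas as pd

idx1 = pd.Index(["key.1", "key.2", "key.3", "key.4", "key.5"])
idx2 = pd.Index(["key.1", "key.2", "key.4", "key.5"])

# Keys that appear in exactly one of the two indices
only_one = idx1.symmetric_difference(idx2)

# Keys that appear in both indices
in_both = idx1.intersection(idx2)

print(list(only_one))
print(sorted(in_both))
```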

Answer 4

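One possible solution is to read both files into dictionaries (the key is the key variable, and the value is a list of the elements that follow it). You can then take the keys of each dictionary and see which keys appear in both (if so, average the values) and which are unique to one file (if so, just output that key's values as they are). This may not be the most efficient approach, but if you only have a few hundred columns it should be the simplest.

Look up set intersection and set difference, as they will help you find the common and unique items.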
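
A minimal sketch of this dictionary approach, assuming whitespace-separated lines with the key in the first field. The helper names `parse` and `merge_average` are made up for illustration, and the inline strings stand in for the two files:

```python
def parse(text):
    """Map each key to its list of numeric values."""
    table = {}
    for line in text.splitlines():
        parts = line.split()
        if parts:  # skip blank lines
            table[parts[0]] = [float(v) for v in parts[1:]]
    return table

def merge_average(a, b):
    """Average values for keys in both dicts; keep values for keys in only one."""
    merged = {}
    for key in a.keys() & b.keys():   # set intersection: keys in both files
        merged[key] = [(x + y) / 2 for x, y in zip(a[key], b[key])]
    for key in a.keys() - b.keys():   # set difference: keys only in the first file
        merged[key] = a[key]
    for key in b.keys() - a.keys():   # set difference: keys only in the second file
        merged[key] = b[key]
    return dict(sorted(merged.items()))

file1 = "key.1 10 6\nkey.2 5 6\nkey.3 5 8\nkey.4 5 10\nkey.5 4 12\n"
file2 = "key.1 10 6\nkey.2 6 6\nkey.4 5 10\nkey.5 2 8\n"

result = merge_average(parse(file1), parse(file2))
for key, values in result.items():
    print(key, *values)
```

Because the averaging uses `zip` over the whole value list, the same code should handle any number of columns, as long as both files have the same column count for a given key.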

Average values between files, but keep unmatched values
