Question

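I have a list of email addresses in file 1, and I am trying to find those addresses in file 2. If an email address is in file 2, I want to return the line above it, which is the user's username. For example: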

This is file 1:

test@test.com
bob@test.com
sally@test.com
eve@test.com

This is file 2:

testing
test@test.com
robert
bob@test.com
sally
sally@test.com
eve92
eve@test.com

I would like the output to be:

testing
robert
sally
eve92

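I have been looking into awk but can't seem to figure it out. Any ideas on how best to do this? I am happy to do it in bash, Python, or whatever you think is best. Thanks!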

Answer 1

This is a robust and efficient way to do what you want:

$ awk 'NR==FNR{a[$1];next} FNR%2{prev=$0;next} $1 in a{print prev}' file1 file2
testing
robert
sally
eve92

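It strips leading/trailing whitespace from the email addresses, does a string (not regexp) match on the whole email address in both files, and compares email addresses only against every 2nd line of file2. As long as file2 strictly alternates username and email lines, there is no chance of a false match and no chance of missing a true match.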
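
For readers who prefer Python, the same idea can be sketched roughly as follows. This is only an illustration, not the answer's own code: the function name `usernames_for` is hypothetical, the sample data is inlined instead of read from files, and it assumes file 2 strictly alternates username line, email line.

```python
def usernames_for(wanted_lines, pair_lines):
    """Return usernames whose following email line appears in wanted_lines.

    Assumes pair_lines strictly alternates: username, email, username, email...
    """
    # Use the first whitespace-separated field of each non-empty line,
    # which is roughly what awk's $1 does.
    wanted = {line.split()[0] for line in wanted_lines if line.strip()}
    found = []
    # Walk file 2 two lines at a time: (username, email).
    for username, email in zip(pair_lines[0::2], pair_lines[1::2]):
        fields = email.split()
        if fields and fields[0] in wanted:
            found.append(username.rstrip("\n"))
    return found


# Sample data from the question, inlined so the sketch runs on its own.
file1 = ["test@test.com\n", "bob@test.com\n", "sally@test.com\n", "eve@test.com\n"]
file2 = ["testing\n", "test@test.com\n", "robert\n", "bob@test.com\n",
         "sally\n", "sally@test.com\n", "eve92\n", "eve@test.com\n"]
print("\n".join(usernames_for(file1, file2)))
```

Using a set for the wanted addresses keeps each lookup constant-time on average, which mirrors the `$1 in a` test in the awk one-liner.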

Answer 2

Assuming the lists are unordered and file2 is not too large, building a dictionary from the second file seems like a good choice:

users = {}
with open("file2") as file2:
    username = ""
    for line in file2:
        line = line.strip()
        if "@" in line:
            users[line] = username  # email -> the line just above it
            username = ""
        else:
            username = line  # remember the latest non-email line
print(users)
result = []
with open("file1") as file1:
    for line in file1:
        email = line.strip()
        if email in users:  # skip addresses that file2 does not contain
            result.append(users[email])

result will contain the list of usernames, built in O(m+n) time and O(m) space (for the file2 dictionary).

Answer 3

This should work: grep -B1 -F -f file1 file2

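-B1: print 1 line of context before each match (GNU grep)
-F: match fixed strings rather than regular expressions
-f: load the patterns from a file, in this case file1
file2: the file to run grep on and to take the preceding line (-B1) from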

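Update:
After some testing, this solution has a bug: grep returns two lines per match, the line that matches the pattern (the email)
and the line before it (the username).

Because of the -B1 option, the line before the match is printed first.

A simple way to keep only the first line (the username) and drop the matching line (the second one) is: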

grep -B1 -F -f file1 file2 |grep -v "@"

This assumes usernames do not contain "@".
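
One thing that may be worth checking with this approach: `grep -F` matches a fixed string anywhere in a line, so a pattern can also match a longer address that contains it, while the `-x` option restricts matches to whole lines. The sketch below is a simplified Python model of those two behaviours, not a reimplementation of grep (it ignores grep's output formatting). The function name `previous_lines` and the extra address `bigbob@test.com` are made up for illustration.

```python
def previous_lines(lines, patterns, whole_line):
    """Return the line before each matching line.

    whole_line=False models substring matching (like grep -F);
    whole_line=True models exact whole-line matching (like grep -F -x).
    """
    result = []
    for index, line in enumerate(lines):
        if whole_line:
            matched = line in patterns
        else:
            matched = any(pattern in line for pattern in patterns)
        # A match on the very first line has no previous line to report.
        if matched and index > 0:
            result.append(lines[index - 1])
    return result


# Hypothetical data: "bob@test.com" is a substring of "bigbob@test.com".
lines = ["robert", "bob@test.com", "bigbob", "bigbob@test.com"]
patterns = ["bob@test.com"]
print(previous_lines(lines, patterns, whole_line=False))  # substring matching
print(previous_lines(lines, patterns, whole_line=True))   # whole-line matching
```

If substring matches are a concern for your data, adding `-x` to the first grep in the pipeline should give the whole-line behaviour.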

Answer 4

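First, build an array from file2 that maps each line to the line before it. Every command in this first step ends with next, so nothing else runs while file2 is being read. Then parse file1 (where NR is greater than FNR) and look each line up in the array.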

awk 'NR==FNR{a[$1]=x;x=$0;next} $1 in a{print a[$1]}' file2 file1
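
A rough Python sketch of the same two steps may make the one-liner easier to follow. The names `build_previous_map` and `lookup` are hypothetical and the sample data is inlined; like the awk version, it keys every line of file 2 (usernames included) by its first field.

```python
def first_field(line):
    """Return the first whitespace-separated field, or "" for a blank line."""
    fields = line.split()
    return fields[0] if fields else ""


def build_previous_map(lines):
    """Map the first field of each line to the full line that preceded it."""
    previous_by_key = {}
    previous = ""  # nothing precedes the first line
    for line in lines:
        previous_by_key[first_field(line)] = previous
        previous = line
    return previous_by_key


def lookup(wanted_lines, previous_by_key):
    """Return the stored previous line for each wanted line that has one."""
    return [previous_by_key[first_field(line)]
            for line in wanted_lines
            if first_field(line) in previous_by_key]


file2 = ["testing", "test@test.com", "robert", "bob@test.com",
         "sally", "sally@test.com", "eve92", "eve@test.com"]
file1 = ["test@test.com", "bob@test.com", "sally@test.com", "eve@test.com"]
print("\n".join(lookup(file1, build_previous_map(file2))))
```

Two consequences of this design: the output follows the order of file 1 rather than file 2, and a username that happens to appear in file 1 would also match, returning the email line above it.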

Answer 5

@Neemaximo: try:

awk 'FNR==NR{if($0 !~ /\.com/){A[++i]=$0;next};B[$0];next} ($0 in B){print A[++q]}' file2 file1
OR
awk 'FNR==NR{if($0 !~ /\.com/){A[++i]=$0;next};B[$0]=A[i];next} ($0 in B){print B[$0]}' file2 file1

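This assumes your input files look exactly like the samples shown. The condition FNR==NR is TRUE only while the first file, file2, is being read. For each of its lines, the code checks whether the line does not contain .com; if so, it stores the line in array A at index i (the username) and next skips all remaining statements. Otherwise the line is an email address, so it becomes an index of array B (in the second command, with the value A[i], the most recent username), and next again skips the rest. When the second file, file1, is read, the code checks whether the current line is an index of array B and, if so, prints the username: the next entry of A in the first command, or the value stored in B in the second.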

Answer 6

If the emails are unique (which seems a safe bet), you can build an array from file 2 and then index it with file 1:

$ awk 'NR==FNR{getline l; arr[l]=$1; next} $1 in arr {print arr[$1]}' f2 f1
testing
robert
sally
eve92

In Python, you can do this:

with open("f2") as f2:
    keys=[(next(f2).strip(), k.strip()) for k in f2]

with open("f1") as f1:
    emails=[e.strip() for e in f1]

for e in emails:
    for t in keys:
        if t[0]==e:
            print(t[1])

This supports duplicate entries. If you know your email addresses are unique, this is more efficient:

with open("f2") as f2:
    keys={next(f2).strip(): k.strip() for k in f2}      

with open("f1") as f1:
    for e in f1:
        e=e.strip()
        print(keys.get(e, "{} not found".format(e)))

This is essentially the same as the awk program.

Match strings across 2 files and, on a match, return the previous line from file 2
