I have file1.txt with the following contents:
rs002
rs113
rs209
rs227
rs151
rs104
I also have file2.txt with the following contents:
rs113 113
rs002 002
rs227 227
rs209 209
rs104 104
rs151 151
I want to extract the records in file2.txt that match the lines of file1.txt, so I tried this:
grep -Fwf file1.txt file2.txt
which gives the following output:
rs113 113
rs002 002
rs227 227
rs209 209
rs104 104
rs151 151
This extracts all the matching lines, but in the order in which they appear in file2.txt. Is there a way to extract the matching records while keeping the order of file1.txt? The desired output is:
rs002 002
rs113 113
rs209 209
rs227 227
rs151 151
rs104 104
Answer 0 (score: 2)
A (not very elegant) solution is to loop over file1.txt and look for a match for each of its lines:
while IFS= read -r line; do
grep -wF "$line" file2.txt
done < file1.txt
This gives the output:
rs002 002
rs113 113
rs209 209
rs227 227
rs151 151
rs104 104
If you know that each line matches at most once, you can speed this up by telling grep to stop after the first match:
grep -m 1 -wF "$line" file2.txt
As far as I know, the -m option is a GNU extension.
Note that looping over one file and processing another file inside each iteration is usually a sign that there is a much more efficient way to do things, so this should only be used for files small enough that coming up with a better solution would take longer than simply processing them like this.
Answer 1 (score: 2)
This is too complicated for grep. If file2.txt is not huge, i.e. it fits into memory, you should use awk:
awk 'FNR==NR { f2[$1] = $2; next } $1 in f2 { print $1, f2[$1] }' file2.txt file1.txt
Output:
rs002 002
rs113 113
rs209 209
rs227 227
rs151 151
rs104 104
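Roughly speaking, the program works in two passes: while reading file2.txt it builds a lookup table keyed on the ID, and while reading file1.txt it prints the stored value for every ID it finds, which is why the output follows the order of file1.txt. Spread over several lines with comments, the same one-liner looks like this:
awk '
  # First file (file2.txt): FNR==NR only holds while reading it.
  # Remember the second column, keyed by the first column.
  FNR == NR { f2[$1] = $2; next }
  # Second file (file1.txt): for every ID that was seen in file2.txt,
  # print the ID and its stored value, in file1.txt order.
  $1 in f2  { print $1, f2[$1] }
' file2.txt file1.txt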
Answer 2 (score: 0)
Create a sed command file from file2, then apply it to file1:
sed 's#^\([^ ]*\)\(.*\)#/\1/ s/$/\2/#' file2 > tmp.sed
sed -f tmp.sed file1
The two commands can be combined, avoiding the temporary file:
sed -f <(sed 's#^\([^ ]*\)\(.*\)#/\1/ s/$/\2/#' file2) file1
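For the sample file2 above, the inner sed would generate a command file along these lines (one substitution per record; each command appends the value to any file1 line matching that ID):
/rs113/ s/$/ 113/
/rs002/ s/$/ 002/
/rs227/ s/$/ 227/
/rs209/ s/$/ 209/
/rs104/ s/$/ 104/
/rs151/ s/$/ 151/
Note that the addresses are regular expressions, so rs113 would also match a hypothetical longer ID such as rs1130; with the sample data this is not an issue.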
Answer 3 (score: -1)
This should help (but it will not be optimal for large input):