I have file1.txt with the following contents:
rs002
rs113
rs209
rs227
rs151
rs104
I also have file2.txt with the following contents:
rs113 113
rs002 002
rs227 227
rs209 209
rs104 104
rs151 151
I want to extract the records in file2.txt that match the lines of file1.txt, so I tried this:
grep -Fwf file1.txt file2.txt
which gives the following output:
rs113 113
rs002 002
rs227 227
rs209 209
rs104 104
rs151 151
This extracts all the matching lines, but in the order in which they appear in file2.txt. Is there a way to extract the matching records while keeping the order of file1.txt? The desired output is:
rs002 002
rs113 113
rs209 209
rs227 227
rs151 151
rs104 104
Answer 0 (score: 2)
A (not very elegant) solution is to loop over file1.txt and look for a match for each of its lines:
while IFS= read -r line; do
grep -wF "$line" file2.txt
done < file1.txt
This gives the output:
rs002 002
rs113 113
rs209 209
rs227 227
rs151 151
rs104 104
If you know that each line matches at most once, you can speed this up by telling grep to stop after the first match:
grep -m 1 -wF "$line" file2.txt
As far as I know, the -m option is a GNU extension.
Note that looping over one file and processing another file inside each iteration is usually a sign that there is a much more efficient way to do things, so this should only be used for files small enough that coming up with a better solution would take longer than simply processing them like this.
Answer 1 (score: 2)
This is too complicated for grep. If file2.txt is not huge, i.e. it fits into memory, you should use awk:
awk 'FNR==NR { f2[$1] = $2; next } $1 in f2 { print $1, f2[$1] }' file2.txt file1.txt
Output:
rs002 002
rs113 113
rs209 209
rs227 227
rs151 151
rs104 104
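Roughly speaking, the program works in two passes: while reading file2.txt it builds a lookup table keyed on the ID, and while reading file1.txt it prints the stored value for every ID it finds, which is why the output follows the order of file1.txt. Spread over several lines with comments, the same one-liner looks like this:
awk '
  # First file (file2.txt): FNR==NR only holds while reading it.
  # Remember the second column, keyed by the first column.
  FNR == NR { f2[$1] = $2; next }
  # Second file (file1.txt): for every ID that was seen in file2.txt,
  # print the ID and its stored value, in file1.txt order.
  $1 in f2  { print $1, f2[$1] }
' file2.txt file1.txt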
Answer 2 (score: 0)
Create a sed command file from file2, then apply it to file1:
sed 's#^\([^ ]*\)\(.*\)#/\1/ s/$/\2/#' file2 > tmp.sed
sed -f tmp.sed file1
The two commands can be combined, avoiding the temporary file:
sed -f <(sed 's#^\([^ ]*\)\(.*\)#/\1/ s/$/\2/#' file2) file1
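For the sample file2 above, the inner sed would generate a command file along these lines (one substitution per record; each command appends the value to any file1 line matching that ID):
/rs113/ s/$/ 113/
/rs002/ s/$/ 002/
/rs227/ s/$/ 227/
/rs209/ s/$/ 209/
/rs104/ s/$/ 104/
/rs151/ s/$/ 151/
Note that the addresses are regular expressions, so rs113 would also match a hypothetical longer ID such as rs1130; with the sample data this is not an issue.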
Answer 3 (score: -1)
This should help (but it will not be optimal for large input):