我正在尝试从管道定界文件中提取所有(仅)重复值。
我的数据文件有80万行,其中包含多列,因此我对第3列特别感兴趣。因此,我需要获取第3列的重复值,并从该文件中提取所有重复的行。
我可以实现此目标,如下所示。
cat Report.txt | awk -F'|' '{print $3}' | sort | uniq -d >dup.txt
而我将以上内容视为循环,如下所示。
while read dup
do
grep "$dup" Report.txt >>only_dup.txt
done <dup.txt
我也尝试过awk方法
while read dup
do
awk -v a=$dup '$3 == a { print $0 }' Report.txt>>only_dup.txt
done <dup.txt
但是,由于文件中包含大量记录,因此需要很长时间才能完成。因此,我正在寻找一种简便快捷的选择。
例如,我有这样的数据:
1|learning|Unix|Business|Requirements
2|learning|Unix|Business|Team
3|learning|Linux|Business|Requirements
4|learning|Unix|Business|Team
5|learning|Linux|Business|Requirements
6|learning|Unix|Business|Team
7|learning|Windows|Business|Requirements
8|learning|Mac|Business|Requirements
我的预期输出不包含唯一记录:
1|learning|Unix|Business|Requirements
2|learning|Unix|Business|Team
4|learning|Unix|Business|Team
6|learning|Unix|Business|Team
3|learning|Linux|Business|Requirements
5|learning|Linux|Business|Requirements
答案 0 :(得分:3)
这可能是您想要的:
$ awk -F'|' 'NR==FNR{cnt[$3]++; next} cnt[$3]>1' file file
1|learning|Unix|Business|Requirements
2|learning|Unix|Business|Team
3|learning|Linux|Business|Requirements
4|learning|Unix|Business|Team
5|learning|Linux|Business|Requirements
6|learning|Unix|Business|Team
或者如果文件太大而无法容纳所有键($ 3值)(仅使用800,000行中的唯一$ 3值就不会有问题):
$ cat tst.awk
BEGIN { FS="|" }
{ currKey = $3 }
currKey == prevKey {
if ( !prevPrinted++ ) {
print prevRec
}
print
next
}
{
prevKey = currKey
prevRec = $0
prevPrinted = 0
}
$ sort -t'|' -k3,3 file | awk -f tst.awk
3|learning|Linux|Business|Requirements
5|learning|Linux|Business|Requirements
1|learning|Unix|Business|Requirements
2|learning|Unix|Business|Team
4|learning|Unix|Business|Team
6|learning|Unix|Business|Team
答案 1 :(得分:1)
EDIT2: 根据Ed先生的建议,我的建议使用更有意义的数组名称(IMO)进行了微调。
awk '
match($0,/[^\|]*\|/){
val=substr($0,RSTART+RLENGTH)
if(!unique_check_count[val]++){
numbered_indexed_array[++count]=val
}
actual_valued_array[val]=(actual_valued_array[val]?actual_valued_array[val] ORS:"")$0
line_count_array[val]++
}
END{
for(i=1;i<=count;i++){
if(line_count_array[numbered_indexed_array[i]]>1){
print actual_valued_array[numbered_indexed_array[i]]
}
}
}
' Input_file
埃德·莫顿(Ed Morton)编辑:FWIW这是我在上面的代码中命名变量的方式:
awk '
match($0,/[^\|]*\|/) {
key = substr($0,RSTART+RLENGTH)
if ( !numRecs[key]++ ) {
keys[++numKeys] = key
}
key2recs[key] = (key in key2recs ? key2recs[key] ORS : "") $0
}
END {
for ( keyNr=1; keyNr<=numKeys; keyNr++ ) {
key = keys[keyNr]
if ( numRecs[key]>1 ) {
print key2recs[key]
}
}
}
' Input_file
编辑: :由于OP用|
分隔了Input_file,因此将代码更改如下,从而处理了新的Input_file(感谢Ed Morton先生指出来)。
awk '
match($0,/[^\|]*\|/){
val=substr($0,RSTART+RLENGTH)
if(!a[val]++){
b[++count]=val
}
c[val]=(c[val]?c[val] ORS:"")$0
d[val]++
}
END{
for(i=1;i<=count;i++){
if(d[b[i]]>1){
print c[b[i]]
}
}
}
' Input_file
请尝试以下操作,以下操作将按输入文件中出现行的相同顺序给出输出。
awk '
match($0,/[^ ]* /){
val=substr($0,RSTART+RLENGTH)
if(!a[val]++){
b[++count]=val
}
c[val]=(c[val]?c[val] ORS:"")$0
d[val]++
}
END{
for(i=1;i<=count;i++){
if(d[b[i]]>1){
print c[b[i]]
}
}
}
' Input_file
输出如下。
2 learning Unix Business Team
4 learning Unix Business Team
6 learning Unix Business Team
3 learning Linux Business Requirements
5 learning Linux Business Requirements
上述代码的说明:
awk ' ##Starting awk program here.
match($0,/[^ ]* /){ ##Using match function of awk which matches regex till first space is coming.
val=substr($0,RSTART+RLENGTH) ##Creating variable val whose value is sub-string is from starting point of RSTART+RLENGTH value to till end of line.
if(!a[val]++){ ##Checking condition if value of array a with index val is NULL then go further and increase its index too.
b[++count]=val ##Creating array b whose index is increment value of variable count and value is val variable.
} ##Closing BLOCK for if condition of array a here.
c[val]=(c[val]?c[val] ORS:"")$0 ##Creating array named c whose index is variable val and value is $0 along with keep concatenating its own value each time it comes here.
d[val]++ ##Creating array named d whose index is variable val and its value is keep increasing with 1 each time cursor comes here.
} ##Closing BLOCK for match here.
END{ ##Starting END BLOCK section for this awk program here.
for(i=1;i<=count;i++){ ##Starting for loop from i=1 to till value of count here.
if(d[b[i]]>1){ ##Checking if value of array d with index b[i] is greater than 1 then go inside block.
print c[b[i]] ##Printing value of array c whose index is b[i].
}
}
}
' Input_file ##Mentioning Input_file name here.
答案 2 :(得分:1)
另一个awk:
describe('Create a new entry', () => {
it('Should return a success: new entry created', (done) => {
const testEntry1= {
title: 'My title',
description: 'My description'
};
process.env.TOKEN ="the token";
chai.request(app).post('/api/v1/entry')
.set(process.env.TOKEN)
.send(testEntry1)
.end((err, res) => {
expect(res).to.have.status(201);
expect(res.body.data).to.be.a('object');
});
done();
});
});
示例数据的输出:
$ awk -F\| '{ # set delimiter
n=$1 # store number
sub(/^[^|]*/,"",$0) # remove number from string
if($0 in a) { # if $0 in a
if(a[$0]==1) # if $0 seen the second time
print b[$0] $0 # print first instance
print n $0 # also print current
}
a[$0]++ # increase match count for $0
b[$0]=n # number stored to b and only needed once
}' file
还可以吗?
2|learning|Unix|Business|Team
4|learning|Unix|Business|Team
3|learning|Linux|Business|Requirements
5|learning|Linux|Business|Requirements
6|learning|Unix|Business|Team
或不,因为分隔符从空间更改为管道。$ sort -k 2 file | uniq -D -f 1
或smth。
答案 3 :(得分:0)
两步改进。
第一步:
之后
awk -F'|' '{print $3}' Report.txt | sort | uniq -d >dup.txt
# or
cut -d "|" -f3 < Report.txt | sort | uniq -d >dup.txt
您可以使用
grep -f <(sed 's/.*/^.*|.*|&|.*|/' dup.txt) Report.txt
# or without process substitution
sed 's/.*/^.*|.*|&|.*|/' dup.txt > dup.sed
grep -f dup.sed Report.txt
第二步:
按照其他更好的答案中的方法使用awk
。