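How to remove dupes from blocks of text

Time: 2016-02-25 04:38:22

Tags: ruby perl awk sed

What is a clever and simple way to remove dupes within blocks in a text file? Each block is separated by two newlines (that is, by a blank line).

In: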

apple
banana
apple
cherry
cherry

delta
epsilon
delta
epsilon

apple pie
delta
delta

Out:

apple
banana
cherry

delta
epsilon

apple pie
delta

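Thanks. It should work on a Mac. Unicode is allowed. Any shell method/language/command is fine. Dupes are not necessarily consecutive. Bonus if leading/trailing whitespace is ignored, or if a comma can be used as a delimiter within a record.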

4 answers:

Answer 0 (score: 4)

$ awk '!NF{delete seen} !seen[$0]++' file
apple
banana
cherry

delta
epsilon

apple pie
delta

Ignoring (as opposed to removing) leading/trailing whitespace, using GNU awk for gensub(), would be:

$ awk '!NF{delete seen} !seen[gensub(/^\s+|\s+$/,"","g")]++' file

I don't know what you mean by "can use a comma as the delimiter within a record" in this context.
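For readers who prefer to see the same logic spelled out, here is a minimal Ruby sketch of what the awk one-liners do: keep a table of lines seen so far, reset it at every blank line, and print a line only the first time it appears. The method name `dedupe_blocks` and the sample text are hypothetical, not part of the answer. As in the gensub() variant, whitespace is ignored for comparison only; the kept line is printed as written.

```ruby
require 'set'

# Hypothetical helper: drop repeated lines within each blank-line-separated
# block, comparing lines with leading/trailing whitespace ignored.
def dedupe_blocks(text)
  seen = Set.new
  out = []
  text.each_line do |line|
    key = line.strip
    if key.empty?
      seen.clear          # blank line: a new block starts, forget what we saw
      out << line
    elsif seen.add?(key)  # add? returns nil when the key is already present
      out << line         # first occurrence: keep the line as written
    end
  end
  out.join
end

sample = "apple\n  apple \nbanana\n\napple\nbanana\nbanana\n"
print dedupe_blocks(sample)
```

Because the table is cleared at each blank line, "apple" is kept again in the second block even though it already appeared in the first.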

Answer 1 (score: 0)

RUBY!

text =<<_
apple
banana
apple
cherry
cherry

delta
epsilon
delta
epsilon

apple pie
delta
delta
_

r1 = /
     (?<=\n) # match a newline in a positive lookbehind
     \n      # match a newline
     /x      # extended/free-spacing regex definition mode

r2 = /
     (?<=\n) # match a newline in a positive lookbehind
     /x

puts text.split(r1).map { |s| s.split(r2).uniq.join }.join("\n")
  # apple
  # banana
  # cherry

  # delta
  # epsilon

  # apple pie
  # delta

The steps:

a = text.split(r1)
  #=> ["apple\nbanana\napple\ncherry\ncherry\n",
  #    "delta\nepsilon\ndelta\nepsilon\n",
  #    "apple pie\ndelta\ndelta\n"] 
a.map { |s| s.split(r2) }
  #=> [["apple\n", "banana\n", "apple\n", "cherry\n", "cherry\n"],
  #    ["delta\n", "epsilon\n", "delta\n", "epsilon\n"],
  #    ["apple pie\n", "delta\n", "delta\n"]] 
a.map { |s| s.split(r2).uniq }
  #=> [["apple\n", "banana\n", "cherry\n"],
  #    ["delta\n", "epsilon\n"],
  #    ["apple pie\n", "delta\n"]] 
b = a.map { |s| s.split(r2).uniq.join }
  #=> ["apple\nbanana\ncherry\n",
  #    "delta\nepsilon\n",
  #    "apple pie\ndelta\n"] 
b.join("\n")
  #=> "apple\nbanana\ncherry\n\ndelta\nepsilon\n\napple pie\ndelta\n" 

Answer 2 (score: 0)

This might work for you (GNU sed):

sed -r ':a;N;s/\b((\S+)\b.*)\n\2$/\1/;/^$/M!ba' file

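Collect lines in the pattern space (PS) until a blank line or the end of the file. Pattern-match the last line read against the previous lines and, if it matches, remove that last line. If the last line is an empty line (or the end of the file), print all the lines retained in the PS.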
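The buffering idea described above can be sketched in Ruby as follows. This is an illustration of the approach, not a translation of the sed regex: the method name `dedupe_buffered` is hypothetical, an array stands in for the pattern space, and the sketch compares whole lines, which is an assumption about the intended behavior (so "apple pie" and "apple" count as different lines here).

```ruby
# Hypothetical helper mirroring the buffering described above: lines are
# collected in a buffer (standing in for the pattern space) and the buffer is
# flushed at a blank line or at end of input.
def dedupe_buffered(lines)
  output = []
  buffer = []
  lines.each do |line|
    line = line.chomp
    if line.empty?
      output.concat(buffer) << ""   # flush the block, keep the separator
      buffer = []
    elsif !buffer.include?(line)    # whole-line comparison (an assumption)
      buffer << line
    end
  end
  output.concat(buffer)             # end of input: flush what is left
end

puts dedupe_buffered("delta\nepsilon\ndelta\n\napple pie\napple\n".lines)
```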

Answer 3 (score: 0)

Given:

$ cat file
apple
banana
apple
cherry
cherry

delta
epsilon
delta
epsilon

apple pie
delta
delta

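You can use Ruby's paragraph-mode command-line switch to treat blank lines as the separator between records, and set the field separator to \n so that each line is a field. Then uniq each block: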

$ ruby -00 -F'\n' -lane '$><<$F.uniq.join("\n")<<"\n\n"' file
apple
banana
cherry

delta
epsilon

apple pie
delta

Explanation:

$ ruby -00 -F'\n' -lane '$><<$F.uniq.join("\n")<<"\n\n"'
   ^                                                      # ruby 1.9+ only I think
        ^                                                 # split records by \n\n
            ^                                             # split fields by \n
                   ^                                      # options to:
                                                            -l chomp the record separator from each record
                                                             a auto split into $F
                                                             n loop over input, don't auto print
                                                             e run the script given on the command line
                         ^                                # to STDOUT
                           ^                              # append
                             ^                            # the split fields
                                 ^                        # made uniq
                                     ^                    # join back to a string
                                          ^               # add back the record separator   

Alternatively, you can use a Ruby hash to count the fields and then print only the hash's keys:

$ ruby -00 -F'\n' -lane 'h=Hash.new(0)
                         $F.each {|f| h[f]+=1 }
                         p h
                         puts h.keys.join("\n")<<"\n\n"
                         ' file
{"apple"=>2, "banana"=>1, "cherry"=>2}
apple
banana
cherry

{"delta"=>2, "epsilon"=>2}
delta
epsilon

{"apple pie"=>1, "delta"=>2}
apple pie
delta 

(In Ruby 1.9+, hashes maintain insertion order, so this prints the words in file order.)
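A small standalone check of that ordering property, using made-up words rather than the file above: the keys come back in the order each word was first inserted, not sorted.

```ruby
# Count words from one hypothetical block; Hash keys are returned in
# first-insertion order (Ruby 1.9+).
counts = Hash.new(0)
%w[delta epsilon delta apple].each { |word| counts[word] += 1 }

p counts.keys   # first-seen order, not alphabetical
```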

Then, if you want to add , as a potential field separator, you can do the following:

$ ruby -00 -F'\n|,' -lane '$><<$F.uniq.join("\n")<<"\n\n"' file
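As a rough sketch of how the comma separator and the whitespace bonus from the question could be combined in a script, the following splits each block on newlines or commas and strips surrounding whitespace before deduplicating. The method name `dedupe_fields` and the sample input are hypothetical. Unlike the one-liner above, this version removes the surrounding whitespace from the output instead of only ignoring it during comparison.

```ruby
# Hypothetical helper: treat newlines and commas as field separators inside a
# block, strip surrounding whitespace, and keep the first occurrence of each.
def dedupe_fields(text)
  text.split(/\n{2,}/).map { |block|
    block.split(/[\n,]/).map(&:strip).reject(&:empty?).uniq.join("\n")
  }.join("\n\n")
end

puts dedupe_fields("apple, banana\n apple\n\ndelta,delta\nepsilon\n")
```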