I have a four-column CSV file that uses @ as the separator, for example:
0001 @ fish @ animal @ eats worms
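The first column is the only column guaranteed to be unique.
I need to perform four sort operations on columns 2, 3, and 4.
First, column 2 is sorted alphanumerically. The important property of this sort is that any duplicate entries in column 2 must end up adjacent to one another, for example: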
@ a @ @
@ a @ @
@ a @ @
@ a @ @
@ a @ @
@ b @ @
@ b @ @
@ c @ @
@ c @ @
@ c @ @
@ c @ @
@ c @ @
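Next, within the first sort, divide the rows into two categories. The first category holds the rows that do not contain "arch.", "var.", "ver.", "anci." or "fam." anywhere in column 4. The second category (sorted after the first) holds the rows that do contain one of those words, for example: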
@ a @ @ Does not have one of those words.
@ a @ @ Does not have one of those words.
@ a @ @ Does not have one of those words.
@ a @ @ Does not have one of those words.
@ a @ @ This sentence contains arch.
@ b @ @ Does not have one of those words.
@ b @ @ Has the word ver.
@ c @ @ Does not have one of those words.
@ c @ @ Does not have one of those words.
@ c @ @ Does not have one of those words.
@ c @ @ This sentence contains var.
@ c @ @ This sentence contains fam.
@ c @ @ This sentence contains fam.
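The two-category split can be expressed as a boolean appended to the sort key. Below is a minimal Python sketch, assuming that "anywhere in column 4" means plain substring matching. The names KEYWORDS, has_keyword and two_level_key are illustrative and not part of the question.

```python
KEYWORDS = ("arch.", "var.", "ver.", "anci.", "fam.")

def has_keyword(text):
    """Return True if any keyword occurs as a substring of text."""
    return any(word in text for word in KEYWORDS)

def two_level_key(line):
    # Split on at most three separators to get exactly four fields.
    fields = [f.strip() for f in line.split("@", 3)]
    # False sorts before True, so keyword-free rows come first in each group.
    return (fields[1], has_keyword(fields[3]))

rows = [
    "0001 @ b @ x @ Has the word ver.",
    "0002 @ a @ x @ This sentence contains arch.",
    "0003 @ b @ x @ Does not have one of those words.",
    "0004 @ a @ x @ Does not have one of those words.",
]
ordered = sorted(rows, key=two_level_key)
```

Because the flag is the second element of the key, it only reorders rows that already share the same column-2 value.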
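Finally, sorting only within the separate categories of the second sort, order the rows from "containing the most duplicate entries in column 3" to "containing the fewest duplicate entries in column 3", for example: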
@ a @ fish @ Does not have one of those words.
@ a @ fish @ Does not have one of those words.
@ a @ fish @ Does not have one of those words.
@ a @ tiger @ Does not have one of those words.
@ a @ bear @ This sentence contains arch.
@ b @ fish @ Does not have one of those words.
@ b @ fish @ Has the word ver.
@ c @ bear @ Does not have one of those words.
@ c @ bear @ Does not have one of those words.
@ c @ fish @ Does not have one of those words.
@ c @ tiger @ This sentence contains var.
@ c @ tiger @ This sentence contains fam.
@ c @ bear @ This sentence contains fam.
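How can I sort the file by column 2 alphabetically, then by the occurrence of certain keywords in column 4, and then from most-common to least-common duplicates in column 3?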
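One way to read the three requirements is as a single composite sort key. The Python sketch below is only an illustration under two assumptions: duplicates in column 3 are counted within each (column 2, category) group, as the example above suggests, and ties between equally frequent column-3 values are broken by the value itself. The function name sort_rows is hypothetical.

```python
from collections import Counter

KEYWORDS = ("arch.", "var.", "ver.", "anci.", "fam.")

def sort_rows(lines):
    """Order lines by column 2, keyword flag, then column-3 frequency."""
    rows = [[f.strip() for f in line.split("@", 3)]
            for line in lines if line.strip()]

    def flag(row):
        return any(word in row[3] for word in KEYWORDS)

    # How often each column-3 value occurs within its (column 2, flag) group.
    counts = Counter((row[1], flag(row), row[2]) for row in rows)

    def key(row):
        group = (row[1], flag(row))
        # Negate the count so the most frequent values sort first; the final
        # row[2] is an assumed tie-break that keeps equal values together.
        return group + (-counts[group + (row[2],)], row[2])

    return [" @ ".join(row) for row in sorted(rows, key=key)]
```

Since Python's sort is stable, rows that compare equal on the whole key keep their input order.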
Answer 0 (score: 3)
TXR (http://www.nongnu.org/txr):
@(bind special-words ("arch." "var." "ver." "anci." "fam."))
@(bind ahash @(hash :equal-based))
@(repeat)
@id @@ @alpha @@ @animal @@ @words
@ (rebind words @(split-str words " "))
@ (bind record (id alpha animal words))
@ (do (push record [ahash alpha]))
@(end)
@(bind sorted-rec-groups nil)
@(do
   (defun popularity-sort (recs)
     (let ((histogram [group-reduce (hash)
                                    third (do inc @1)
                                    recs 0]))
       [sort recs > [chain third histogram]]))
   (dohash (key records ahash)
     (let (contains does-not combined)
       (each* ((r records)
               (w [mapcar fourth r]))
         (if (isec w special-words)
           (push r contains)
           (push r does-not)))
       (push (append (popularity-sort does-not)
                     (popularity-sort contains))
             sorted-rec-groups)))
   (set sorted-rec-groups [sort sorted-rec-groups :
                           [chain first second]]))
@(output)
@ (repeat)
@ (repeat)
@(rep)@{sorted-rec-groups} @@ @(last)@{sorted-rec-groups " "}@(end)
@ (end)
@ (end)
@(end)
Data:
0001 @ b @ fish @ Does not have one of those words.
0002 @ a @ bear @ Does not have one of those words.
0003 @ b @ bear @ Has the word ver.
0004 @ a @ fish @ Does not have one of those words.
0005 @ c @ bear @ Does not have one of those words.
0006 @ c @ bear @ Does not have one of those words.
0007 @ a @ fish @ Does not have one of those words.
0008 @ c @ fish @ Does not have one of those words.
0009 @ a @ fish @ Does not have one of those words.
0010 @ c @ tiger @ This sentence contains var.
0011 @ c @ bear @ This sentence contains fam.
0012 @ a @ fish @ Does not have one of those words.
0013 @ c @ tiger @ This sentence contains fam.
Run:
$ txr sort.txr data.txt
0004 @ a @ fish @ Does not have one of those words.
0007 @ a @ fish @ Does not have one of those words.
0009 @ a @ fish @ Does not have one of those words.
0012 @ a @ fish @ Does not have one of those words.
0002 @ a @ bear @ Does not have one of those words.
0001 @ b @ fish @ Does not have one of those words.
0003 @ b @ bear @ Has the word ver.
0005 @ c @ bear @ Does not have one of those words.
0006 @ c @ bear @ Does not have one of those words.
0008 @ c @ fish @ Does not have one of those words.
0010 @ c @ tiger @ This sentence contains var.
0013 @ c @ tiger @ This sentence contains fam.
0011 @ c @ bear @ This sentence contains fam.
Answer 1 (score: 2)
Here's an answer to the first part of the question to get you started:
sort data -t "@" -k 2,2 -k 3,4
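How it works: -t "@" sets the field separator to @; -k 2,2 sorts on field 2 alone; -k 3,4 then breaks ties using the text from the start of field 3 to the end of field 4.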
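For comparison, roughly the same two-key ordering can be sketched in Python. This is an approximation under stated assumptions: comparison is by code point (similar to running sort in the C locale), and no field contains an @. Unlike sort, which falls back to comparing whole lines when all keys tie, Python's sorted keeps tied rows in input order. The name sort_key is illustrative.

```python
def sort_key(line):
    fields = line.rstrip("\n").split("@")
    # First key: field 2 only (like -k 2,2).
    # Second key: start of field 3 through end of field 4 (like -k 3,4).
    return (fields[1], "@".join(fields[2:4]))

lines = [
    "0001 @ b @ fish @ text",
    "0002 @ a @ fish @ text",
    "0003 @ b @ bear @ text",
    "0004 @ a @ bear @ text",
]
ordered = sorted(lines, key=sort_key)
```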
Answer 2 (score: 1)
Here's a solution in Ruby.
#!/usr/bin/env ruby

class Row
  SEPARATOR = " @ "
  attr_accessor :cols

  def initialize(text)
    @cols = text.chomp.split(SEPARATOR)
    @cols.size == 4 or raise "Expected text to have four columns: #{text}"
    duplicate_increment
  end

  def has_words?
    cols[3] =~ /arch\.|var\.|ver\.|anci\.|fam\./ ? true : false
  end

  def to_s
    SEPARATOR +
      @cols[1,3].join(SEPARATOR) +
      " -- id:#{cols[0]} duplicates:#{duplicate_count}"
  end

  ### Comparison
  def <=>(other)
    other or raise "Expected other to exist"
    cmp = self.cols[1] <=> other.cols[1]
    return cmp if cmp != 0
    cmp = (self.has_words? ? 1 : -1) <=> (other.has_words? ? 1 : -1)
    return cmp if cmp != 0
    other.duplicate_count <=> self.duplicate_count
  end

  ### Track duplicate entries
  @@duplicate_count = Hash.new{|h,k| h[k]=0}

  def duplicate_key
    [cols[1], has_words?]
  end

  def duplicate_count
    @@duplicate_count[duplicate_key]
  end

  def duplicate_increment
    @@duplicate_count[duplicate_key] += 1
  end
end

### Main
lines = ARGF
rows = lines.map{|line| Row.new(line) }
sorted_rows = rows.sort
sorted_rows.each{|row| puts row }
Input:
0001 @ b @ fish @ text
0002 @ a @ bear @ text
0003 @ b @ bear @ ver.
0004 @ a @ fish @ text
0005 @ c @ bear @ text
0006 @ c @ bear @ text
0007 @ a @ fish @ text
0008 @ c @ fish @ text
0009 @ a @ fish @ text
0010 @ c @ lion @ var.
0011 @ c @ bear @ fam.
0012 @ a @ fish @ text
0013 @ c @ lion @ fam.
Output:
$ cat data.txt | ./sorter.rb
@ a @ fish @ text -- id:0007 duplicates:5
@ a @ bear @ text -- id:0002 duplicates:5
@ a @ fish @ text -- id:0012 duplicates:5
@ a @ fish @ text -- id:0004 duplicates:5
@ a @ fish @ text -- id:0009 duplicates:5
@ b @ fish @ text -- id:0001 duplicates:1
@ b @ bear @ ver. -- id:0003 duplicates:1
@ c @ bear @ text -- id:0005 duplicates:3
@ c @ fish @ text -- id:0008 duplicates:3
@ c @ bear @ text -- id:0006 duplicates:3
@ c @ lion @ var. -- id:0010 duplicates:3
@ c @ bear @ fam. -- id:0011 duplicates:3
@ c @ lion @ fam. -- id:0013 duplicates:3
Answer 3 (score: 0)
This might work for you (very inelegant!):
sed 's/[^@]*@\([^@\]*\)@\([^@]*\)/\1\t\2\t&/;h;s/@/&\n/3;s/.*\n//;/\(arch\|var\|ver\|anci\|fam\)\./!ba;s/.*/1/;bb;:a;s/.*/0/;:b;G;s/\(.\)\n\([^\t]*\)/\2\t\1/' file |
sort |
tee file1 |
sed 's/\(.*\)\t.*/\1/' |
uniq -c |
sed 's|^\s*\(\S*\) \(.*\t.*\t\(.*\)\)|/^\2/s/\3/\1/|' >file.sed
sed -f file.sed file1 |
sort -k1,2 -k3,3nr |
sed 's/\t/\n/3;s/.*\n//'
1 @ a @ fish @ Does not have one of those words.
2 @ a @ fish @ Does not have one of those words.
3 @ a @ fish @ Does not have one of those words.
4 @ a @ tiger @ Does not have one of those words.
5 @ a @ bear @ This sentence contains arch.
6 @ b @ fish @ Does not have one of those words.
7 @ b @ fish @ Has the word ver.
8 @ c @ bear @ Does not have one of those words.
9 @ c @ bear @ Does not have one of those words.
10 @ c @ fish @ Does not have one of those words.
11 @ c @ tiger @ This sentence contains var.
12 @ c @ tiger @ This sentence contains fam.
13 @ c @ bear @ This sentence contains fam.
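Explanation:
Build a sort key consisting of: column 2; a 0/1 flag (1 if column 4 contains one of the keywords); and the number of rows that share the same column 2, flag, and column 3.
Finally, sort the file using that key, then remove the key.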
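The decorate, sort, undecorate pattern described above can also be sketched in Python, which may make the intermediate key easier to see. This is not a translation of the sed pipeline, only an illustration of the same idea. It assumes that no field contains a tab character, and the names decorate, sort_decorated and undecorate are hypothetical.

```python
from collections import Counter

KEYWORDS = ("arch.", "var.", "ver.", "anci.", "fam.")

def decorate(lines):
    """Prefix each line with 'col2<TAB>flag<TAB>count<TAB>'."""
    parsed = []
    for line in lines:
        line = line.rstrip("\n")
        fields = [f.strip() for f in line.split("@", 3)]
        flag = "1" if any(w in fields[3] for w in KEYWORDS) else "0"
        parsed.append((fields[1], flag, fields[2], line))
    # Count rows sharing the same (column 2, flag, column 3).
    counts = Counter(p[:3] for p in parsed)
    return ["\t".join((c2, flag, str(counts[(c2, flag, c3)]), line))
            for c2, flag, c3, line in parsed]

def sort_decorated(decorated):
    def key(entry):
        c2, flag, count, _line = entry.split("\t", 3)
        # Column 2 and flag ascending, count descending.
        return (c2, flag, -int(count))
    return sorted(decorated, key=key)

def undecorate(decorated):
    # Drop the three key fields and keep the original line.
    return [entry.split("\t", 3)[3] for entry in decorated]
```

A call such as undecorate(sort_decorated(decorate(lines))) chains the three stages, mirroring the "build key, sort, remove key" steps.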
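Answer 4 (score: 0)
First, I load the "csv" and get it into the right shape. The test data is called "worms" on my computer, but since q does not use the string "type" for file names (to guard against, for example, injection attacks), I need to use hsym to create a "file name":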
t:flip `id`a`b`c!("SSSS";"@")0:hsym`worms;
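Then I work out which "fourth field" entries contain one of your words. I use like to build a bitmap, applying it to each row (each-left) and then to each pattern (each-right), and reduce it with any to get 0 if none of the words is present or 1 if at least one of them is: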
t:update p:any each c like/:\:("*arch.*";"*var.*";"*ver.*";"*anci.*";"*fam.*") from t;
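For readers unfamiliar with q's each-left and each-right, the same row-by-pattern bitmap can be sketched in Python with shell-style wildcards. This is an analogy rather than a description of how q evaluates the expression; match_matrix and any_match are illustrative names.

```python
from fnmatch import fnmatchcase

PATTERNS = ("*arch.*", "*var.*", "*ver.*", "*anci.*", "*fam.*")

def match_matrix(texts):
    """One row per text, one boolean column per pattern."""
    return [[fnmatchcase(text, pattern) for pattern in PATTERNS]
            for text in texts]

def any_match(texts):
    # Collapse each row of the matrix to a single flag.
    return [any(row) for row in match_matrix(texts)]
```

In these patterns the dot is a literal character and only * acts as a wildcard, so "*ver.*" requires the trailing period to be present.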
Then I want to find the number of duplicates. This is just the count of rows grouped by column 2 (a), column 3 (b), and the current category (p):
t:update d:neg count i by a,b,p from t;
Finally, because I negated the count, all my values sort "the same way", so I can simply sort ascending by these three columns:
`a`p`d xasc t