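I am trying to list the words in a file in the order in which they appear (I am only interested in certain words in the file). The first word to appear should be at the top of the output and the last one at the bottom.
The usual way of generating word counts, sort | uniq -c, destroys that order. How can I generate this frequency count without losing the ordering?
Sample text file:
Godard conscious aioli Ouija Aeolus victorious furious perfect family twelve silver seven promiscuous radioactive one you Thursday heart Ate pizza contagious near princess ion water ace igneous ambitious
Sample output: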
1 conscious
1 aioli
1 Ouija
1 Aeolus
1 victorious
1 furious
1 promiscuous
1 radioactive
1 contagious
1 igneous
1 ambitious
Answer 0 (score: 3)
awk to the rescue!
A double scan of the file to get the counts:
$ awk -v RS=' +|\n' 'NR==FNR {t=$0; if(gsub(/[aeiou]/,"")>2) a[t]++; next}
$0 in a {print a[$0],$0; delete a[$0]}' file{,}
1 conscious
1 aioli
1 Ouija
1 Aeolus
1 victorious
1 furious
1 promiscuous
1 radioactive
1 contagious
1 igneous
1 ambitious
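To see why the double scan keeps the original order, here is a minimal sketch of the same NR==FNR idiom on a tiny input, without the vowel filter. The helper name first_seen_counts and the sample words are made up for illustration, and the sketch loops over fields instead of setting a regex RS so that it does not depend on a particular awk implementation.

```shell
# first_seen_counts is a hypothetical helper, not part of the answer above.
# It reads the same file twice: pass 1 counts, pass 2 prints in input order.
first_seen_counts() {
    awk '
        # Pass 1 (NR == FNR only holds while reading the first file argument).
        NR == FNR { for (i = 1; i <= NF; i++) count[$i]++; next }
        # Pass 2: print each word at its first appearance, then forget it
        # so that later repeats are skipped.
        {
            for (i = 1; i <= NF; i++)
                if ($i in count) { print count[$i], $i; delete count[$i] }
        }
    ' "$1" "$1"
}

sample=$(mktemp)
printf 'b a b\nc a b\n' > "$sample"
first_seen_counts "$sample"
rm -f "$sample"
```

On this sample the words come out as b, a, c, which is their order of first appearance rather than alphabetical order.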
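Alternatively, if the filtered word list is produced by some other means, this pipeline will generate the counts in input order: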
$ awk -v RS=' +|\n' '{t=$0} gsub(/[aeiou]/,"")>2{print t}' file |
# or some other means to generate filtered words ...
cat -n | # add line number
sort -k2 -k1n | # sort by words and line number
uniq -f1 -c | # find counts skipping line number
sort -k2n | # sort by original line number
awk '{print $1,$3}' # remove the line number
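The pipeline above is a decorate, sort, count, re-sort, undecorate sequence. As a rough sketch, it can be wrapped in a function and tried on a tiny word list. The name ordered_counts is hypothetical, the function assumes one word per line on stdin, and LC_ALL=C is an added assumption to keep the grouping by word independent of the locale.

```shell
# ordered_counts is a hypothetical wrapper around the pipeline shown above.
# Input: one word per line on stdin. Output: "count word" in first-seen order.
ordered_counts() {
    cat -n |                  # decorate: prefix each word with its line number
    LC_ALL=C sort -k2 -k1n |  # group identical words, earliest line first
    uniq -f1 -c |             # count each group, ignoring the line-number field
    sort -k2n |               # restore order by the earliest line number
    awk '{print $1, $3}'      # undecorate: keep only count and word
}

printf '%s\n' b a b c a b | ordered_counts
```

Because uniq keeps the first line of each group, and the first sort puts the lowest line number first within a group, the line number that survives is the one of the word's first appearance.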
Answer 1 (score: 2)
The following command:
s='Godard conscious aioli Ouija Aeolus victorious furious perfect family twelve silver seven promiscuous radioactive one you Thursday heart Ate pizza contagious near princess ion water ace igneous ambitious'
tr '[[:space:]]' '\n' <<<"$s" | egrep -i '[aeoiu].*[aeiou].*[aeiou]'
...generates the output:
conscious
aioli
Ouija
Aeolus
victorious
furious
promiscuous
radioactive
contagious
igneous
ambitious
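...which correctly contains the subset of words with at least three vowels, in their original order of appearance.
Maintaining a counter requires either keeping state or making multiple passes.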
#!/usr/bin/env bash
if [[ -z $BASH_VERSION ]] || [[ $BASH_VERSION = [1-3].* ]]; then
  echo "ERROR: This requires bash 4.0 or newer" >&2
  exit 1
fi
### Begin code from Part 1
s='Godard conscious aioli Ouija Aeolus victorious furious perfect family twelve silver seven promiscuous radioactive one you Thursday heart Ate pizza contagious near princess ion water ace igneous ambitious'
get_words() { tr '[[:space:]]' '\n' <<<"$s" | egrep -i '[aeoiu].*[aeiou].*[aeiou]'; }
### End code from Part 1
declare -a var_order=( )
declare -A var_count=( )
while IFS= read -r var; do
  if (( ${var_count[$var]} )); then
    var_count[$var]=$(( ${var_count[$var]} + 1 ))
  else
    var_order+=( "$var" )
    var_count[$var]=1
  fi
done < <(get_words)
for var in "${var_order[@]}"; do
  printf '% -4d %s\n' "${var_count[$var]}" "$var"
done
...correctly generates the output:
1 conscious
1 aioli
1 Ouija
1 Aeolus
1 victorious
1 furious
1 promiscuous
1 radioactive
1 contagious
1 igneous
1 ambitious
Answer 2 (score: 1)
I thought I'd chime in as well.
Here's a one-liner, just for Charles:
gawk -v RS="[[:space:]]+" '{$0=tolower($0)} /[aeiou]{3}/ && !($0 in p) {p[$0]=n++} /[aeiou]{3}/ {a[p[$0]]=$0;c[p[$0]]++} END { for (i=0;i<n;i++) printf "%6d %s\n",c[i],a[i] }' input.txt
Broken out for easier reading (and commenting):
#!/usr/bin/env gawk -f
BEGIN {
  RS="[[:space:]]+"              # Set a reasonable record separator
}                                # (includes spaces and newlines)
{
  $0=tolower($0)                 # ignore case...
}
/[aeiou]{3}/ && !($0 in p) {     # if we've found a word, make sure
  p[$0]=n++                      # we have a pointer to it.
}
/[aeiou]{3}/ {                   # if we've found a word and have a pointer,
  a[p[$0]]=$0                    # make a record of the word,
  c[p[$0]]++                     # and increment its counter.
}
END {                            # Once everything's been processed,
  for (i=0;i<n;i++)              # step through our list, and
    printf "%6d %s\n",c[i],a[i]  # print the results.
}
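This covers multiple forms of whitespace, counts accurately, and keeps the words in order. Oh, and it does it in a single pass.
Answer 3 (score: 0)
Considering more possible input: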
cat txt1
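Godard conscious aioli Ouija Aeolus victorious furious perfect family twelve silver seven promiscuous radioactive one you Thursday ambitious heart Ate pizza contagious near princess ion water ace ambitious igneous ambitious conscious
The following awk script does the trick: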
awk 'NR==FNR {v[i++]=$0;c[$0]++;next}END{
for(j=0;j<i;j++){if(p[v[j]]==0){print c[v[j]],v[j]}
p[v[j]]=c[v[j]]>1?1:0;}
}' <(awk -v RS=' +|\n' '$0 ~ /(.*[aAeEiIoOuU].*){3}/' txt1)
Output:
2 conscious
1 aioli
1 Ouija
1 Aeolus
1 victorious
1 furious
1 promiscuous
1 radioactive
3 ambitious
1 contagious
1 igneous
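For comparison, the same filter-and-count idea can be sketched in a single awk process that remembers the order of first appearance as it goes. This is an alternative sketch, not the script above: the name vowel_word_counts is hypothetical, it loops over fields instead of relying on a regex RS, and it treats "at least three vowels anywhere in the word" as the filter.

```shell
# vowel_word_counts is a hypothetical helper used only for this sketch.
# It reads the named files, or stdin when no file is given.
vowel_word_counts() {
    awk '
        {
            for (i = 1; i <= NF; i++) {
                word = $i
                probe = word
                # gsub returns the number of replacements, i.e. the vowel count;
                # it edits the copy in probe, so word stays intact.
                if (gsub(/[aeiouAEIOU]/, "", probe) < 3) continue
                if (!(word in count)) order[n++] = word   # first appearance
                count[word]++
            }
        }
        END { for (j = 0; j < n; j++) print count[order[j]], order[j] }
    ' "$@"
}

printf 'conscious pizza aioli\nambitious conscious\n' | vowel_word_counts
```

Run as vowel_word_counts txt1, this should list each qualifying word once, at the position of its first appearance, with its total count.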
Answer 4 (score: 0)
In plain bash, you could do this:
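set -f                   # disable globbing so words are not expanded as patterns
shopt -s nocasematch     # make the [[ == ]] pattern match case-insensitive
for word in $(< words.txt); do
  [[ $word == *[aeiou][aeiou][aeiou]* ]] && echo "$word"
done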
This only prints the words that contain 3 consecutive vowels; it does not count them.
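If counts are wanted as well, one possible extension is to keep an associative array of counts plus an indexed array that records first appearances. The following is only a sketch under a few assumptions: it needs bash 4.0 or newer, count_vowel_runs is a made-up name, words are read from stdin, and the vowel classes spell out both cases instead of using nocasematch so that no shell option is changed.

```shell
# Sketch only: requires bash 4.0+ for associative arrays.
# count_vowel_runs is a hypothetical name; it reads words from stdin.
count_vowel_runs() {
    local word
    local -a words=() order=()
    local -A count=()
    # The "|| (( ... ))" part also handles a last line without a newline.
    while read -r -a words || (( ${#words[@]} )); do
        for word in "${words[@]}"; do
            # Keep only words containing three consecutive vowels.
            [[ $word == *[aeiouAEIOU][aeiouAEIOU][aeiouAEIOU]* ]] || continue
            # Record the word the first time it is seen.
            [[ -n ${count[$word]:-} ]] || order+=( "$word" )
            count[$word]=$(( ${count[$word]:-0} + 1 ))
        done
    done
    for word in "${order[@]}"; do
        printf '%d %s\n' "${count[$word]}" "$word"
    done
}
```

It could be called as count_vowel_runs < words.txt. Array keys are case-sensitive, so in this sketch "Ouija" and "ouija" would be counted as two different words.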