I have a file like this:
This is a file with many words.
Some of the words appear more than once.
Some of the words only appear one time.
I want to generate a two-column list. The first column shows the word, and the second column shows how often it appears, for example:
this@1
is@1
a@1
file@1
with@1
many@1
words@3
some@2
of@2
the@2
only@1
appear@2
more@1
than@1
one@1
once@1
time@1
words and word can count as two separate words. So far, I have this:
sed -i "s/ /\n/g" ./file1.txt # put all words on a new line
while read line
do
count="$(grep -c $line file1.txt)"
echo $line"@"$count >> file2.txt # add word and frequency to file
done < ./file1.txt
sort -u -d # remove duplicate lines
For some reason, this just displays "0" after each word.
How can I generate a list of every word that appears in the file, along with its frequency?
Answer 0 (score: 57)
Not sed and grep, but tr, sort, uniq, and awk:
% (tr ' ' '\n' | sort | uniq -c | awk '{print $2"@"$1}') <<EOF
This is a file with many words.
Some of the words appear more than once.
Some of the words only appear one time.
EOF
a@1
appear@2
file@1
is@1
many@1
more@1
of@2
once.@1
one@1
only@1
Some@2
than@1
the@2
This@1
time.@1
with@1
words@2
words.@1
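The output above keeps case and trailing punctuation distinct (Some vs. some, once. vs. once). If you want those folded as well to match the asker's expected list, a sketch extending the same idea, assuming the words live in file1.txt:

% tr '[:upper:]' '[:lower:]' < file1.txt | tr -d '[:punct:]' | tr -s ' ' '\n' | sort | uniq -c | awk '{print $2"@"$1}'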
Answer 1 (score: 40)
uniq -c already does what you want; just sort the input first:
echo 'a s d s d a s d s a a d d s a s d d s a' | tr ' ' '\n' | sort | uniq -c
Output:
6 a
7 d
7 s
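To get the word@count format the question asks for, the awk step from the answer above can be appended, e.g.:

echo 'a s d s d a s d s a a d d s a s d d s a' | tr ' ' '\n' | sort | uniq -c | awk '{print $2"@"$1}'

which prints a@6, d@7 and s@7.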
Answer 2 (score: 5)
Contents of the input file:
$ cat inputFile.txt
This is a file with many words.
Some of the words appear more than once.
Some of the words only appear one time.
Using sed | sort | uniq:
$ sed 's/\.//g;s/\(.*\)/\L\1/;s/\ /\n/g' inputFile.txt | sort | uniq -c
1 a
2 appear
1 file
1 is
1 many
1 more
2 of
1 once
1 one
1 only
2 some
1 than
2 the
1 this
1 time
1 with
3 words
uniq -ic would count while ignoring case, but the resulting list would contain This rather than this.
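For reference, a sketch of that uniq -ic variant; note that sort -f is needed so case-insensitive duplicates end up adjacent before uniq sees them:

$ sed 's/\.//g;s/\ /\n/g' inputFile.txt | sort -f | uniq -ic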
Answer 3 (score: 3)
This function lists the frequency of each word occurring in the provided input, in descending order:
function wordfrequency() {
awk '
BEGIN { FS="[^a-zA-Z]+" } {
for (i=1; i<=NF; i++) {
word = tolower($i)
words[word]++
}
}
END {
for (w in words)
printf("%3d %s\n", words[w], w)
} ' | sort -rn
}
You can call it on a file:
$ cat your_file.txt | wordfrequency
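Since the function reads standard input, a plain redirect works as well and avoids the extra cat:

$ wordfrequency < your_file.txt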
Answer 4 (score: 2)
This might work for you:
tr '[:upper:]' '[:lower:]' <file |
tr -d '[:punct:]' |
tr -s ' ' '\n' |
sort |
uniq -c |
sed 's/ *\([0-9]*\) \(.*\)/\2@\1/'
Answer 5 (score: 2)
Let's do this in Python 3!
"""Counts the frequency of each word in the given text; words are defined as
entities separated by whitespaces; punctuations and other symbols are ignored;
case-insensitive; input can be passed through stdin or through a file specified
as an argument; prints highest frequency words first"""
# Case-insensitive
# Ignore punctuations `~!@#$%^&*()_-+={}[]\|:;"'<>,.?/
import sys
# Find if input is being given through stdin or from a file
lines = None
if len(sys.argv) == 1:
lines = sys.stdin
else:
lines = open(sys.argv[1])
D = {}
for line in lines:
for word in line.split():
word = ''.join(list(filter(
lambda ch: ch not in "`~!@#$%^&*()_-+={}[]\\|:;\"'<>,.?/",
word)))
word = word.lower()
if word in D:
D[word] += 1
else:
D[word] = 1
for word in sorted(D, key=D.get, reverse=True):
print(word + ' ' + str(D[word]))
Let's name this script "frequency.py" and add a line to "~/.bash_aliases":
alias freq="python3 /path/to/frequency.py"
Now, to find the word frequencies in the file "content.txt", you can run:
freq content.txt
You can also pipe output into it:
cat content.txt | freq
And even analyze text from multiple files:
cat content.txt story.txt article.txt | freq
If you're using Python 2, replace:
''.join(list(filter(args...))) with filter(args...)
python3 with python
print(whatever) with print whatever
Answer 6 (score: 1)
The sorting requires GNU AWK (gawk). If you have another AWK without asort(), this can easily be adjusted and then piped to sort.
awk '{gsub(/\./, ""); for (i = 1; i <= NF; i++) {w = tolower($i); count[w]++; words[w] = w}} END {qty = asort(words); for (w = 1; w <= qty; w++) print words[w] "@" count[words[w]]}' inputfile
Split over multiple lines:
awk '{
gsub(/\./, "");
for (i = 1; i <= NF; i++) {
w = tolower($i);
count[w]++;
words[w] = w
}
}
END {
qty = asort(words);
for (w = 1; w <= qty; w++)
print words[w] "@" count[words[w]]
}' inputfile
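For an AWK without asort(), a sketch of that adjustment, handing the ordering off to sort instead:

awk '{gsub(/\./, ""); for (i = 1; i <= NF; i++) count[tolower($i)]++} END {for (w in count) print w "@" count[w]}' inputfile | sort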
Answer 7 (score: 1)
You can run the following using tr ('\12' is the octal escape for a newline):
tr ' ' '\12' <NAME_OF_FILE| sort | uniq -c | sort -nr > result.txt
Example output for a text file of city names:
3026 Toronto
2006 Montréal
1117 Edmonton
1048 Calgary
905 Ottawa
724 Winnipeg
673 Vancouver
495 Brampton
489 Mississauga
482 London
467 Hamilton
Answer 8 (score: 1)
If my file.txt contains the following text:
This is line number one
This is Line Number Tow
this is Line Number tow
I can find the frequency of each word using the following command:
cat file.txt | tr ' ' '\n' | sort | uniq -c
Output:
3 is
1 line
2 Line
1 number
2 Number
1 one
1 this
2 This
1 tow
1 Tow
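Note that Line and line are counted separately. If you would rather fold case first, one option is an extra tr step, e.g.:

cat file.txt | tr '[:upper:]' '[:lower:]' | tr ' ' '\n' | sort | uniq -c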
Answer 9 (score: 1)
This is a more involved task. We need to take at least the following into account:
$ file the-king-james-bible.txt
the-king-james-bible.txt: UTF-8 Unicode (with BOM) text
The BOM is the first (meta) character in the file. If it is not removed, it attaches to the first word and throws off that word's count.
Here is a solution using AWK:
{
if (NR == 1) {
sub(/^\xef\xbb\xbf/,"")
}
gsub(/[,;!()*:?.]*/, "")
for (i = 1; i <= NF; i++) {
if ($i ~ /^[0-9]/) {
continue
}
w = $i
words[w]++
}
}
END {
for (idx in words) {
print idx, words[idx]
}
}
It removes the BOM character and strips punctuation. It does not lowercase the words. Also, since the program was used to count the words of the Bible, it skips all verse numbers (the if condition with continue).
$ awk -f word_freq.awk the-king-james-bible.txt > bible_words.txt
We run the program and write the output to a file.
$ sort -nr -k 2 bible_words.txt | head
the 62103
and 38848
of 34478
to 13400
And 12846
that 12576
in 12331
shall 9760
he 9665
unto 8942
With sort and head, we find the ten most frequent words in the Bible.
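(The counts for the and And stay separate; if you wanted them merged, a small change would be storing w = tolower($i) instead of w = $i in the program above.)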
Answer 10 (score: 0)
#!/usr/bin/env bash
declare -A map
words="$1"
[[ -f $words ]] || { echo "usage: $(basename "$0") wordfile"; exit 1; }
while read -r line; do
    for word in $line; do
        ((map[$word]++))   # bump this word's count
    done
done < "$words"
for key in "${!map[@]}"; do
    echo "the word $key appears ${map[$key]} times"
done | sort -nr -k5        # sort by the count (field 5), highest first
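Assuming the script is saved as wordfreq.sh and made executable, a sample invocation would be:

./wordfreq.sh file1.txt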
Answer 11 (score: 0)
awk 'BEGIN { word[""] = 0 }
{
    for (el = 1; el <= NF; ++el) { word[$el]++ }
}
END {
    for (i in word) {
        if (i != "") {
            print word[i], i
        }
    }
}' file.txt | sort -nr