Question

I'm trying to create a dictionary of words from a collection of files. Is there a simple way to print all the words in a file, one per line?

Answer 1

You could use grep:

-E '\w+' searches for words
-o prints only the portion of each line that matches

% cat temp
Some examples use "The quick brown fox jumped over the lazy dog,"
rather than "Lorem ipsum dolor sit amet, consectetur adipiscing elit"
for example text.
# if you don't care whether words repeat
% grep -o -E '\w+' temp
Some
examples
use
The
quick
brown
fox
jumped
over
the
lazy
dog
rather
than
Lorem
ipsum
dolor
sit
amet
consectetur
adipiscing
elit
for
example
text

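If you want each word printed only once, disregarding case, you can use sort:

-u prints each word only once
-f tells sort to ignore case when comparing words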

# if you only want each word once
% grep -o -E '\w+' temp | sort -u -f
adipiscing
amet
brown
consectetur
dog
dolor
elit
example
examples
for
fox
ipsum
jumped
lazy
Lorem
over
quick
rather
sit
Some
text
than
The
use
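
Since the question mentions a collection of files, the same pipeline can be pointed at several files at once. This is only a sketch: words_in_files is a hypothetical helper name, and it assumes a grep that supports -h (GNU and BSD grep both do) to suppress the "filename:" prefix grep would otherwise add when given more than one file.

```shell
# words_in_files: hypothetical helper, not part of any standard tool.
# Prints each distinct word (ignoring case) found in the files given as arguments.
words_in_files() {
    # -o  print only the matched text, one match per line
    # -h  omit the file name prefix when several files are searched
    grep -o -h -E '\w+' "$@" | sort -u -f
}
```

You would call it as words_in_files *.txt, or with any other list of file names.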

Answer 2

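A good start is to simply use sed to replace all spaces with newlines, strip out the empty lines (again with sed), and then use sort with the -u (uniquify) flag to remove duplicates, as in this example: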

$ echo "the quick brown dog and fox jumped
over the lazy   dog" | sed 's/ /\n/g' | sed '/^$/d' | sort -u

and
brown
dog
fox
jumped
lazy
over
quick
the

Then you can start worrying about punctuation and so on.
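
One possible way to deal with punctuation is to let tr turn every run of non-letters into a single newline. This is a sketch rather than a complete tokenizer: split_words is a hypothetical helper name, and treating only [:alpha:] characters as part of a word is an assumption (it splits "don't" into "don" and "t", and drops digits).

```shell
# split_words: hypothetical helper; reads text on stdin, prints one word per line.
# -c  complement the set, i.e. act on everything that is NOT a letter
# -s  squeeze each run of replacement newlines into a single newline
split_words() {
    tr -cs '[:alpha:]' '[\n*]'
}

echo 'over the "lazy" dog, the end.' | split_words | sort -u
```

If the input starts with a non-letter, the first output line will be empty, so you may still want the sed '/^$/d' step from above.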

Answer 3

Assuming words separated by whitespace:

awk '{for(i=1;i<=NF;i++)print $i}' file

or

 tr ' ' "\n" < file

If you want uniqueness:

awk '{for(i=1;i<=NF;i++)_[$i]++}END{for(i in _) print i}' file

tr ' ' "\n" < file | sort -u

With some punctuation removed:

awk '{
    gsub(/["*^&()#@$,?~]/,"")
    for(i=1;i<=NF;i++){  _[$i]  }
}
END{    for(o in _){ print o }  }' file
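
Note that the tr ' ' "\n" commands above translate only the space character, so tabs are left in place and repeated spaces produce empty lines. A sketch that treats any run of whitespace as one separator, closer to what awk's field splitting does (split_on_whitespace is a hypothetical helper name):

```shell
# split_on_whitespace: hypothetical helper; reads text on stdin,
# prints one whitespace-separated token per line.
# -s squeezes each run of resulting newlines into one, so repeated
# spaces or tabs do not leave empty lines between words.
split_on_whitespace() {
    tr -s '[:space:]' '[\n*]'
}

printf 'over the\tlazy   dog\n' | split_on_whitespace | sort -u
```

Leading whitespace in the input still yields one empty line at the top of the output.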

Answer 4

Ken Church's "Unix(TM) for Poets" (PDF) describes exactly this type of application: extracting words from text files, sorting and counting them, and so on.
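
For the sorting-and-counting part, a pipeline in that spirit might look like the sketch below. It is not quoted from the PDF; word_counts is a hypothetical helper name, and folding everything to lower case is an assumption you may not want.

```shell
# word_counts: hypothetical helper; reads text on stdin and prints
# "count word" lines, most frequent word first.
word_counts() {
    tr -cs '[:alpha:]' '[\n*]' |      # one word per line, punctuation dropped
        tr '[:upper:]' '[:lower:]' |  # fold case so "The" and "the" match
        sort |                        # bring identical words together
        uniq -c |                     # prefix each distinct word with its count
        sort -rn                      # highest counts first
}
```

Usage would be word_counts < file, or cat file1 file2 | word_counts for several files.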
