Question

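I know similar questions have already been posted, but I have not found an answer to mine. I have a text file and another file containing a list of stop words (http://www.textfixer.com/resources/common-english-words.txt). I need to remove the words listed in common-english-words.txt from my text file.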

Answer 1

Combining a few tools should point you in the right direction.

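sed -E 's/\b('"$(tr ',' '|' < common-english-words.txt)"')\b//g' myfile.txt > out.txt

I see that common-english-words.txt is a comma-separated list of words, so replacing the commas with bars (|) yields a regular expression that matches any one of them. You can then use sed to delete them. The -E option makes sed treat ( | ) as grouping and alternation rather than as literal characters, and \b (a GNU sed extension) limits each match to a whole word, so that "a" is not stripped out of the middle of other words.

The command actually executed looks like this:

sed -E 's/\b(a|able|about|...)\b//g' myfile.txt > out.txt

This simply deletes the listed words and sends the output to out.txt.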
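To make the behaviour concrete, here is a minimal sketch on made-up sample data. The file names under the temporary directory, the five-word stop list, and the sample sentence are all assumptions for illustration. It also adds two things that are not part of the command above: the GNU sed I flag for case-insensitive matching, and a second expression that tidies up the leftover spaces.

```shell
# Hypothetical sample data; the real stop list is the comma-separated file from the question.
workdir=$(mktemp -d)
printf 'a,able,about,the,is\n' > "$workdir/stopwords.txt"
printf 'The table is about a stable cat\n' > "$workdir/myfile.txt"

# Turn the comma-separated list into the alternation a|able|about|the|is.
pattern=$(tr ',' '|' < "$workdir/stopwords.txt")

# -E enables extended regex; \b (word boundary) and the I flag are GNU sed extensions.
# The second expression collapses runs of spaces and trims the ends of each line.
result=$(sed -E -e "s/\\b($pattern)\\b//gI" -e 's/ +/ /g; s/^ | $//g' "$workdir/myfile.txt")
printf '%s\n' "$result"

rm -r "$workdir"
```

With GNU sed this should print "table stable cat": the stop words are gone, while "table" and "stable" survive because the word boundaries stop "a" and "able" from matching inside them.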

Answer 2

Answer:

sed 's/,/ /g' filename > out.txt   # change the commas into white space

tr ' ' '\n' < out.txt > out1.txt   # put each stop word on its own line

tr -c '[:alnum:]' '[\n*]' < JJ.txt | fgrep -i -v -w -f out1.txt | sort | uniq -c | sort -nr | head -20   # count the 20 most frequent words, excluding the stop words
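The same pipeline can be tried end to end on a small made-up input. This is only a sketch: the file names, the four-word stop list, and the sample text are assumptions, grep -F is used as the modern spelling of fgrep, and tr gets an extra -s so that runs of punctuation and spaces do not produce empty lines that would otherwise be counted as a "word".

```shell
# Hypothetical sample data standing in for the stop list and JJ.txt.
workdir=$(mktemp -d)
printf 'a,the,is,of\n' > "$workdir/stopwords.csv"
printf 'The cat is the friend of a cat. A dog is a dog, the cat said.\n' > "$workdir/text.txt"

# grep -f expects one pattern per line, so turn the commas into newlines.
tr ',' '\n' < "$workdir/stopwords.csv" > "$workdir/stopwords.txt"

# Split the text into one word per line, drop whole-word stop-word matches
# (case-insensitively), then count and rank what is left.
top=$(tr -cs '[:alnum:]' '[\n*]' < "$workdir/text.txt" \
  | grep -F -i -v -w -f "$workdir/stopwords.txt" \
  | sort | uniq -c | sort -nr | head -n 20)
printf '%s\n' "$top"

rm -r "$workdir"
```

On this input the top two entries should be "cat" with a count of 3 and "dog" with a count of 2. Note that counting is case-sensitive here, so "Cat" and "cat" would be tallied separately unless the text is lowercased first.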

Removing the words contained in one file from another text file in Bash
