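Using sed to remove the words in a stop-word list (supplying sed with a list of arguments to delete from a text file)

Time: 2013-02-07 05:34:44

Tags: unix sed awk grep

So, we all know that sed is great at finding and replacing every occurrence of a word in a file: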

sed -i 's/original_word/new_word/g' file.txt

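But can someone tell me how to make sed read the list of "original_words" from a file (similar to grep -f)? I just want to replace all of them with '' (erase them).

The word list file is simply a list of stop words, one per line (wordlist.txt):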

a
about
above
according
across
after
afterwards

This would be a simple way to take a stop-word list and strip its words from a corpus (very useful for cleaning data).

file.txt looks like:

05ricardo   RT @shakira: Immigration reform isn't about politics. It's about people mothers, kids. Obama is working for all of them. http://t.co/rAW ...    0
05ricardo   ?@ItsReginaG: Don't vote Obama. Because you will lose jobs, and die.? Lol   0
05ricardo   ?@shakira: Obama doubles Pell Grants - 700,000 more Latinos get help to go to college. Meet Johanny Adames http://t.co/EMg8NLGl Shak?. ?    -1
05rodriguez_a   My Comm teacher gave me a copy of Obama's speech that he gave the other night and I cried while reading it. It was that moving. -3

4 answers:

Answer 0: (score: 2)

You can also have sed write the sed script for you (tested with GNU sed):

<stopwords sed 's:.*:s/\\b&\\b//g:' | sed -f - file.txt

Output:

05ricardo   RT @shakira: Immigration reform isn't  politics. It's  people mothers, kids. Obama is working for all of them. http://t.co/rAW ...    0
05ricardo   ?@ItsReginaG: Don't vote Obama. Because you will lose jobs, and die.? Lol   0
05ricardo   ?@shakira: Obama doubles Pell Grants - 700,000 more Latinos get help to go to college. Meet Johanny Adames http://t.co/EMg8NLGl Shak?. ?    -1
05rodriguez_a   My Comm teacher gave me  copy of Obama's speech that he gave the other night and I cried while reading it. It was that moving. -3
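
To make the two-stage idea easier to follow, here is a minimal sketch that prints the generated script before applying it. The scratch directory, the three-word stop list and the sample sentence are made up for illustration. The generated script is written to a hypothetical script.sed file instead of being piped in, and the g flag sits inside each generated command on the assumption that every occurrence on a line should be removed. \b is a GNU sed extension.

```shell
# Work in a scratch directory so nothing real is touched (illustrative data).
tmp=$(mktemp -d)
printf '%s\n' a about above > "$tmp/stopwords"
printf '%s\n' "It is about a book about sed" > "$tmp/file.txt"

# Stage 1: turn every stop word into its own substitution command.
# "&" is the matched line (the word); "\\b" emits a literal \b.
sed 's:.*:s/\\b&\\b//g:' "$tmp/stopwords" > "$tmp/script.sed"
generated=$(cat "$tmp/script.sed")
printf '%s\n' "$generated"

# Stage 2: run the generated script against the text.
cleaned=$(sed -f "$tmp/script.sed" "$tmp/file.txt")
printf '%s\n' "$cleaned"

rm -r "$tmp"
```

The words are deleted but the surrounding spaces are kept, so runs of blanks remain where a word used to be. Stop words that contain regex metacharacters or a slash would need escaping before being turned into commands.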

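Answer 1: (score: 1)

First, not every sed supports -i, but the option is not essential, because it is trivial to provide that functionality in a general way. One simple option (assuming a shell outside the csh family):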

inline() { f=$1; shift; "$@" < "$f" > "$f.out" && mv "$f.out" "$f"; }

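Then, to do the replacement (you did not specify how you want word separators handled, so if "foo" is in the blacklist, "bar foo baz" will end up with two spaces between "bar" and "baz"), it is very simple with awk or perl: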

awk 'NR==FNR{a[$0]; next} {for( i in a ) gsub( i, "" )} 1' original-words file.txt
# @ARGV is non-empty only while the first file (the word list) is being read
perl -wne 'if( @ARGV ){ chomp; push @no, $_; next }
    foreach $x( @no ) {s/$x//g } print ' original-words file.txt

Once you are happy with the results, use perl's -i (not every sed supports -i, but every perl > 5.0 does), or modify the file in place with:

inline file.txt awk 'NR==FNR{a[$0]; next} 
    {for( i in a ) gsub( i, "" )} 1' original-words -

Either of these solutions is much faster than invoking sed once for every word in the blacklist.
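
One thing to keep in mind: gsub treats each blacklist entry as an unanchored regular expression, so an entry such as "a" also matches the letter a inside longer words. If whole-word removal is what is wanted, a possible sketch is to compare whitespace-separated fields against the list instead. The sample data is made up, and the approach assumes that tokens are separated by blanks and that collapsing runs of blanks into single spaces is acceptable. A token with punctuation attached (for example "about?") is not treated as a stop word here.

```shell
# Illustrative data in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' a about above > "$tmp/original-words"
printf '%s\n' "a banana is about a meter above" > "$tmp/file.txt"

# First file: remember every stop word as an array key.
# Second file: rebuild each line from the fields that are not stop words.
filtered=$(awk '
    NR == FNR { stop[$0]; next }
    {
        out = ""
        for (i = 1; i <= NF; i++)
            if (!($i in stop))
                out = out (out == "" ? "" : " ") $i
        print out
    }
' "$tmp/original-words" "$tmp/file.txt")
printf '%s\n' "$filtered"

rm -r "$tmp"
```

Because the lookup is an exact array-key test rather than a regex match, "banana" survives even though "a" is on the list.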

Answer 2: (score: 0)

Here is one way using GNU sed:

while IFS= read -r word; do sed -ri "s/( |)\b$word\b//g" file; done < wordlist

File contents:

how about I decide to look at it afterwards. What
across do you think? Is it a good idea to go out and about? I 
think I'd rather go up and above.

Result:

how I decide to look at it. What
 do you think? Is it good idea to go out and? I 
think I'd rather go up and.
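
The loop above rewrites the file once per stop word. As a possible variation, the words can be joined into a single alternation so that sed makes one pass. This is only a sketch: the file names and sample text are made up, it assumes GNU sed (for -r and \b), and it assumes the words contain no regex metacharacters.

```shell
# Illustrative data in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' a about above after afterwards > "$tmp/wordlist"
printf '%s\n' "Is it a good idea to go out and about afterwards?" > "$tmp/file"

# Join the words with "|" to form one alternation: a|about|above|...
pattern=$(paste -sd '|' "$tmp/wordlist")

# One sed invocation and one pass over the file; the optional leading
# space is removed together with the word, as in the loop version.
result=$(sed -r "s/( |)\b($pattern)\b//g" "$tmp/file")
printf '%s\n' "$result"

rm -r "$tmp"
```

With a very long word list the pattern grows accordingly, so it may be worth checking that it stays within what the shell and sed handle comfortably.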

Answer 3: (score: -1)

cat file.txt | grep  -vf wordlist.txt
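
For reference, a small sketch of how this command behaves, using a made-up two-line file and word list in a scratch directory. grep -v selects lines rather than words, so any line that contains a listed pattern is dropped as a whole, and the patterns match substrings unless -w is added.

```shell
# Illustrative data in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' about above > "$tmp/wordlist.txt"
printf '%s\n' "line one is about sed" "line two has no stop words" > "$tmp/file.txt"

# -f reads the patterns from a file and -v inverts the match, so every
# line containing any pattern is removed in full.
kept=$(grep -vf "$tmp/wordlist.txt" "$tmp/file.txt")
printf '%s\n' "$kept"

rm -r "$tmp"
```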