I am trying to remove stop words from a text.
I tried the code below.
import re
from nltk.corpus import stopwords
sw = stopwords.words("english")
my_text = 'I love coding'
my_text = re.sub("|".join(sw), "", my_text)
print(my_text)
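Expected result: love coding
Actual result: I l cng
(because the stop word list "sw" contains both "o" and "ve", which are matched inside other words).
How can I get the expected result?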
Answer 0: (score: 1)
Split the sentence into words first, remove the stop words, and then join the remaining words back together.
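The split, filter, and join steps described above might look like the following minimal sketch. The small `STOP_WORDS` set is a hypothetical stand-in for `stopwords.words("english")` so that the example runs without downloading the NLTK corpus, and the function name `remove_stop_words` is likewise made up for illustration. Lowercasing each word before the lookup is an added assumption, needed because the capitalized "I" would otherwise not match a lowercase stop word list.

```python
# Hypothetical stand-in for stopwords.words("english"); swap in the real
# NLTK list if the corpus is installed.
STOP_WORDS = {"i", "me", "my", "o", "ve", "d", "in", "the", "a"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Drop whole words that appear in stop_words and rejoin the rest."""
    words = text.split()  # naive whitespace split
    # Compare in lowercase so that "I" matches the stop word "i".
    kept = [word for word in words if word.lower() not in stop_words]
    return " ".join(kept)


print(remove_stop_words("I love coding"))  # love coding
```

Because the comparison is made on whole words, "o" and "ve" no longer match inside "love".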
Answer 1: (score: 0)
You need to replace words, not characters:
from itertools import filterfalse
from nltk.corpus import stopwords
sw = stopwords.words("english")
my_text = 'I love coding'
my_words = my_text.split()  # naive split into words on whitespace
# Keep only the words that are not in the stop word list (case-sensitive check)
no_stopwords = ' '.join(filterfalse(sw.__contains__, my_words))
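You should also think about sentence splitting, case sensitivity, and similar issues.
Since this is a common, non-trivial problem, there are libraries that handle it properly.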
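If you want to stay close to the original `re.sub` approach, one possible sketch is to anchor every stop word with `\b` word boundaries and match case-insensitively. The `STOP_WORDS` set below is again a hypothetical stand-in for `stopwords.words("english")`, and `strip_stop_words` is an illustrative name, not part of any library. This is not a replacement for a real tokenizer; it only addresses the case and substring problems mentioned above.

```python
import re

# Hypothetical stand-in for stopwords.words("english").
STOP_WORDS = {"i", "me", "my", "o", "ve", "d", "in", "the", "a", "don", "don't"}


def strip_stop_words(text, stop_words=STOP_WORDS):
    """Remove stop words that occur as whole words, ignoring case."""
    # Longest alternatives first, so "don't" is tried before "don".
    ordered = sorted(stop_words, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(word) for word in ordered) + r")\b",
        re.IGNORECASE,
    )
    without = pattern.sub("", text)
    # Collapse the whitespace left behind by the removed words.
    return " ".join(without.split())


print(strip_stop_words("I love coding"))  # love coding
```

The word boundaries prevent "o" and "ve" from matching inside "love", which is what went wrong in the question's pattern. Punctuation attached to a removed word stays in the output, so a proper tokenizer is still the better choice for real text.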