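I am trying to process user-entered text by removing stopwords with the nltk toolkit, but stopword removal also strips words such as "and", "or" and "not". I want these words to survive the stopword-removal step, because they are operators needed later when the text is processed as a query. I don't know which words can act as operators in a text query, and I also want to remove unnecessary words from the text.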
Answer 0 (score: 138)
There is a built-in stopword list in NLTK made up of 2,400 stopwords for 11 languages (Porter et al), see http://nltk.org/book/ch02.html
>>> from nltk import word_tokenize
>>> from nltk.corpus import stopwords
>>> stop = set(stopwords.words('english'))
>>> sentence = "this is a foo bar sentence"
>>> print([i for i in sentence.lower().split() if i not in stop])
['foo', 'bar', 'sentence']
>>> [i for i in word_tokenize(sentence.lower()) if i not in stop]
['foo', 'bar', 'sentence']
I recommend looking at using tf-idf to remove stopwords, see "Effects of Stemming on the term frequency?"
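One possible reading of the tf-idf suggestion is to treat terms with a very low inverse document frequency as corpus-specific stopwords. The sketch below is only an illustration under that assumption: the helper `low_idf_terms`, the `max_idf` threshold and the four sample documents are all made up for this example, and plain `str.split` stands in for a real tokenizer.

```python
import math
from collections import Counter

def low_idf_terms(documents, max_idf=0.5):
    """Return the terms whose inverse document frequency is at most max_idf."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        # Count each term once per document, however often it repeats.
        doc_freq.update(set(doc.lower().split()))
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}
    return {term for term, value in idf.items() if value <= max_idf}

# Hypothetical toy corpus, used only to make the sketch runnable.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a bird sang in the tree",
    "the fish swam in the pond",
]
corpus_stopwords = low_idf_terms(docs)
print(sorted(corpus_stopwords))
```

A word that appears in every document gets an idf of 0 and is flagged, while rarer words are kept. Query operators would still have to be subtracted from the result if they happen to be frequent.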
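Answer 1 (score: 67)
I suggest you create your own list of operator words that you take out of the stopword list. Sets can be conveniently subtracted, so: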
operators = set(('and', 'or', 'not'))
stop = set(stopwords...) - operators
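Then you can simply test whether a word is in or not in the set, without relying on whether your operators are part of the stopword list. This also lets you switch to another stopword list or add an operator later.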
if word.lower() not in stop:
    # use word
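Put together, the set-subtraction idea might look like the following sketch. To keep it runnable without downloading the NLTK corpus, `base_stopwords` is a small hand-picked stand-in for `set(stopwords.words('english'))`, and a simple regular expression replaces the NLTK tokenizers; both are assumptions made for illustration, as is the helper name `strip_stopwords`.

```python
import re

# Stand-in for set(stopwords.words('english')): a tiny hand-picked subset.
base_stopwords = {"a", "an", "the", "is", "are", "and", "or", "not", "of", "in", "that"}
operators = {"and", "or", "not"}

# Remove the operators from the stopword set so they survive filtering.
stop = base_stopwords - operators

def strip_stopwords(query):
    """Lower-case the query, split it into word tokens and drop stopwords."""
    tokens = re.findall(r"[a-z0-9']+", query.lower())
    return [tok for tok in tokens if tok not in stop]

print(strip_stopwords("The cats and the dogs that are not in a house"))
# ['cats', 'and', 'dogs', 'not', 'house']
```

Because the operators are subtracted rather than special-cased in the loop, swapping in a different stopword list only changes the first assignment.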
Answer 2 (score: 31)
@alvas's answer does the job, but it can be done much faster. Assuming that you have documents, a list of strings:
from nltk.corpus import stopwords
from nltk.tokenize import wordpunct_tokenize
stop_words = set(stopwords.words('english'))
stop_words.update(['.', ',', '"', "'", '?', '!', ':', ';', '(', ')', '[', ']', '{', '}']) # remove it if you need punctuation
for doc in documents:
    list_of_words = [i.lower() for i in wordpunct_tokenize(doc) if i.lower() not in stop_words]
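Notice that because you are searching in a set here (not in a list), the lookup is theoretically about len(stop_words)/2 times faster, which matters if you need to process many documents.
For 5000 documents of approximately 300 words each, the difference is 1.8 seconds for my example versus 20 seconds for @alvas's.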
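The set-versus-list difference can be checked with a rough timing sketch like the one below. The vocabulary and tokens are synthetic placeholders (with NLTK, the list would be `stopwords.words('english')`), and the absolute timings will vary by machine, so treat the numbers as indicative only.

```python
import timeit

# Hypothetical stand-in vocabulary of 200 "stopwords".
stop_list = ["word%d" % i for i in range(200)]
stop_set = set(stop_list)

# Synthetic tokens: half of them are in the stopword vocabulary.
tokens = ["word%d" % (i % 400) for i in range(4000)]

def filter_with(container):
    """Drop every token found in the given container (list or set)."""
    return [tok for tok in tokens if tok not in container]

# Both containers must produce the same filtered output.
assert filter_with(stop_list) == filter_with(stop_set)

list_time = timeit.timeit(lambda: filter_with(stop_list), number=5)
set_time = timeit.timeit(lambda: filter_with(stop_set), number=5)
print("list: %.4fs  set: %.4fs" % (list_time, set_time))
```

A list membership test scans the elements one by one, while a set uses a hash lookup, which is why the gap grows with the size of the stopword list.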
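P.S. In most cases you need to split the text into words in order to perform some other classification task for which tf-idf is used, so it would most probably be better to use a stemmer as well: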
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()
and use [porter.stem(i.lower()) for i in wordpunct_tokenize(doc) if i.lower() not in stop_words] inside the loop.
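Answer 3 (score: 14)
@alvas has a good answer. But again it depends on the nature of the task. For example, in your application you may want to treat all conjunctions (e.g. and, or, but, if) and all determiners (e.g. the, a, some, most, every, no) as stop words and consider all other parts of speech legitimate. In that case you might want to look into this solution, which uses a part-of-speech tagset to discard words; check table 5.1: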
import nltk
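# 'DET' and 'CNJ' follow table 5.1; nltk.pos_tag returns Penn Treebank tags by default, so 'DT' and 'CC' are included too
STOP_TYPES = ['DET', 'CNJ', 'DT', 'CC']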
text = "some data here "
tokens = nltk.pos_tag(nltk.word_tokenize(text))
good_words = [w for w, wtype in tokens if wtype not in STOP_TYPES]
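The filtering step itself does not depend on the tagger, so it can be sketched on its own. In the example below the `(word, tag)` pairs are written by hand in the shape that `nltk.pos_tag` returns, using Penn Treebank tags (`DT` for determiners, `CC` for coordinating conjunctions). They are illustrative assumptions, not real tagger output, and `drop_by_tag` is a hypothetical helper name.

```python
# Hand-written (word, tag) pairs; with NLTK these would come from
# nltk.pos_tag(nltk.word_tokenize(text)).
tagged = [
    ("the", "DT"), ("cat", "NN"), ("and", "CC"),
    ("a", "DT"), ("dog", "NN"), ("sleep", "VBP"),
]

# Tags to discard: determiners and coordinating conjunctions.
STOP_TAGS = {"DT", "CC"}

def drop_by_tag(tagged_tokens, stop_tags):
    """Keep only the words whose part-of-speech tag is not in stop_tags."""
    return [word for word, tag in tagged_tokens if tag not in stop_tags]

print(drop_by_tag(tagged, STOP_TAGS))
# ['cat', 'dog', 'sleep']
```

Note that this drops "and" along with the determiners, so a query parser that needs boolean operators would have to leave `CC` out of the tag set or whitelist those words explicitly.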
Answer 4 (score: 5)
You can use string.punctuation together with the built-in NLTK stopwords list:
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from string import punctuation
def tokenize(text):
    # Split into sentences, then flatten the per-sentence tokens into one list of words
    return [word for sent in sent_tokenize(text) for word in word_tokenize(sent)]
def removeStopWords(words):
    customStopWords = set(stopwords.words('english') + list(punctuation))
    return [word for word in words if word not in customStopWords]
text = "some data here"  # placeholder input
words = tokenize(text)
wordsWOStopwords = removeStopWords(words)
NLTK stopwords complete list
Answer 5 (score: 0)
Remove stop words from a string
Here I have also added a custom list of stop words
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords # Stop words
stop_words = set(stopwords.words('english'))
stop_words.update(list(set(['zero' , 'one' , 'two' ,
'three' , 'four' , 'five' ,
'six' , 'seven' , 'eight' ,
'nine' , 'ten' ,
'may' , 'also' , 'across' ,
'among' , 'beside' , 'however' ,
'yet' , 'within' ,
'jan' , 'feb' , 'mar' ,
'apr' , 'may' , 'jun' ,
'jul' , 'aug' , 'sep' ,
'oct' , 'nov' , 'dec' ,
'january' , 'february', 'march' ,
'april' , 'may' , 'june' ,
'july' , 'august' , 'september',
'october' , 'november', 'december' ,
'summer' , 'winter' , 'fall' ,
'spring',
"a" , "about" , "above" , "after" ,
"again" , "against" , "ain" , "aren't" ,
"all" , "am" , "an" , "and" ,
"any" , "are" , "aren" , "as" ,
"at" ,
"be" , "because" , "been" , "before" ,
"being" , "below" , "between", "both" ,
"but" , "by" ,
"can" , "couldn" , "couldn't" , "could" ,
"d" , "did" , "didn" , "didn't" ,
"do" , "does" , "doesn" , "doesn't" ,
"doing" , "don" , "don't" , "down" ,
"during" ,
"each" ,
"few" , "for" , "from" , "further" ,
"had" , "hadn" , "hadn't" , "has" ,
"hasn" , "hasn't" , "have" , "haven" ,
"haven't" , "having" , "he" , "her" ,
"here" , "hers" , "herself" , "him" ,
"himself" , "his" , "how" ,
"he'd" , "he'll" , "he's" , "here's" ,
"how's" ,
"i" , "if" , "in" , "into" ,
"is" , "isn" , "isn't" , "it" ,
"it's" , "its" , "itself" , "i'd" ,
"i'll" , "i'm" , "i've" ,
"just" ,
"ll" , "let's" ,
"m" , "ma" ,"me" ,
"mightn" , "mightn't" , "more" , "most" ,
"mustn" , "mustn't" , "my" , "myself" ,
"needn" , "needn't" , "no" , "nor" ,
"not" , "now" ,
"o" , "of" , "off" , "on" ,
"once" , "only" , "or" , "other" ,
"our" , "ours" , "ourselves" , "out" ,
"over" , "own" , "ought" ,
"re" ,
"s" , "same" , "shan" , "shan't" ,
"she" , "she's" , "should" , "should've",
"shouldn" , "shouldn't", "so" , "some" ,
"such" , "she'd" , "she'll" ,
"t" , "than" , "that" , "that'll" ,
"the" , "their" , "theirs" , "them" ,
"themselves", "then" , "there" , "these" ,
"they" , "this" , "those" , "through" ,
"to" , "too" , "that's" , "there's" ,
"they'd" , "they'll" , "they're" , "they've" ,
"under" , "until" , "up" ,
"ve" , "very" ,
"was" , "wasn" , "wasn't" , "we" ,
"were" , "weren" , "weren't" , "what" ,
"when" , "where" , "which" , "while" ,
"who" , "whom" , "why" , "will" ,
"with" , "won" , "won't" , "wouldn" ,
"wouldn't" , "we'd" , "we'll" , "we're" ,
"we've" , "what's" , "when's" , "where's" ,
"who's" , "why's" , "would" ,
"y" , "you" , "you'd" , "you'll" ,
"you're" , "you've" , "your" , "yours" , "yourself",
"yourselves",
'a',"able", "abst", "accordance", "according", "accordingly", "across", "act", "actually" ,
"added", "adj", "affected", "affecting", "affects", "afterwards", "ah", "almost" ,
"alone", "along", "already", "also", "although", "always", "among", "amongst", "anyone" ,
"announce", "another", "anybody", "anyhow", "anymore", "anything", "anyway", "anyways" ,
"anywhere", "apparently", "approximately", "arent", "arise", "around", "aside", "ask" ,
"asking", "auth", "available", "away", "awfully", "a's", "ain't", "allow", "allows", "apart" ,
"appear", "appreciate", "appropriate", "associated" ,
"b", "back", "became", "become", "becomes", "becoming", "beforehand", "begin", "beginning" ,
"beginnings", "begins", "behind", "believe", "beside", "besides", "beyond", "biol", "brief" ,
"briefly" ,
"c", "ca", "came", "cannot", "can't", "cause", "causes", "certain", "certainly", "co", "com" ,
"come", "comes", "contain", "containing", "contains", "couldnt" ,
'd',"date", "different", "done", "downwards", "due" ,
"e", "ed", "edu", "effect", "eg", "eight", "eighty", "either", "else", "elsewhere", "end" ,
"ending", "enough", "especially", "et", "etc", "even", "ever", "every", "everybody","except" ,
"everyone", "everything", "everywhere", "ex" ,
"f", "far", "ff", "fifth", "first", "five", "fix", "followed", "following", "follows", "four" ,
"former", "formerly", "forth", "found", "furthermore" ,
"g", "gave", "get", "gets", "getting", "give", "given", "gives", "go", "goes", "got","gone" ,
"gotten", "giving" ,
"h", "happens", "hardly", "hed", "hence", "hereafter", "hereby", "herein", "heres", "however" ,
"hereupon", "hes", "hi", "hid", "hither", "home", "howbeit", "hundred" ,
"id", "ie", "im", "immediately", "importance", "important", "inc", "indeed", "itd", "index" ,
'i',"information", "instead", "invention", "it'll", "inward", "immediate" ,
"j",
"k", "keep", "keeps", "kept", "kg", "km", "know", "known", "knows" ,
"l", "largely", "last", "lately", "later", "latter", "latterly", "least", "less", "lest", "ltd",
"let", "lets", "like", "liked", "likely", "line", "little", "'ll", "look", "looking", "looks" ,
'm',"made", "mainly", "make", "makes", "many", "maybe", "mean", "means", "meantime", "merely", "mg",
"might", "million", "miss", "ml", "moreover", "mostly", "mr", "mrs", "much", "mug", "must" ,
"meanwhile", "may" ,
"n", "na", "name", "namely", "nay", "nd", "near", "nearly", "necessarily", "necessary", "need" ,
"needs", "neither", "never", "nevertheless", "new", "next", "nine", "ninety", "nobody", "non" ,
"none", "nonetheless", "noone", "normally", "nos", "noted", "nothing", "nowhere", "n2", "nc" ,
"nd", "ne", "ng", "ni", "nj", "nl", "nn", "nr", "ns", "nt", "ny" ,
'o',"obtain", "obtained", "obviously", "often", "oh", "ok", "okay", "old", "omitted", "one", "ones",
"onto", "ord", "others", "otherwise", "outside", "overall", "owing", "oa", "ob", "oc", "od" ,
"of", "og", "oi", "oj", "ol", "om", "on", "oo", "oq", "or", "os", "ot", "ou", "ow", "ox", "oz" ,
"p", "page", "pages", "part", "particular", "particularly", "past", "per", "perhaps", "placed" ,
"please", "plus", "poorly", "possible", "possibly", "potentially", "pp", "predominantly" ,
"present", "previously", "primarily", "probably", "promptly", "proud", "provides", "put" ,
"p1", "p2", "p3", "pc", "pd", "pe", "pf", "ph", "pi", "pj", "pk", "pl", "pm", "pn", "po", "pq" ,
"pr", "ps", "pt", "pu", "py" ,
"q", "que", "quickly", "quite", "qv", "qj", "qu" ,
'r',"readily", "really", "recent", "recently", "ref", "refs", "regarding", "regardless", "regards" ,
"related", "relatively", "research", "respectively", "resulted", "resulting", "results", "run" ,
"right", "r2", "ra", "rc", "rd", "rf", "rh", "ri", "rj", "rl", "rm", "rn", "ro", "rq", "rr" ,
"rs", "rt", "ru", "rv", "ry", "r", "ran", "rather", "rd" ,
's',"said", "saw", "say", "saying", "says", "sec", "section", "see", "seeing", "seem", "seemed" ,
"seeming", "seems", "seen", "self", "selves", "sent", "seven", "several", "shall", "shed" ,
"shes", "show", "showed", "shown", "showns", "shows", "significant", "significantly" ,
"similar", "similarly", "since", "six", "slightly", "somebody", "somehow", "someone", "soon" ,
"somewhat", "somewhere", "specifically", "specified", "specify", "specifying", "still", "stop" ,
"strongly", "sub", "substantially", "successfully", "sufficiently", "suggest", "sup", "sure" ,
"s2", "sa", "sc", "sd", "se", "sf", "si", "sj", "sl", "sm", "sn", "sp", "sq", "sr", "ss", "st" ,
"sy", "sz", "sorry", "sometime", "somethan", "something", "sometimes" ,
't',"take", "taken", "taking", "tell", "tends", "thank", "thanx", "that've", "thence", "thereafter",
"thereby", "therefore", "therein", "there'll", "thereof", "therere", "thereto", "thereupon" ,
"there've", "theyd", "theyre", "think", "thou", "though", "thoughh", "thousand", "throug" ,
"throughout", "thru", "thus", "til", "tip", "together", "took", "toward", "towards", "tried" ,
"tries", "truly", "try", "trying", "ts", "twice", "two", "thats", "thanks", "th", "thered" ,
"theres", "t1", "t2", "t3", "tb", "tc", "td", "te", "tf", "th", "ti", "tj", "tl", "tm", "tn" ,
"tp", "tq", "tr", "ts", "tt", "tv", "tx" ,
"u", "un", "unfortunately", "unless", "unlike", "unlikely", "unto", "upon", "ups", "us", "use" ,
"used", "useful", "usefully", "usefulness", "uses", "using", "usually", "ue", "ui", "uj", "uk" ,
"um", "un", "uo", "ur", "ut",
"v", "value", "various", "'ve", "via", "viz", "vol", "vols", "vs", "va", "vd", "vj", "vo", "vq",
"vt", "vu" ,
"w", "want", "wants", "wasnt", "way", "wed", "welcome", "went", "werent", "whatever", "what'll",
"whats", "whence", "whenever", "whereas", "whereby", "wherein", "wheres", "wherever", "whether",
"whim", "whither", "whod", "whoever", "whole", "who'll", "whomever", "whos", "whose", "widely" ,
"whereupon", "willing", "wish", "within", "without", "wont", "words", "world", "wouldnt", "www",
"wi", "wa", "wo",
"x", "x1", "x2", "x3", "xf", "xi", "xj", "xk", "xl", "xn", "xo", "xs", "xt", "xv", "xx",
"yes", "yet", "youd", "youre", "y2", "yj", "yl", "yr", "ys", "yt",
"z", "zero", "zi", "zz",
"best", "better", "c'mon", "c's", "cant", "changes", "clearly", "concerning", "consequently", "consider", "considering", "corresponding", "course", "currently", "definitely", "described", "despite", "entirely", "exactly", "example", "going", "greetings", "hello", "help", "hopefully", "ignored", "inasmuch", "indicate", "indicated", "indicates", "inner", "insofar", "it'd", "keep", "keeps", "novel", "presumably", "reasonably", "second", "secondly", "sensible", "serious", "seriously", "sure", "t's", "third", "thorough", "thoroughly", "three", "well", "wonder", "a", "about", "above", "above", "across", "after", "afterwards", "again", "against", "all", "almost", "alone", "along", "already", "also", "although", "always", "am", "among", "amongst", "amoungst", "amount", "an", "and", "another", "any", "anyhow", "anyone", "anything", "anyway", "anywhere", "are", "around", "as", "at", "back", "be", "became", "because", "become", "becomes", "becoming", "been", "before", "beforehand", "behind", "being", "below", "beside", "besides", "between", "beyond", "bill", "both", "bottom", "but", "by", "call", "can", "cannot", "cant", "co", "con", "could", "couldnt", "cry", "de", "describe", "detail", "do", "done", "down", "due", "during", "each", "eg", "eight", "either", "eleven", "else", "elsewhere", "empty", "enough", "etc", "even", "ever", "every", "everyone", "everything", "everywhere", "except", "few", "fifteen", "fify", "fill", "find", "fire", "first", "five", "for", "former", "formerly", "forty", "found", "four", "from", "front", "full", "further", "get", "give", "go", "had", "has", "hasnt", "have", "he", "hence", "her", "here", "hereafter", "hereby", "herein", "hereupon", "hers", "herself", "him", "himself", "his", "how", "however", "hundred", "ie", "if", "in", "inc", "indeed", "interest", "into", "is", "it", "its", "itself", "keep", "last", "latter", "latterly", "least", "less", "ltd", "made", "many", "may", "me", "meanwhile", "might", "mill", "mine", "more", "moreover", "most", 
"mostly", "move", "much", "must", "my", "myself", "name", "namely", "neither", "never", "nevertheless", "next", "nine", "no", "nobody", "none", "noone", "nor", "not", "nothing", "now", "nowhere", "of", "off", "often", "on", "once", "one", "only", "onto", "or", "other", "others", "otherwise", "our", "ours", "ourselves", "out", "over", "own", "part", "per", "perhaps", "please", "put", "rather", "re", "same", "see", "seem", "seemed", "seeming", "seems", "serious", "several", "she", "should", "show", "side", "since", "sincere", "six", "sixty", "so", "some", "somehow", "someone", "something", "sometime", "sometimes", "somewhere", "still", "such", "system", "take", "ten", "than", "that", "the", "their", "them", "themselves", "then", "thence", "there", "thereafter", "thereby", "therefore", "therein", "thereupon", "these", "they", "thickv", "thin", "third", "this", "those", "though", "three", "through", "throughout", "thru", "thus", "to", "together", "too", "top", "toward", "towards", "twelve", "twenty", "two", "un", "under", "until", "up", "upon", "us", "very", "via", "was", "we", "well", "were", "what", "whatever", "when", "whence", "whenever", "where", "whereafter", "whereas", "whereby", "wherein", "whereupon", "wherever", "whether", "which", "while", "whither", "who", "whoever", "whole", "whom", "whose", "why", "will", "with", "within", "without", "would", "yet", "you", "your", "yours", "yourself", "yourselves", "the", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z", "co", "op", "research-articl", "pagecount", "cit", "ibid", "les", "le", "au", "que", "est", "pas", "vol", "el", "los", "pp", "u201d", "well-b", "http", "volumtype", "par",
"0o", "0s", "3a", "3b", "3d", "6b", "6o",
"a1", "a2", "a3", "a4", "ab", "ac", "ad", "ae", "af", "ag", "aj", "al", "an", "ao", "ap", "ar", "av", "aw", "ax", "ay", "az",
"b1", "b2", "b3", "ba", "bc", "bd", "be", "bi", "bj", "bk", "bl", "bn", "bp", "br", "bs", "bt", "bu", "bx",
"c1", "c2", "c3", "cc", "cd", "ce", "cf", "cg", "ch", "ci", "cj", "cl", "cm", "cn", "cp", "cq", "cr", "cs", "ct", "cu", "cv", "cx", "cy", "cz",
"d2", "da", "dc", "dd", "de", "df", "di", "dj", "dk", "dl", "do", "dp", "dr", "ds", "dt", "du", "dx", "dy",
"e2", "e3", "ea", "ec", "ed", "ee", "ef", "ei", "ej", "el", "em", "en", "eo", "ep", "eq", "er", "es", "et", "eu", "ev", "ex", "ey",
"f2", "fa", "fc", "ff", "fi", "fj", "fl", "fn", "fo", "fr", "fs", "ft", "fu", "fy",
"ga", "ge", "gi", "gj", "gl", "go", "gr", "gs", "gy",
"h2", "h3", "hh", "hi", "hj", "ho", "hr", "hs", "hu", "hy",
"i", "i2", "i3", "i4", "i6", "i7", "i8", "ia", "ib", "ic", "ie", "ig", "ih", "ii", "ij", "il", "in", "io", "ip", "iq", "ir", "iv", "ix", "iy", "iz",
"jj", "jr", "js", "jt", "ju",
"ke", "kg", "kj", "km", "ko",
"l2", "la", "lb", "lc", "lf", "lj", "ln", "lo", "lr", "ls", "lt",
"m2", "ml", "mn", "mo", "ms", "mt", "mu",
'i', 'ii', 'iii', 'iv', 'v', 'vi', 'vii', 'viii','ix', 'x',
'xi', 'xii', 'xiii', 'xiv', 'xv', 'xvi', 'xvii', 'xviii', 'xix', 'xx',
'xxi', 'xxii', 'xxiii', 'xxiv', 'xxv', 'xxvi', 'xxvii', 'xxviii', 'xxix', 'xxx',
'xxxi', 'xxxii', 'xxxiii', 'xxxiv', 'xxxv', 'xxxvi', 'xxxvii', 'xxxviii', 'xxxix', 'xl',
'xli', 'xlii', 'xliii', 'xliv', 'xlv', 'xlvi', 'xlvii', 'xlviii', 'xlix', 'l',
'li', 'lii', 'liii', 'liv', 'lv', 'lvi', 'lvii', 'lviii', 'lix', 'lx',
'lxi', 'lxii', 'lxiii', 'lxiv', 'lxv', 'lxvi', 'lxvii', 'lxviii', 'lxix', 'lxx',
'lxxi', 'lxxii', 'lxxiii', 'lxxiv', 'lxxv', 'lxxvi', 'lxxvii', 'lxxviii', 'lxxix', 'lxxx',
'lxxxi', 'lxxxii', 'lxxxiii', 'lxxxiv', 'lxxxv', 'lxxxvi', 'lxxxvii', 'lxxxviii', 'lxxxix', 'xc',
'xci', 'xcii', 'xciii', 'xciv', 'xcv', 'xcvi', 'xcvii', 'xcviii', 'xcix', 'c',
"one", "first", "two", "second", "three", "third",
"four", "fourth", "five", "fifth", "six", "sixth", "seven",
"seventh", "eight", "eighth", "nine", "ninth", "ten",
"tenth", "eleven", "eleventh", "twelve", "twelfth", "thirteen",
"thirteenth", "fourteen", "fourteenth", "fifteen", "fifteenth",
"sixteen", "sixteenth", "seventeen", "seventeenth", "eighteen",
"eighteenth", "nineteen", "nineteenth", "twenty", "twentieth",
"one", "22nd", "second", "nd", "st", "rd", "th",
"1","2","3","4","5","6","7","8","9","10th","11th","12th","13th","14th","15th",
"16th","17th","18th","19th","20th","21st","22nd","23rd","24th","25th","26th","27th",
"28th","29th","30th","31st","32nd","33rd","34th","35th","36th","37th","38th","39th",
"40th","41st","42nd","43rd","44th","45th","46th","47th","48th","49th","50th","51st",
"52nd","53rd","54th","55th","56th","57th","58th","59th","60th","61st","62nd","63rd",
"64th","65th","66th","67th","68th","69th","70th","71st","72nd","73rd","74th","75th",
"76th","77th","78th","79th","80th","81st","82nd","83rd","84th","85th","86th","87th",
"88th","89th","90th", "91st", "92nd", "93rd", "94th", "95th", "96th","97th", "98th",
"99th","100th","thirty","forty","fifty","thirty","thirtieth","forty","fortieth",
"fifty", "fiftieth","sixty","sixtieth","seventy","seventieth", "eighty",
"eightieth", "ninety", "ninetieth","one", "hundred", "100th", "hundredth",
"order","state","page","file",
"'d","'ll", "'m", "'re", "'s", "'ve", 'a',
'about', 'above', 'across', 'after', 'afterwards', 'again', 'against', 'all',
'almost', 'alone', 'along', 'already', 'also', 'although', 'always', 'am',
'among', 'amongst', 'amount', 'an', 'and', 'another', 'any', 'anyhow', 'anyone',
'anything', 'anyway', 'anywhere', 'are', 'around', 'as', 'at', 'back', 'be',
'became', 'because', 'become', 'becomes', 'becoming', 'been', 'before', 'beforehand',
'behind', 'being', 'below', 'beside', 'besides', 'between', 'beyond', 'both',
'bottom', 'but', 'by', 'ca', 'call', 'can', 'cannot', 'could', 'did', 'do', 'does',
'doing', 'done', 'down', 'due', 'during', 'each', 'eight', 'either', 'eleven',
'else', 'elsewhere', 'empty', 'enough', 'even', 'ever', 'every', 'everyone',
'everything', 'everywhere', 'except', 'few', 'fifteen', 'fifty', 'first',
'five', 'for', 'former', 'formerly', 'forty', 'four', 'from', 'front', 'full',
'further', 'get', 'give', 'go', 'had', 'has', 'have', 'he', 'hence', 'her',
'here', 'hereafter', 'hereby', 'herein', 'hereupon', 'hers', 'herself', 'him', 'himself',
'his', 'how', 'however', 'hundred', 'i', 'if', 'in', 'indeed', 'into', 'is', 'it',
'its', 'itself', 'just', 'keep', 'last', 'latter', 'latterly', 'least', 'less', 'made',
'make', 'many', 'may', 'me', 'meanwhile', 'might', 'mine', 'more', 'moreover', 'most',
'mostly', 'move', 'much', 'must', 'my', 'myself', "n't", 'name', 'namely', 'neither',
'never', 'nevertheless', 'next', 'nine', 'no', 'nobody', 'none', 'noone', 'nor', 'not',
'nothing', 'now', 'nowhere', 'n‘t', 'n’t', 'of', 'off', 'often', 'on', 'once', 'one',
'only', 'onto', 'or', 'other', 'others', 'otherwise', 'our', 'ours', 'ourselves', 'out',
'over', 'own', 'part', 'per', 'perhaps', 'please', 'put', 'quite', 'rather', 're', 'really',
'regarding', 'same', 'say', 'see', 'seem', 'seemed', 'seeming', 'seems', 'serious', 'several',
'she', 'should', 'show', 'side', 'since', 'six', 'sixty', 'so', 'some', 'somehow', 'someone',
'something', 'sometime', 'sometimes', 'somewhere', 'still', 'such', 'take', 'ten', 'than',
'that', 'the', 'their', 'them', 'themselves', 'then', 'thence', 'there', 'thereafter',
'thereby', 'therefore', 'therein', 'thereupon', 'these', 'they', 'third', 'this', 'those',
'though', 'three', 'through', 'throughout', 'thru', 'thus', 'to', 'together', 'too', 'top',
'toward', 'towards', 'twelve', 'twenty', 'two', 'under', 'unless', 'until', 'up', 'upon', 'us',
'used', 'using', 'various', 'very', 'via', 'was', 'we', 'well', 'were', 'what', 'whatever', 'when',
'whence', 'whenever', 'where', 'whereafter', 'whereas', 'whereby', 'wherein', 'whereupon', 'wherever',
'whether', 'which', 'while', 'whither', 'who', 'whoever', 'whole', 'whom', 'whose', 'why', 'will',
'with', 'within', 'without', 'would', 'yet', 'you', 'your', 'yours', 'yourself', 'yourselves', '‘d',
'‘ll', '‘m', '‘re', '‘s', '‘ve', '’d', '’ll', '’m', '’re', '’s', '’ve'
])))
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
from nltk.tokenize import word_tokenize
stop_words = set(stopwords.words("english"))  # a set makes the membership test below fast
sentence = "PDF.co is a website that contains different tools to read, write and process PDF documents"
words = word_tokenize(sentence)
sentence_wo_stopwords = [word for word in words if word not in stop_words]
print(" ".join(sentence_wo_stopwords))
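One detail worth keeping in mind with this kind of filter is that the membership test is case-sensitive, while the NLTK English list is lower-case, so capitalized stopwords slip through unless the token is lower-cased for the comparison. The sketch below illustrates this with a small hand-written stand-in for the stopword set and an already-tokenized sentence; both are assumptions made so that it runs without NLTK data.

```python
# Stand-in for set(stopwords.words("english")); the real list is also lower-case.
stop_words = {"is", "a", "that", "to", "and", "the"}

# Already-tokenized input, as word_tokenize might return it.
words = ["The", "site", "Is", "a", "tool", "to", "read", "and", "write", "PDF"]

# Case-sensitive test: "The" and "Is" are kept because they are not lower-case.
case_sensitive = [w for w in words if w not in stop_words]

# Compare on the lower-cased token but keep the original spelling in the output.
case_insensitive = [w for w in words if w.lower() not in stop_words]

print(case_sensitive)    # ['The', 'site', 'Is', 'tool', 'read', 'write', 'PDF']
print(case_insensitive)  # ['site', 'tool', 'read', 'write', 'PDF']
```

Lower-casing only for the comparison keeps tokens such as "PDF" in their original form in the output.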