I am trying to tokenize my data, but I keep struggling. Note that I am new to NLP.
This is what my data (called spam) looks like:
Out[8]:
text
0 Free entry in 2 a wkly comp to win FA Cup fina...
1 FreeMsg Hey there darling it's been 3 week's n...
2 WINNER!! As a valued network customer you have...
3 Had your mobile 11 months or more? U R entitle...
4 SIX chances to win CASH! From 100 to 20,000 po...
This is what I have tried so far:
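from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS

# custom_stopwords is my own collection of extra words to drop (defined elsewhere)
def tokenize(text):
    tokens = [token for token in simple_preprocess(text)
              if token not in STOPWORDS]
    return [token for token in tokens
            if token not in custom_stopwords]

tokenize(spam)  # spam is the whole DataFrame, not a string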
When I run this, I get the following error:
TypeError: decoding to str: need a bytes-like object, DataFrame found
So I tried decoding it like this:
open(spam).read().decode('utf-8')
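But this also produces an error:
TypeError: expected str, bytes or os.PathLike object, not DataFrame
Judging from the errors, the problem is that spam is a DataFrame, but I don't know what to do about it.
I also tried using the nltk.tokenize() function, but that gave me another error:
TypeError: 'module' object is not callable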
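Answer 0 (score: 0)
You should apply the function to the text column, one row at a time, rather than passing the whole DataFrame, for example with spam['text'].apply(tokenize).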
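A minimal sketch of that idea, under some assumptions: the sample rows are made up for illustration, and `simple_preprocess`, `STOPWORDS`, and `custom_stopwords` are small hypothetical stand-ins so the snippet runs without gensim. With gensim installed, the real `simple_preprocess` and `STOPWORDS` could presumably be imported instead and the rest left unchanged.

```python
import re

import pandas as pd

# Hypothetical stand-ins (assumptions, not gensim's actual objects).
STOPWORDS = {"a", "to", "your", "or"}
custom_stopwords = {"entry"}


def simple_preprocess(text):
    """Rough stand-in: lowercase the string and keep alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tokenize(text):
    """Tokenize ONE string and drop both stopword sets."""
    tokens = [token for token in simple_preprocess(text)
              if token not in STOPWORDS]
    return [token for token in tokens
            if token not in custom_stopwords]


# Made-up sample rows shaped like the question's DataFrame.
spam = pd.DataFrame({"text": ["Free entry to win a prize",
                              "Had your mobile 11 months or more"]})

# Series.apply calls tokenize once per row, so each call receives a str.
spam["tokens"] = spam["text"].apply(tokenize)
print(spam["tokens"].tolist())
```

Both TypeErrors in the question come from handing the whole DataFrame to something that expects a single string or a file path. The third error has a different cause: nltk.tokenize is a module, so it cannot be called; the callable lives inside it (for instance nltk.tokenize.word_tokenize, which needs NLTK's tokenizer data to be downloaded first).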