Tokenizing and decoding

Time: 2019-03-13 13:58:34

Tags: pandas utf-8 nltk token

I am trying to tokenize my data, but I keep getting stuck. Please note that I am new to NLP.

This is what my data (a DataFrame called spam) looks like:

Out[8]: 
                                            text
0  Free entry in 2 a wkly comp to win FA Cup fina...
1  FreeMsg Hey there darling it's been 3 week's n...
2  WINNER!! As a valued network customer you have...
3  Had your mobile 11 months or more? U R entitle...
4  SIX chances to win CASH! From 100 to 20,000 po...

This is what I have tried so far:

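# Assumed imports: simple_preprocess and STOPWORDS look like the gensim names
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS

def tokenize(text):
    tokens = [token for token in simple_preprocess(text)
              if token not in STOPWORDS]
    return [token for token in tokens
            if token not in custom_stopwords]  # custom_stopwords is defined elsewhere

tokenize(spam)  # spam is the DataFrame shown above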

When I run this, I get the following error:

TypeError: decoding to str: need a bytes-like object, DataFrame found

So I tried decoding it like this:

open(spam).read().decode('utf-8')

But this also produces an error:

TypeError: expected str, bytes or os.PathLike object, not DataFrame

From these errors I can see that the problem is that spam is a DataFrame, but I don't know what to do about it.

I also tried calling nltk.tokenize() directly, but that gave me another error:

TypeError: 'module' object is not callable

1 Answer:

Answer 0: (score: 0)

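You should apply the function to the text column, one string at a time, rather than passing it the whole DataFrame, for example with spam['text'].apply(tokenize).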
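A minimal sketch of the column-wise call, under some loud assumptions: basic_preprocess is a hypothetical regex stand-in for the simple_preprocess used in the question (so the example runs without gensim), the two stopword sets are made up, and the two rows are shortened versions of the sample data.

```python
import re

import pandas as pd

# Hypothetical stopword sets, for illustration only.
STOPWORDS = frozenset({"a", "in", "to", "the", "it", "you", "as", "have"})
custom_stopwords = frozenset({"free"})


def basic_preprocess(text):
    """Rough stand-in for a real preprocessor: lowercase alphabetic
    tokens of at least two letters."""
    return re.findall(r"[a-z]{2,}", text.lower())


def tokenize(text):
    """Tokenize ONE string and drop both sets of stopwords."""
    tokens = [t for t in basic_preprocess(text) if t not in STOPWORDS]
    return [t for t in tokens if t not in custom_stopwords]


# Shortened sample rows; the real frame has the same single 'text' column.
spam = pd.DataFrame({"text": [
    "Free entry in 2 a wkly comp to win FA Cup",
    "WINNER!! As a valued network customer you have",
]})

# Series.apply calls tokenize once per row, passing a str each time.
spam["tokens"] = spam["text"].apply(tokenize)
print(spam["tokens"].tolist())
```

The first two errors in the question both come from handing a DataFrame to code that expects a single string or a file path. The third has a similar cause: nltk.tokenize is a module, so a function inside it (for instance nltk.tokenize.word_tokenize) has to be called, again on one string at a time.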
