Text classification using decision trees in Python

Date: 2018-01-04 07:41:41

Tags: python machine-learning classification decision-tree sklearn-pandas

I am new to Python and machine learning. My implementation is based on the IEEE research paper http://ieeexplore.ieee.org/document/7320414/ ("Bug report, feature request, or simply praise? On automatically classifying app reviews").

I want to classify text. The text consists of user reviews from the Google Play Store or the Apple App Store. The categories used in the research are Bug, Feature, Rating, and UserExperience. Given this setup, I am trying to implement a decision tree using the sklearn package in Python. I came across the 'IRIS' example dataset that ships with sklearn, which builds a tree model from features and their values mapped to targets. In that example the data is numeric.
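For reference, here is a minimal sketch of that iris example as I understand it (standard sklearn usage, included only for comparison with the text case below):

    from sklearn.datasets import load_iris
    from sklearn import tree

    # load the numeric iris data: a (150, 4) feature matrix and integer class targets
    iris = load_iris()
    clf = tree.DecisionTreeClassifier()
    clf.fit(iris.data, iris.target)

    # predict the class of one new sample; note the 2D shape (1 sample x 4 features)
    print(clf.predict([[5.1, 3.5, 1.4, 0.2]]))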

I am trying to classify text rather than numeric data. Examples:

1. I love the upgrade to pdfs. However, they are not displaying anymore. Fix it and it will be perfect. [BUG]
2. I wish it would notify me if I go below a certain amount. [FEATURE]
3. This app is very helpful for my business. [Rating]
4. Easy to find songs and purchase them in iTunes. [UserExperience]

Given these texts, and many more user reviews in these categories, I want to create a classifier that can be trained on this data and predict the target for any given user review.

So far, I have preprocessed the text and created the training data as a list of tuples containing the preprocessed data and its target.

My preprocessing:

1. Tokenize multi-sentence comments into single sentences
2. Tokenize each sentence into words
3. Remove stop words from the tokenized sentences
4. Lemmatize the words in the tokenized sentences, giving tuples of the form:
5. (['i', 'love', 'much', 'upgrade', 'pdfs', 'however', 'displaying', 'anymore', 'fix', 'perfect'], "BUG")

Here is what I have so far:

    import json
    from sklearn import tree
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import sent_tokenize, RegexpTokenizer

    # define a tokenizer to tokenize sentences and also remove punctuation
    tokenizer = RegexpTokenizer(r'\w+')

    # this list stores all the training data along with its label
    tagged_tokenized_comments_corpus = []


    # Method: to add data to training set
    # Parameter: Tuple in the format (Data, Label)
    def tag_tokenized_comments_corpus(*tuple_data):
        tagged_tokenized_comments_corpus.append(tuple_data)


    # step 1: Load all the stop words from the nltk package
    stop_words = stopwords.words("english")
    stop_words.remove('not')

    # creating a temporary list to copy the existing stop words
    temp_stop_words = stop_words

    for word in temp_stop_words:
        if "n't" in word:
            stop_words.remove(word)

    # load the data set
    files = ["Bug.txt", "Feature.txt", "Rating.txt", "UserExperience.txt"]

    d = {"Bug": 0, "Feature": 1, "Rating": 2, "UserExperience": 3}

    for file in files:
        input_file = open(file, "r")
        file_text = input_file.read()
        json_content = json.loads(file_text)

        # step 3: Tokenize multi sentence into single sentences from the user comments
        comments_corpus = []
        for i in range(len(json_content)):
            comments = json_content[i]['comment']
            if len(sent_tokenize(comments)) > 1:
                for comment in sent_tokenize(comments):
                    comments_corpus.append(comment)
            else:
                comments_corpus.append(comments)

        # step 4: Tokenize each sentence, remove stop words and lemmatize the comments corpus
        lemmatizer = WordNetLemmatizer()
        tokenized_comments_corpus = []
        for i in range(len(comments_corpus)):
            words = tokenizer.tokenize(comments_corpus[i])
            tokenized_sentence = []
            for w in words:
                if w not in stop_words:
                    tokenized_sentence.append(lemmatizer.lemmatize(w.lower()))
            if tokenized_sentence:
                tokenized_comments_corpus.append(tokenized_sentence)
                tag_tokenized_comments_corpus(tokenized_sentence, d[input_file.name.split(".")[0]])

    # step 5: Create a dictionary of words from the tokenized comments corpus
    unique_words = []
    for sentence in tagged_tokenized_comments_corpus:
        for word in sentence[0]:
            unique_words.append(word)
    unique_words = set(unique_words)

    dictionary = {}
    i = 0
    for dict_word in unique_words:
        dictionary.update({i: dict_word})
        i = i + 1


    train_target = []
    train_data = []
    for sentence in tagged_tokenized_comments_corpus:
        train_target.append(sentence[0])
        train_data.append(sentence[1])

    clf = tree.DecisionTreeClassifier()
    clf.fit(train_data, train_target)

    test_data = "Beautiful Keep it up.. this far is the most usable app editor.. it makes my photos more beautiful and alive.."

    test_words = tokenizer.tokenize(test_data)
    test_tokenized_sentence = []
    for test_word in test_words:
        if test_word not in stop_words:
            test_tokenized_sentence.append(lemmatizer.lemmatize(test_word.lower()))

    # predict using the classifier
    print("predicting the labels: ")
    print(clf.predict(test_tokenized_sentence))
      

However, this does not seem to work, because it throws an error at runtime when training the algorithm. I was thinking that if I could map the words in the tuples to a dictionary and convert the text into numeric form, I could then train the algorithm. But I am not sure whether this is feasible.

Can anyone suggest how I can fix this code? Or is there a better way to implement this decision tree?

    Traceback (most recent call last):
      File "C:/Users/venka/Documents/GitHub/RE-18/Test.py", line 87, in <module>
        clf.fit(train_data, train_target)
      File "C:\Users\venka\Anaconda3\lib\site-packages\sklearn\tree\tree.py", line 790, in fit
        X_idx_sorted=X_idx_sorted)
      File "C:\Users\venka\Anaconda3\lib\site-packages\sklearn\tree\tree.py", line 116, in fit
        X = check_array(X, dtype=DTYPE, accept_sparse="csc")
      File "C:\Users\venka\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 441, in check_array
        "if it contains a single sample.".format(array))
    ValueError: Expected 2D array, got 1D array instead:
    array=[ 0.  0.  0. ...,  3.  3.  3.].
    Reshape your data either using array.reshape(-1, 1) if your data has a
    single feature or array.reshape(1, -1) if it contains a single sample.
      

1 Answer:

Answer 0 (score: 2):

A decision tree can only work when your feature vectors are all the same length. Personally, I have no idea how effective decision trees are for text analysis like this, but if you want to try it, the way I would suggest is a "one-hot", "bag of words" style vector.

Essentially, keep a tally of how often words appear in your examples, and put those counts in a vector that represents the whole corpus. Say, for instance, that once you removed all the stop words, the set for the entire corpus was:

{"Apple", "Banana", "Cherry", "Date", "Eggplant"}

You represent this with a vector of the same size as the corpus, where each value indicates whether or not that word appears. In our example, that is a length-5 vector in which the first element is associated with "Apple", the second with "Banana", and so on. You might get something like:

bag("Apple Banana Date")
#: [1, 1, 0, 1, 0]
bag("Cherry")
#: [0, 0, 1, 0, 0]
bag("Date Eggplant Banana Banana")
#: [0, 1, 0, 1, 1]
# For this case, I have no clue if Banana having the value 2 would improve results.
# It might. It might not. Something you'd need to test.

That way, you have vectors of the same size regardless of the input, and the decision tree knows where to look for certain outputs. Say "Banana" corresponds strongly to bug reports, in which case the decision tree will know that a 1 in the second element means a bug report is more likely.
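To make that concrete, here is a minimal sketch (not code from the question) of how such bag-of-words vectors could be built and fed to sklearn using CountVectorizer with binary=True; the comments and labels below are made-up placeholders:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.tree import DecisionTreeClassifier

    # made-up training comments and labels (0=Bug, 1=Feature, 2=Rating, 3=UserExperience)
    comments = [
        "love the upgrade to pdfs however they are not displaying anymore fix it",
        "wish it would notify me if i go below a certain amount",
        "easy to find songs and purchase in itunes",
    ]
    labels = [0, 1, 3]

    # binary=True gives one-hot "word present / absent" vectors over the whole vocabulary,
    # so every comment becomes a vector of the same length
    vectorizer = CountVectorizer(binary=True)
    X = vectorizer.fit_transform(comments)

    clf = DecisionTreeClassifier()
    clf.fit(X, labels)

    # new comments must go through the same vectorizer (same vocabulary, same vector length)
    X_test = vectorizer.transform(["it keeps crashing please fix it"])
    print(clf.predict(X_test))

CountVectorizer handles tokenization and the vocabulary for you, so the hand-rolled dictionary from the question isn't needed; you could also join your already-lemmatized tokens back into strings and feed those in instead.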

Of course, your corpus might be thousands of words long. In that case, your decision tree probably won't be the best tool for the job. Not unless you first spend some time on feature reduction.
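If you do go down that road, one possible sketch of such feature reduction (again only an illustration with standard sklearn utilities, assuming `comments` and `labels` as in the sketch above) is to cap the vocabulary size and keep only the words that correlate most with the labels:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import SelectKBest, chi2

    # cap the vocabulary at the 500 most frequent words (the number is arbitrary)
    vectorizer = CountVectorizer(binary=True, max_features=500, stop_words="english")
    X = vectorizer.fit_transform(comments)

    # keep only the words that correlate most strongly with the labels
    # (k must not exceed the number of features, hence the min())
    selector = SelectKBest(chi2, k=min(100, X.shape[1]))
    X_reduced = selector.fit_transform(X, labels)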