Question

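I am analysing a few million emails. My aim is to classify them into groups. Groups could be, for example:

Delivery problems (slow delivery, slow handling before dispatch, incorrect availability information, etc.)
Customer service problems (slow email response time, impolite replies, etc.)
Return issues (slow handling of return requests, lack of helpfulness from customer service, etc.)
Pricing complaints (hidden fees discovered, etc.)

To carry out this classification, I need an NLP library that can recognize combinations of word groups such as:

"[they|the company|the firm|the website|the merchant]"
"[did not|didn't|no]"
"[response|respond|answer|reply]"
"[before the next day|fast enough]"
etc.

A few of these example groups in combination should then match sentences like:

"They didn't respond"
"They didn't respond at all"
"There was no response at all"
"I did not receive a response from the website"

and then classify the sentence as a customer service problem.

Which NLP library would be able to handle such a task? From what I have read, these are the most relevant:

Stanford CoreNLP
OpenNLP

Also check these suggested NLP's.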

Answer 1

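Using the OpenNLP doccat API, you can create training data and then build a model from that training data. The advantage of this over something like a naive Bayes classifier is that it returns a probability distribution over the set of categories.

So if you create a file in this format: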

customerserviceproblems They did not respond
customerserviceproblems They didn't respond 
customerserviceproblems They didn't respond at all
customerserviceproblems They did not respond at all
customerserviceproblems I received no response from the website
customerserviceproblems I did not receive response from the website

etc. Provide as many samples as possible, and make sure each line ends with a \n newline.

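With this approach you can add whatever you want to the "customer service problems" category, and you can add any other categories as well, so you don't have to be too deterministic about which data belongs to which category.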
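A minimal sketch of producing a file in this layout from labeled samples, in case the samples come from code or a database rather than a hand-edited file. The class name `DoccatTrainingFile`, the method `toLine`, and the path `train.txt` are hypothetical; only the line layout (category, a space, the text, then `\n`) is taken from the answer.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

final class DoccatTrainingFile {

    // Formats one sample as "<category> <text>\n".
    // Whitespace runs inside the text (including line breaks) are collapsed
    // so that one sample always occupies exactly one line.
    static String toLine(String category, String text) {
        String singleLine = text.trim().replaceAll("\\s+", " ");
        return category + " " + singleLine + "\n";
    }

    public static void main(String[] args) throws IOException {
        // Illustrative samples: {category, text}
        List<String[]> samples = new ArrayList<>();
        samples.add(new String[] {"customerserviceproblems", "They did not respond"});
        samples.add(new String[] {"customerserviceproblems", "I received no response\nfrom the website"});

        StringBuilder out = new StringBuilder();
        for (String[] sample : samples) {
            out.append(toLine(sample[0], sample[1]));
        }

        // "train.txt" is a placeholder path (assumption)
        Path target = Paths.get("train.txt");
        Files.write(target, out.toString().getBytes(StandardCharsets.UTF_8));
    }
}
```

The category label is written as a single token without spaces, matching the sample lines above, since the first token of each line is what separates the label from the text.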

This is what the Java code for building the model looks like:

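import java.io.*;
import opennlp.tools.doccat.*;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

// Uses the constructor and train(...) overloads as posted in this answer;
// newer OpenNLP releases may require different signatures.
DoccatModel model = null;
try (InputStream dataIn = new FileInputStream(yourFileOfSamplesLikeAbove);
     OutputStream modelOut = new BufferedOutputStream(new FileOutputStream(modelOutFile))) {
    // Read the training file one sample per line
    ObjectStream<String> lineStream = new PlainTextByLineStream(dataIn, "UTF-8");
    // Each line: the first token is the category, the rest is the text
    ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream);
    model = DocumentCategorizerME.train("en", sampleStream);
    // Persist the trained model; try-with-resources closes both streams
    model.serialize(modelOut);
    System.out.println("Model complete!");
} catch (IOException e) {
    // Failed to read or parse training data, training failed
    e.printStackTrace();
}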

Once you have the model, you can use it like this:

import java.io.File;
import java.util.HashMap;
import java.util.Map;
import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;

// Load the model written by the training step
DoccatModel doccatModel = new DoccatModel(new File(pathToModelYouJustMade));
DocumentCategorizerME documentCategorizerME = new DocumentCategorizerME(doccatModel);

/**
 * Returns a map from each category to its score for the given text.
 *
 * @param text the text to classify
 * @return category name mapped to its score
 * @throws Exception
 */
private Map<String, Double> getScore(String text) throws Exception {
    Map<String, Double> scoreMap = new HashMap<>();
    // One score per category, indexed by the categorizer's category index
    double[] categorize = documentCategorizerME.categorize(text);
    int catSize = documentCategorizerME.getNumberOfCategories();
    for (int i = 0; i < catSize; i++) {
        String category = documentCategorizerME.getCategory(i);
        scoreMap.put(category, categorize[documentCategorizerME.getIndex(category)]);
    }
    return scoreMap;
}

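The returned hash map then holds every category you modeled together with a score, and you can use those scores to decide which category the input text belongs to.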
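One way to turn that map into a decision is to take the highest-scoring category and fall back to "unclassified" when no score is convincing. The sketch below uses only the Java standard library; the class `TopCategory`, the method `pick`, the 0.5 threshold, and the scores in `main` are all made up for illustration and stand in for whatever `getScore(text)` returns.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

final class TopCategory {

    // Returns the highest-scoring category, or empty if the map is empty
    // or the best score is below minScore.
    static Optional<String> pick(Map<String, Double> scores, double minScore) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> entry : scores.entrySet()) {
            if (entry.getValue() > bestScore) {
                best = entry.getKey();
                bestScore = entry.getValue();
            }
        }
        return (best != null && bestScore >= minScore) ? Optional.of(best) : Optional.empty();
    }

    public static void main(String[] args) {
        // Made-up scores standing in for the map returned by getScore(text)
        Map<String, Double> scores = new LinkedHashMap<>();
        scores.put("customerserviceproblems", 0.72);
        scores.put("deliveryproblems", 0.18);
        scores.put("pricingcomplaints", 0.10);

        System.out.println(pick(scores, 0.5).orElse("unclassified"));
    }
}
```

A threshold is a design choice rather than something the answer prescribes; it is useful mainly when some emails belong to none of the modeled categories.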

Answer 2

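I'm not entirely sure, but I can think of two ways to approach the problem:

Standard machine learning

As mentioned in the comments, just extract the keywords from each email and train a classifier on them. You can define the relevant keyword set in advance and extract only those keywords when they are present in an email.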

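This is a simple yet powerful technique that should not be underestimated, since it yields very good results in many cases. You may want to try it first, as more complex algorithms may be overkill.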
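A minimal sketch of the extraction step, assuming a small hand-picked keyword set: each email is reduced to counts of those keywords, and the resulting counts are what a classifier would be trained on. The class `KeywordFeatures`, the method `extract`, and the keyword list are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class KeywordFeatures {

    // Hypothetical keyword set; a real one would come from inspecting the emails.
    static final List<String> KEYWORDS =
            Arrays.asList("respond", "response", "reply", "delivery", "refund", "fee");

    // Counts how often each predefined keyword occurs in the email.
    // Every other word is ignored.
    static Map<String, Integer> extract(String email) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String keyword : KEYWORDS) {
            counts.put(keyword, 0);
        }
        // Lowercase, then split on anything that is not a letter or apostrophe
        for (String token : email.toLowerCase(Locale.ROOT).split("[^a-z']+")) {
            if (counts.containsKey(token)) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(extract("No response. They did not respond to my refund request."));
    }
}
```

This matches exact word forms only, so "responded" would not count toward "respond"; stemming or a longer keyword list would be needed to cover inflected forms.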

Grammar

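If you really want to dig deeper into NLP, then based on your problem description you could try defining some kind of grammar and parsing the emails against it. I don't have much experience with Ruby, but I'm sure some lex/yacc-like tool exists for it. A quick web search turns up this SO question and this. By identifying such phrases, you can judge which category an email belongs to by computing the proportion of matched phrases for each category.

For example, intuitively, some of the productions in the grammar could be defined as: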
```
{organization}{negative}{verb} :- delivery problems
```
where organization = [they|the company|the firm|the website|the merchant], etc.
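As a rough illustration of the idea, the production above can be approximated with regular expressions instead of a real lex/yacc-style parser, which is a simplification rather than the grammar approach itself. In this sketch the class `PhraseGrammar`, the verb lists, and the reduced negative group are assumptions; only the organization group is taken from the text.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PhraseGrammar {

    static final String ORGANIZATION = "(?:they|the company|the firm|the website|the merchant)";
    // Assumed negative group for verb phrases
    static final String NEGATIVE = "(?:did not|didn't)";

    // Builds {organization}{negative}{verb} for one alternation of verbs.
    static Pattern rule(String verbs) {
        return Pattern.compile(
                "\\b" + ORGANIZATION + "\\s+" + NEGATIVE + "\\s+(?:" + verbs + ")\\b",
                Pattern.CASE_INSENSITIVE);
    }

    // Category -> rule; the verb lists are hypothetical.
    static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("customer service problems", rule("respond|answer|reply"));
        RULES.put("delivery problems", rule("deliver|ship|dispatch"));
    }

    // Share of matched phrases per category; empty when nothing matches.
    static Map<String, Double> proportions(String email) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        int total = 0;
        for (Map.Entry<String, Pattern> entry : RULES.entrySet()) {
            Matcher matcher = entry.getValue().matcher(email);
            int n = 0;
            while (matcher.find()) {
                n++;
            }
            if (n > 0) {
                counts.put(entry.getKey(), n);
            }
            total += n;
        }
        Map<String, Double> result = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            result.put(entry.getKey(), entry.getValue() / (double) total);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(proportions(
                "They didn't respond. The company did not reply. The merchant did not ship it."));
    }
}
```

The email would then be assigned to the category with the largest share. Sentences with a different word order, such as "I did not receive a response from the website", need additional rules.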

These approaches may be a good way to get started.

NLP to classify/label the content of a sentence (Ruby binding necessary)
