Question

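I am trying to use spaCy's NER to extract named entities from text. I have exposed the service as a REST POST request that takes the source text as input and returns a dictionary (Map) of lists of named entities (persons, locations, organizations). The services are exposed using Flask-RESTPlus and hosted on a Linux server.

For a sample text, I get the following response from a POST request to the REST API exposed through Swagger UI: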

{
  "ner_locations": [
    "Deutschland",
    "Niederlanden"
  ],
  "ner_organizations": [
    "Miele & Cie. KG",
    "Bayer CropScience AG"
  ],
  "ner_persons": [
    "Sebastian Krause",
    "Alex Schröder"
  ]
}

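When I call the API hosted on the Linux server with Spring's RestTemplate from a Spring Boot application (running in Eclipse on Windows), the JSON is parsed correctly. I added the following line to use UTF-8 encoding: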

restTemplate.getMessageConverters().add(0, new StringHttpMessageConverter(Charset.forName("UTF-8")));

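But when I deploy this Spring Boot application on a Linux machine and make the POST request to the API for NER tagging, ner_persons is not parsed correctly. While remote debugging, I get the following response: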

{
  "ner_locations": [
    "Deutschland",
    "Niederlanden"
  ],
  "ner_organizations": [
    "Miele & Cie. KG",
    "Bayer CropScience AG"
  ],
  "ner_persons": [
    "Sebastian ",
    "Krause",
    "Alex ",
    "Schröder"
  ]
}

I cannot understand why this strange behavior happens for persons but not for organizations.

Answer 1

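Being new to Python, it took me two days of debugging to understand the real problem and find a workaround.

The reason was that the names (e.g., "Sebastian Krause") were separated by \xa0, the non-breaking space character (e.g., "Sebastian\xa0Krause"), instead of a regular space. As a result, spaCy failed to recognize each name as a single named entity.

Browsing SO, I found the following solution: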
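A non-breaking space looks identical to a regular space when printed, so it is easy to miss. The following is a minimal sketch of how one might make such characters visible; the helper name find_unusual_spaces is hypothetical and not part of spaCy or the original code.

```python
import unicodedata


def find_unusual_spaces(text):
    """Return (index, Unicode name) for every space separator that is not a plain ASCII space.

    Hypothetical helper for illustration only.
    """
    hits = []
    for i, ch in enumerate(text):
        # Category "Zs" covers space separators such as U+00A0 NO-BREAK SPACE;
        # the ordinary ASCII space is also "Zs", so it is skipped explicitly.
        if ch != " " and unicodedata.category(ch) == "Zs":
            hits.append((i, unicodedata.name(ch)))
    return hits


sample = "Sebastian\xa0Krause"
print(repr(sample))                 # repr() shows the escape instead of a blank
print(find_unusual_spaces(sample))
```

Printing with repr() rather than print() alone is usually enough to spot the problem while debugging.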

import unicodedata
# Compatibility decomposition maps \xa0 to a regular space before the text is passed to spaCy.
norm_text = unicodedata.normalize("NFKD", source_text)

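This also normalizes other Unicode characters that have a compatibility decomposition, such as \u2026 (horizontal ellipsis). Characters without one, such as \u2013 (en dash), are left unchanged.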
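As a rough illustration of what the normalization does to the names from the question, here is a small self-contained sketch. The sample strings are made up for the example. It also compares NFKD with NFKC, which is a variation not used in the answer: NFKD splits "ö" into "o" plus a combining diaeresis, while NFKC recomposes it, which may matter if the extracted names are later compared against other strings.

```python
import unicodedata

# Sample input with non-breaking spaces and a horizontal ellipsis (made up for illustration).
source_text = "Sebastian\xa0Krause\u2026"

norm_text = unicodedata.normalize("NFKD", source_text)
print(repr(norm_text))  # non-breaking space and ellipsis are replaced by ASCII equivalents

# "\u00f6" is the precomposed letter o with diaeresis.
name = "Alex\xa0Schr\u00f6der"
nfkd = unicodedata.normalize("NFKD", name)  # decomposed: "o" + U+0308
nfkc = unicodedata.normalize("NFKC", name)  # recomposed: single U+00F6

print(len(name), len(nfkd), len(nfkc))
```

Both forms replace \xa0 with a regular space, so either should let spaCy see the name as two tokens separated by ordinary whitespace.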
