Question

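I am using the google-cloud/language API to make #annotate calls, analyzing entities and sentiment in CSV files of comments that I gathered from various online sources.

At first, the strings I was trying to analyze included a commentId, so I reformatted this: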

youtubez22htrtb1ymtdlka404t1aokg2kirffb53u3pya0,i just bot a Nostromo... ( ._.)
youtubez22oet0bruejcdf0gacdp431wxg3vb2zxoiov1da,Good Job Baby! MSI Propeller Blade Technology!
youtubez22ri11akra4tfku3acdp432h1qyzap3yy4ziifc,"exactly, i have to deal with that damned brick, and the power supply can&#39;t be upgraded because of it, because as far as power supply goes, i have never seen an external one on newegg that has more power then the x51&#39;s"
youtubez23ttpsyolztc1ep004t1aokg5zuyqxfqykgyjqs,"I like how people are liking your comment about liking the fact that Sky DID put Deadlox&#39;s channel in the description instead of Ryan&#39;s. Nice Alienware thing logo thing, btw"
youtubez12zjp5rupbcttvmy220ghf4ctqnerqwa04,"You know, If you actually made this. People would actually buy it."

so that it no longer contains any comment IDs:

I just bot a Nostromo... ( ._.)
Good Job Baby! MSI Propeller Blade Technology!\n"exactly, i have to deal with that damned brick, and the power supply can&#39;t be upgraded because of it, because as far as power supply goes, i have never seen an external one on newegg that has more power then the x51&#39;s"
"I like how people are liking your comment about liking the fact that Sky DID put Deadlox&#39;s channel in the description instead of Ryan&#39;s.   Nice Alienware thing logo thing, btw"
"You know, If you actually made this. People would actually buy it."

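After sending this text to google-cloud/language in an #annotate request, the response I receive includes the sentiment and magnitude of various substrings. Each substring is also given a beginOffset value, which is supposed to correspond to its index in the original string (the string sent in the request).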

{ content: 'i just bot a Nostromo... ( ._.)\nGood Job Baby!',
  beginOffset: 0 }
{ content: 'MSI Propeller Blade Technology!\n"exactly, i have to deal with that damned brick, and the power supply can&#39;t be upgraded because of it, because as far as power supply goes, i have never seen an external one on newegg that has more power then the x51&#39;s"\n"I like how people are liking your comment about liking the fact that Sky DID put Deadlox&#39;s channel in the description instead of Ryan&#39;s.',
  beginOffset: 50 }
{ content: 'Nice Alienware thing logo thing, btw"\n"You know, If you actually made this.',
  beginOffset: 462 }

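My goal is to locate the original comment in the original string, which should be simple enough, with something like (originalString[beginOffset]) .....

But this value is incorrect!

I assume that some characters are not being counted, but I have tried a variety of regexes and nothing seems to work. Does anyone know what might be causing the problem?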

Answer 1

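This has to do with encoding. Try setting one of the encoding types, or simply use one of the sample methods provided in their GitHub repo:

https://github.com/GoogleCloudPlatform/python-docs-samples/blob/master/language/api/analyze.py

The key code block: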


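import sys


def get_native_encoding_type():
    """Returns the encoding type that matches Python's native strings."""
    # Narrow builds index strings by UTF-16 code unit; wide builds
    # (including every Python 3.3+ build) index by code point.
    if sys.maxunicode == 65535:
        return 'UTF16'
    else:
        return 'UTF32'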

This worked for me. Characters such as ’ (\u2019 in Unicode) were what had been throwing the offsets off.
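To see why the encoding type matters, here is a minimal sketch (not part of the linked sample) that computes the position of the same character measured in UTF-8 bytes, UTF-16 code units, and code points. The helper name `offset_in_units` and the sample string are made up for illustration; no API call is involved.

```python
def offset_in_units(text: str, char_index: int, encoding: str) -> int:
    """Offset of text[char_index], measured in units of the given encoding.

    Hypothetical helper for illustration; the encoding names mirror the
    API's UTF8 / UTF16 / UTF32 encoding types.
    """
    prefix = text[:char_index]
    if encoding == "UTF8":
        # One unit per byte.
        return len(prefix.encode("utf-8"))
    if encoding == "UTF16":
        # One unit per 16-bit code unit (two bytes each, no BOM).
        return len(prefix.encode("utf-16-le")) // 2
    if encoding == "UTF32":
        # One unit per code point, which is how Python 3 indexes str.
        return len(prefix)
    raise ValueError(f"unknown encoding type: {encoding}")


# Sample text containing \u2019 and one character outside the BMP.
sample = "can\u2019t stop \U0001F600 now"
index = sample.index("now")

for name in ("UTF8", "UTF16", "UTF32"):
    print(name, offset_in_units(sample, index, name))
```

The three numbers differ as soon as the text contains a non-ASCII character, so an offset is only usable as a string index when the requested encoding type matches how your language indexes strings.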

Answer 2

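I know this is an old question, but the problem seems to persist to this day. I recently ran into the same issue and solved it by interpreting Google's offsets as byte offsets rather than as string offsets in the chosen encoding. It works well. Hope this helps.

Below is some C# code, but anyone should be able to read it and rewrite it in their preferred language. Assuming text is the text being analyzed for sentiment, the following code converts Google's offset into the correct offset.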

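// Requires: using System.Text;
// Treats offset as a count of UTF-8 bytes and returns the matching string index.
int TransformOffset(string text, int offset)
{
   return Encoding.UTF8.GetString(
             Encoding.UTF8.GetBytes(text),
             0,
             offset)
          .Length;
}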
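For readers who do not use C#, the same idea could be sketched in Python as follows. This assumes, as this answer reports, that the offset is a count of UTF-8 bytes; the function name `byte_offset_to_char_index` and the sample text are hypothetical.

```python
def byte_offset_to_char_index(text: str, byte_offset: int) -> int:
    """Convert an offset counted in UTF-8 bytes into a Python str index.

    Assumption: byte_offset refers to the UTF-8 encoding of `text`.
    """
    prefix = text.encode("utf-8")[:byte_offset]
    # errors="ignore" drops a trailing partial character if the offset
    # happens to fall in the middle of a multi-byte sequence.
    return len(prefix.decode("utf-8", errors="ignore"))


text = "it\u2019s fine\nNext one"
# "Next" starts at byte 12 because \u2019 takes three bytes in UTF-8.
char_index = byte_offset_to_char_index(text, 12)
print(text[char_index:char_index + 4])
```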

Answer 3

You should set the EncodingType in the request.

An example using the Java client library with UTF-8 encoded text:

Document doc = Document.newBuilder().setContent(dreamText).setType(Type.PLAIN_TEXT).build();
        
AnalyzeEntitiesRequest request = AnalyzeEntitiesRequest.newBuilder().setEncodingType(EncodingType.UTF8).setDocument(doc).build();
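Once offsets come back in a known encoding, mapping a beginOffset to the comment it falls in is a lookup over cumulative start positions. The sketch below is an illustration only: it assumes the comments were joined with a single "\n" before being sent and that offsets are UTF-8 byte offsets, as requested via EncodingType.UTF8 above. The name `comment_index_for_offset` is hypothetical.

```python
from bisect import bisect_right


def comment_index_for_offset(comments, begin_offset):
    """Return the index of the comment that contains begin_offset.

    Assumptions: the request text was "\n".join(comments), and
    begin_offset is measured in UTF-8 bytes.
    """
    starts = []
    position = 0
    for comment in comments:
        starts.append(position)
        # +1 accounts for the "\n" separator after each comment.
        position += len(comment.encode("utf-8")) + 1
    # The last start that is <= begin_offset marks the owning comment.
    return bisect_right(starts, begin_offset) - 1


comments = ["caf\u00e9", "ok", "bye"]
print(comment_index_for_offset(comments, 6))
```

Keeping a parallel list of comment IDs in the same order would then let you recover the ID that was stripped out before the request.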

Why does Google Natural Language return an incorrect beginOffset for the analyzed string?
