Why does Google Natural Language return an incorrect beginOffset for the analyzed string?

Asked: 2017-02-15 12:56:00

Tags: javascript string offset sentiment-analysis google-language-api

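I'm using the google-cloud/language API to make an #annotate call and analyze entities and sentiment in a CSV of comments that I collected from various online sources.

At first, the strings I was trying to analyze included a commentId, so I reformatted this: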

youtubez22htrtb1ymtdlka404t1aokg2kirffb53u3pya0,i just bot a Nostromo... ( ._.)
youtubez22oet0bruejcdf0gacdp431wxg3vb2zxoiov1da,Good Job Baby! MSI Propeller Blade Technology!
youtubez22ri11akra4tfku3acdp432h1qyzap3yy4ziifc,"exactly, i have to deal with that damned brick, and the power supply can't be upgraded because of it, because as far as power supply goes, i have never seen an external one on newegg that has more power then the x51's"
youtubez23ttpsyolztc1ep004t1aokg5zuyqxfqykgyjqs,"I like how people are liking your comment about liking the fact that Sky DID put Deadlox's channel in the description instead of Ryan's. Nice Alienware thing logo thing, btw"
youtubez12zjp5rupbcttvmy220ghf4ctqnerqwa04,"You know, If you actually made this. People would actually buy it."

so that it no longer contains any comment IDs:

I just bot a Nostromo... ( ._.)
Good Job Baby! MSI Propeller Blade Technology!
"exactly, i have to deal with that damned brick, and the power supply can't be upgraded because of it, because as far as power supply goes, i have never seen an external one on newegg that has more power then the x51's"
"I like how people are liking your comment about liking the fact that Sky DID put Deadlox's channel in the description instead of Ryan's.   Nice Alienware thing logo thing, btw"
"You know, If you actually made this. People would actually buy it."

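After sending the text in a google-cloud/language #annotate request, I receive a response that includes the sentiment and magnitude of various substrings. Each substring is also given a beginOffset value, which is supposed to be its index within the original string (the string sent in the request).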

{ content: 'i just bot a Nostromo... ( ._.)\nGood Job Baby!',
  beginOffset: 0 }
{ content: 'MSI Propeller Blade Technology!\n"exactly, i have to deal with that damned brick, and the power supply can't be upgraded because of it, because as far as power supply goes, i have never seen an external one on newegg that has more power then the x51's"\n"I like how people are liking your comment about liking the fact that Sky DID put Deadlox's channel in the description instead of Ryan's.',
  beginOffset: 50 }
{ content: 'Nice Alienware thing logo thing, btw"\n"You know, If you actually made this.',
  beginOffset: 462 }

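My goal is to locate each original comment within the original string, which should be simple enough: something like (originalString[beginOffset]) .....

But this value is incorrect!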

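I assume they are not counting certain characters, but I have tried a variety of regexes and nothing seems to work. Does anyone know what might be causing this?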

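3 Answers:

Answer 0: (score: 0)

This has to do with encoding. Try one of the encoding types, or simply use one of the sample methods provided in their GitHub repo: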

https://github.com/GoogleCloudPlatform/python-docs-samples/blob/master/language/api/analyze.py

The key code block:


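import sys


def get_native_encoding_type():
    """Returns the encoding type that matches Python's native strings."""
    if sys.maxunicode == 65535:
        # Narrow build (some Python 2 builds): indexed by UTF-16 code unit.
        return 'UTF16'
    else:
        # Wide build and Python 3.3+: indexed by code point.
        return 'UTF32'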

This worked for me. What was throwing the offsets off were characters like ’ (\u2019 in Unicode).
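Since the question is tagged javascript, here is a small sketch of why a character like ’ can shift offsets. It assumes Node.js (for `Buffer`); the helper name `measure` is made up for illustration. It only shows that the same text has different lengths depending on the unit being counted, which is one plausible source of the mismatch.

```javascript
// Compare the length of a string in UTF-16 code units and in UTF-8 bytes.
function measure(text) {
  return {
    utf16Units: text.length,                    // unit used by JS string indexing
    utf8Bytes: Buffer.byteLength(text, 'utf8'), // unit used by a UTF-8 offset
  };
}

const plain = measure("can't");      // ASCII apostrophe
const curly = measure('can\u2019t'); // typographic apostrophe U+2019

console.log(plain); // { utf16Units: 5, utf8Bytes: 5 }
console.log(curly); // { utf16Units: 5, utf8Bytes: 7 }
```

Every non-ASCII character before a sentence adds to the gap between the two counts, so the error would grow further into the text.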

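Answer 1: (score: 0)

I know this is an old question, but the problem seems to persist to this day. I recently ran into the same issue and solved it by interpreting Google's offsets as byte offsets rather than as string offsets in the chosen encoding. It worked well. Hope this helps.

Below is some C# code, but anyone should be able to read it and re-code it in their preferred language. Assuming text is the text actually being analyzed for sentiment, the code below converts Google's offset into the correct string offset.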

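// Requires: using System.Text;
// Treats offset as a count of UTF-8 bytes and returns the matching string index.
int TransformOffset(string text, int offset)
{
   return Encoding.UTF8.GetString(
             Encoding.UTF8.GetBytes(text),
             0,
             offset)
          .Length;
}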
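For readers working in JavaScript, as in the question, the same idea might look like the sketch below. It assumes Node.js and assumes, as this answer does, that the offset counts UTF-8 bytes; the function name `utf8OffsetToStringIndex` and the sample text are made up for illustration.

```javascript
// Convert a UTF-8 byte offset into an index usable with JS string methods.
function utf8OffsetToStringIndex(text, byteOffset) {
  const bytes = Buffer.from(text, 'utf8');
  // Decode only the bytes before the offset and count the resulting code units.
  return bytes.subarray(0, byteOffset).toString('utf8').length;
}

const sample = 'It\u2019s fine. Nice logo.';
// Pretend the API reported the byte offset of the second sentence.
const reportedOffset = Buffer.byteLength('It\u2019s fine. ', 'utf8'); // 13 bytes
const stringIndex = utf8OffsetToStringIndex(sample, reportedOffset);  // 11

console.log(sample.slice(stringIndex)); // Nice logo.
```

If the offset lands in the middle of a multi-byte character, the decoded prefix ends in a replacement character, so the result is only meaningful for offsets that fall on character boundaries.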

Answer 2: (score: 0)

You should set the EncodingType in the request.

An example using the Java client library with UTF-8 encoded text:

Document doc = Document.newBuilder().setContent(dreamText).setType(Type.PLAIN_TEXT).build();
        
AnalyzeEntitiesRequest request = AnalyzeEntitiesRequest.newBuilder().setEncodingType(EncodingType.UTF8).setDocument(doc).build();
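
In JavaScript, strings are indexed by UTF-16 code units, so the analogous request would presumably ask for UTF16 rather than UTF8. The sketch below only builds the request object; the helper name `buildSentimentRequest` is made up, and the commented-out call assumes the `@google-cloud/language` Node.js client and valid credentials, neither of which is exercised here.

```javascript
// Build the request body; encodingType decides what unit beginOffset counts in.
function buildSentimentRequest(text) {
  return {
    document: { content: text, type: 'PLAIN_TEXT' },
    // UTF16 makes offsets count UTF-16 code units, like JS string indices.
    encodingType: 'UTF16',
  };
}

const request = buildSentimentRequest('Good Job Baby! MSI Propeller Blade Technology!');
console.log(JSON.stringify(request));

// Assumed usage with the Node.js client (not run here):
//   const language = require('@google-cloud/language');
//   const client = new language.LanguageServiceClient();
//   const [result] = await client.analyzeSentiment(request);
//   for (const s of result.sentences) {
//     console.log(s.text.beginOffset, s.text.content);
//   }
```

With this setting, `originalString.slice(beginOffset)` from the question should line up with the returned sentence text, assuming the string sent is exactly the string being indexed.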