Question

I am working with the Twitter REST API 1.1 (user_timeline.json) in an R script and have collected a large number of tweets.

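Unfortunately, the texts contain many special characters, such as \n, ^ or a single \. So far, I have been able to replace them with str_replace_all or gsub before importing the tweets via fromJSON (jsonlite package):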

correctJSON <- function(string) {
  string <- str_replace_all(string, pattern = perl('\\\\(?![tn"])'), replacement = " ")
  string <- str_replace_all(string, pattern = "\n", replacement = " ")
  string <- str_replace_all(string, pattern = "\r", replacement = " ")
  string <- str_replace_all(string, pattern = "\\^", replacement = " ")
  return(string)
}

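Now I have a string that contains special characters such as \xed or \xa0. When I try to import it (via fromJSON(correctJSON(string))), I get the following error on the output of correctJSON: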

Fehler in parseJSON(txt) : lexical error: invalid bytes in UTF8 string.
      uch sind.Mutig von bd. Seiten�������������������������������
                 (right here) ------^

As far as I can see (AFAICS), the tweet containing the problematic characters is:

[{\"created_at\":\"Fri Feb 07 18:35:02 +0000 2014\",\"id\":431858659656990721,\"id_str\":\"431858659656990721\",\"text\":\"RT @FHubersr: @peteraltmaier //die Schwarz-Grünen werden zeigen, daß sich Ökologie und Ökonomie vertragen und kein Widerspruch sind.Mutig v…\",\"source\":\"<a href=\\\"http://twitter.com/download/iphone\\\" rel=\\\"nofollow\\\">Twitter for iPhone</a>\",\"truncated\":false,\"in_reply_to_status_id\":null,\"in_reply_to_status_id_str\":null,\"in_reply_to_user_id\":null,\"in_reply_to_user_id_str\":null,\"in_reply_to_screen_name\":null,\"user\":{\"id\":378693834,\"id_str\":\"378693834\"},\"geo\":null,\"coordinates\":null,\"place\":null,\"contributors\":null,\"retweeted_status\":{\"created_at\":\"Fri Feb 07 18:32:30 +0000 2014\",\"id\":431858022366064640,\"id_str\":\"431858022366064640\",\"text\":\"@peteraltmaier //die Schwarz-Grünen werden zeigen, daß sich Ökologie und Ökonomie vertragen und kein Widerspruch sind.Mutig von bd. Seiten\xed\xa0\xbd\xed\xb1\x8d\xed\xa0\xbd\xed\xb8\x8e\",\"source\":\"<a href=\\\"http://twitter.com/download/iphone\\\" rel=\\\"nofollow\\\">Twitter for iPhone</a>\",\"truncated\":false,\"in_reply_to_status_id\":431845492579123201,\"in_reply_to_status_id_str\":\"431845492579123201\",\"in_reply_to_user_id\":378693834,\"in_reply_to_user_id_str\":\"378693834\",\"in_reply_to_screen_name\":\"peteraltmaier\",\"user\":{\"id\":2172292811,\"id_str\":\"2172292811\"},\"geo\":null,\"coordinates\":null,\"place\":null,\"contributors\":null,\"retweet_count\":3,\"favorite_count\":4,\"favorited\":false,\"retweeted\":false,\"lang\":\"de\"},\"retweet_count\":3,\"favorite_count\":0,\"favorited\":false,\"retweeted\":false,\"lang\":\"de\"}]

I have tried many things, but even after reading several threads I cannot come up with a solution that replaces all problematic special characters.

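Note: Interestingly, when I import this single tweet via fromJSON on its own, I get no error. But as soon as I import the string processed by correctJSON, the error is thrown. I still need correctJSON, though, because many such cases occur...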

PS: I have pasted only the problematic tweet. The full output of my API call, which also contains this tweet, can be seen here: https://p.mehl.mx/?53c04753c247a48a#5w+HtSCYpcjRwSk0PdsP3P1w3u+Z22/f6GKMJRoW//8=

Thanks for your help!

Answer 1

OK, I found a possible answer, which works for the first 5,000 tweets I have collected so far:

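library(stringr)

correctJSON <- function(string) {
  # Replace every non-printable character (e.g. \n, \r) with a space.
  string <- str_replace_all(string, pattern = "[^[:print:]]", replacement = " ")
  # Replace a backslash unless it is followed by t, n or ".
  # perl() comes from older stringr releases; newer ones provide regex() instead.
  string <- str_replace_all(string, pattern = perl('\\\\(?![tn"])'), replacement = " ")
  return(string)
}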

The regex [^[:print:]] handles special characters such as \xed and \n, and probably \U.... as well. The second (perl) regex is needed only for the single \.
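To make the effect of the two patterns easier to see, here is a minimal sketch that applies them with base R's gsub instead of stringr. The function name clean_json_text and the sample strings are made up for illustration, and the sketch is only exercised on plain ASCII input; how gsub treats invalid bytes such as \xed may depend on the locale and is not shown here.

```r
# Base-R sketch of the two cleaning steps (gsub instead of stringr).
# clean_json_text is a hypothetical name, not part of any package.
clean_json_text <- function(string) {
  # Step 1: replace every non-printable character (e.g. \n, \r, \t) with a space.
  string <- gsub("[^[:print:]]", " ", string)
  # Step 2: replace a backslash unless it is followed by t, n or a double quote.
  # The negative lookahead (?!...) needs perl = TRUE in base R.
  string <- gsub('\\\\(?![tn"])', " ", string, perl = TRUE)
  string
}

# A real newline becomes a space.
with_newline <- clean_json_text("a\nb")

# A lone backslash (here before "d") becomes a space.
lone_backslash <- clean_json_text("C:\\dir")

# An escaped double quote is left untouched, so the JSON escaping survives.
escaped_quote <- clean_json_text('say \\"hi\\"')

print(c(with_newline, lone_backslash, escaped_quote))
```

The cleaned string would then be passed to fromJSON as in the question. Because step 1 runs first, a literal two-character sequence of backslash and n is untouched (both characters are printable), while a real newline is replaced.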

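So it works for now, and hopefully the many tweets still to come can be imported as well. If anything unexpected turns up, I will edit this answer.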
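To spot strings that might still trip the parser before calling fromJSON, a check along the following lines may help. This is only a sketch, assuming R 3.3.0 or later, where base R provides validUTF8(). The sample strings are made up, and \xff stands in for any byte that can never occur in valid UTF-8.

```r
# Hypothetical sample input: the second element ends in a byte (0xFF)
# that is never valid in UTF-8.
tweets <- c("Mutig von bd. Seiten", "Mutig von bd. Seiten\xff")

# validUTF8() returns one logical value per element.
is_valid <- validUTF8(tweets)

# Indices of the elements that a strict UTF-8 parser would likely reject.
bad <- which(!is_valid)
print(bad)
```

Elements flagged this way could be inspected or cleaned separately instead of waiting for fromJSON to fail on them.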

Replace special characters to make JSON API output valid
