Question

R newbie here. My data looks like this:

{'id': 19847005, 'profile_sidebar_fill_color': u'http://pbs.foo.com/profile_background', 'profile_text_color': u'333333', 'followers_count': 1105, 'location': u'San Diego, CA', 'profile_background_color': u'9AE4E8', 'listed_count': 43, '009', 'time_zone': u'Pacific Time (US & Canada)', 'protected': False}

I want to extract the location data from this text: San Diego, CA.

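I have been trying to use the stringr package to do this, but I can't get the regex right to capture both the city and the state. Sometimes the state is present, sometimes it is not.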

location_pattern <- "'location':\su'(\w+)'"
rawdata$location <- str_extract(rawdata$user, location_pattern)

Answer 1

You could try

str_extract_all(str1, perl("(?<=location.: u.)[^']+(?=')"))[[1]]
#[1] "San Diego, CA"
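
The same lookaround pattern can also be run with base R, which may help if your stringr version no longer provides perl() (I believe newer releases deprecated it, but check your installed version). This is only a sketch: `str1` here is a shortened copy of the record from the question, and `pattern` and `loc` are names made up for the example.

```r
# Shortened sample record based on the question's data (illustrative only).
str1 <- "{'id': 19847005, 'followers_count': 1105, 'location': u'San Diego, CA', 'listed_count': 43, 'protected': False}"

# The lookbehind anchors the match right after  'location': u'
# and the lookahead stops it just before the closing quote.
pattern <- "(?<=location.: u.)[^']+(?=')"

# regexpr() locates the first match; regmatches() returns the matched text.
loc <- regmatches(str1, regexpr(pattern, str1, perl = TRUE))
print(loc)
#[1] "San Diego, CA"
```

Because the character class is [^'] rather than \w, the match covers the whole field whether or not a state follows the city.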

Answer 2

It looks like a JSON string, but if you're not too concerned about that, then perhaps this will help.

library(stringi)

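# Split on braces, quotes (with optional u prefix), key separators and ": ";
# omit_empty = TRUE drops the empty pieces left between adjacent delimiters.
ss <- stri_split_regex(x, "[{}]|u?'|(, '(009')?)|: ", omit_empty = TRUE)[[1]]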
(m <- matrix(ss, ncol = 2, byrow = TRUE))
#      [,1]                         [,2]                                   
# [1,] "id"                         "19847005"                             
# [2,] "profile_sidebar_fill_color" "http://pbs.foo.com/profile_background"
# [3,] "profile_text_color"         "333333"                               
# [4,] "followers_count"            "1105"                                 
# [5,] "location"                   "San Diego, CA"                        
# [6,] "profile_background_color"   "9AE4E8"                               
# [7,] "listed_count"               "43"                                   
# [8,] "time_zone"                  "Pacific Time (US & Canada)"           
# [9,] "protected"                  "False"

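Now you have the ID names in the left column and the values in the right. If needed, it should be simple to reassemble the JSON from this point.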
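
With the keys in column 1 and the values in column 2, pulling out a single field is a lookup on the first column. A minimal sketch, assuming a matrix shaped like `m` above; only three rows are reproduced here as a stand-in, and `location` and `kv` are names invented for the example.

```r
# Stand-in for the key/value matrix `m` built above (a few rows only).
m <- matrix(c("id",        "19847005",
              "location",  "San Diego, CA",
              "protected", "False"),
            ncol = 2, byrow = TRUE)

# Select the value whose key in column 1 is "location".
location <- m[m[, 1] == "location", 2]
print(location)
#[1] "San Diego, CA"

# Alternatively, build a named character vector for lookups by key.
kv <- setNames(m[, 2], m[, 1])
kv[["location"]]
```

The named-vector form is convenient when several fields are needed from the same record.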

Also, regarding the JSON-ness, we could coerce m to a data.frame (or leave it as a matrix) and then use jsonlite::toJSON

library(jsonlite)
json <- toJSON(setNames(as.data.frame(m), c("ID", "Value")))
fromJSON(json)
#                           ID                                 Value
# 1                         id                              19847005
# 2 profile_sidebar_fill_color http://pbs.foo.com/profile_background
# 3         profile_text_color                                333333
# 4            followers_count                                  1105
# 5                   location                         San Diego, CA
# 6   profile_background_color                                9AE4E8
# 7               listed_count                                    43
# 8                  time_zone            Pacific Time (US & Canada)
# 9                  protected                                 False

Answer 3

Others have given possible solutions, but have not explained what likely went wrong with your attempt.

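The str_extract function uses POSIX extended regular expressions, which do not understand \w and \s; those shortcuts are specific to Perl regular expressions. You can use the perl function in the stringr package, and it will then recognize the shortcuts, or you can use [[:space:]] in place of \s and [[:alnum:]_] in place of \w, though you will more likely want something like [[:alpha:], ] or [^'].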

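Also, R's string parser processes the string before it is passed to the matching function, so you will need \\s and \\w if you use the perl function (or any other regular expression function in R). The first \ escapes the second, so that a single \ remains in the string to be interpreted as part of the regular expression.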
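
The two points above (doubled backslashes, and POSIX classes in place of the Perl shortcuts) can be sketched with base R's sub(). The sample strings and the names `x`, `pat_perl`, `pat_posix`, `loc_perl` and `loc_posix` are made up for illustration; the second string has no state, to show that [^'] handles both cases.

```r
# Two illustrative fragments: one with a state, one without.
x <- c("'location': u'San Diego, CA'",
       "'location': u'Portland'")

# In R source, "\\s" is the two-character regex \s after string parsing.
nchar("\\s")  # 2

# Perl-style shortcut (note the doubled backslash) ...
pat_perl  <- "'location':\\su'([^']+)'"
# ... and the POSIX character class that avoids backslashes entirely.
pat_posix <- "'location':[[:space:]]u'([^']+)'"

# Wrap each pattern in .* so sub() replaces the whole string with group 1.
loc_perl  <- sub(paste0(".*", pat_perl,  ".*"), "\\1", x, perl = TRUE)
loc_posix <- sub(paste0(".*", pat_posix, ".*"), "\\1", x)

print(loc_perl)
print(loc_posix)
```

Both calls should return the text between the quotes for each element, so the city comes back alone when no state is present.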

Extracting location data using regex in R
