I have a problem with a function from the "stringr" package: str_extract_all. I want to extract a pattern (hashtags, in my case) from a character vector. My data is:
'data.frame': 2858732 obs. of 15 variables:
$ created_at : Factor w/ 995761 levels "Fri Sep 12 00:00:00 +0000 2014",..: 164928 164929 164931 164931 164932 164937 164938 164938 164940 164940 ...
$ favorite_count : int 0 0 0 0 0 0 0 0 0 0 ...
$ favorited : Factor w/ 1 level "false": 1 1 1 1 1 1 1 1 1 1 ...
$ id : num 5.09e+17 5.09e+17 5.09e+17 5.09e+17 5.09e+17 ...
$ id_str : num 5.09e+17 5.09e+17 5.09e+17 5.09e+17 5.09e+17 ...
$ in_reply_to_status_id: Factor w/ 747393 levels "11111569142",..: 747393 747393 7983 747393 747393 7893 747393 7994 747393 747393 ...
$ in_reply_to_user_id : Factor w/ 594092 levels "1000001311","1000004054",..: 594092 594092 283452 594092 594092 39362 594092 87635 594092 594092 ...
$ lang : Factor w/ 60 levels "am","ar","bg",..: 13 13 13 13 13 13 13 13 13 13 ...
$ retweet_count : int 0 0 0 0 0 0 0 0 0 0 ...
$ retweeted : Factor w/ 1 level "false": 1 1 1 1 1 1 1 1 1 1 ...
$ text : chr "RT directe indirectecat Una nit dencartellada o perqu guanyarem http//tco/Sp09q6MVvq" "RT manelmarquez Grande TeresaForcadesF PConstituentLos poderosos no slo estn en Espaa tambin en Catalunya 9N2014 11S2014 htt" "AnastasyHope mas vale suspiro sino todo seria culpa mia aparco y quito las llaves del contacto" "RT VyvyanBasterd No s qu collons em passa avui per m'acabo de llevar amb una trempera sobrenatural vaig a aprofitar per a despe"| __truncated__ ...
$ user_screen_name : Factor w/ 3022692 levels "000000000096_",..: 929035 441496 1110467 741648 256996 569276 152104 2716367 2755620 2657050 ...
$ created : POSIXct, format: "2014-09-08 07:59:40" "2014-09-08 07:59:41" ...
$ created.day : POSIXct, format: "2014-09-08" "2014-09-08" ...
$ created.hours : POSIXct, format: "2014-09-08 08:00:00" "2014-09-08 08:00:00" ...
And my script:
doc0_hashtags = str_extract_all(doc0_hash$text, "#\\w+")
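The function runs, but the result is not what I expect: every element of the output is character(0). How can I fix this? I also tried another approach, extracting the hashtags with this function: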
extract.hashes = function(vec){
  hash.pattern = "#[[:alpha:]]+"
  have.hash = grep(x = vec, pattern = hash.pattern)
  hash.matches = gregexpr(pattern = hash.pattern,
                          text = vec[have.hash])
  extracted.hash = regmatches(x = vec[have.hash], m = hash.matches)
  df = data.frame(table(tolower(unlist(extracted.hash))))
  colnames(df) = c("tag","freq")
  df = df[order(df$freq,decreasing = TRUE),]
  return(df)
}
But when I use it, I get the following error:
dat = head(extract.hashes(as.matrix(doc0_hash$text)),50)
Error in `colnames<-`(`*tmp*`, value = c("tag", "freq")) :
'names' attribute [2] must be the same length as the vector [1]
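The error message says the data frame had one column where two names were supplied, which appears consistent with the table being built from zero matches. As a rough sketch only, a guarded variant could return an empty, correctly named data frame in that case. The name extract_hashes_safe is hypothetical, and only base R is assumed:

```r
# Hypothetical guarded variant of extract.hashes (base R only)
extract_hashes_safe <- function(vec) {
  hash_pattern <- "#[[:alpha:]]+"
  # regmatches() returns one character vector of matches per input element
  matches <- regmatches(vec, gregexpr(hash_pattern, vec))
  tags <- tolower(unlist(matches))
  if (length(tags) == 0) {
    # No hashtags found: return an empty data frame with the expected columns
    return(data.frame(tag = character(0), freq = integer(0)))
  }
  counts <- table(tags)
  df <- data.frame(tag = names(counts), freq = as.integer(counts),
                   stringsAsFactors = FALSE)
  # Most frequent tags first
  df[order(df$freq, decreasing = TRUE), ]
}
```

Called on text without any "#", this returns zero rows instead of failing at colnames<-.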
My data looks like this:
dput(head(doc0$text))
c("RT directe indirectecat Una nit dencartellada o perqu guanyarem http//tco/Sp09q6MVvq",
"RT manelmarquez Grande TeresaForcadesF PConstituentLos poderosos no slo estn en Espaa tambin en Catalunya 9N2014 11S2014 htt",
"AnastasyHope mas vale suspiro sino todo seria culpa mia aparco y quito las llaves del contacto",
"RT VyvyanBasterd No s qu collons em passa avui per m'acabo de llevar amb una trempera sobrenatural vaig a aprofitar per a despertar",
"RT FIL0S0FIA El amor no se manifiesta en el deseo de acostarse con alguien sino en el deseo de dormir junto a alguien",
"MMjencarlos no es sino bulgaro Pero las dos idiomas se parecen"
)
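One thing worth checking, as a sketch: the sample strings above do not seem to contain any "#" character (the punctuation looks already stripped, e.g. "http//tco/"), in which case no hashtag pattern can match. The snippet below uses base R so it runs without stringr. sample_text copies two of the strings above, and with_tags is a made-up string used only for comparison:

```r
# Two of the strings from the dput() output above
sample_text <- c(
  "RT directe indirectecat Una nit dencartellada o perqu guanyarem http//tco/Sp09q6MVvq",
  "MMjencarlos no es sino bulgaro Pero las dos idiomas se parecen"
)

# Count how many strings contain a literal "#"
n_with_hash <- sum(grepl("#", sample_text, fixed = TRUE))
print(n_with_hash)

# The same kind of pattern on a made-up string that does contain hashtags
with_tags <- "Votarem #9N2014 i #11S2014"
found <- regmatches(with_tags, gregexpr("#\\w+", with_tags))[[1]]
print(found)
```

If the count is also zero on the full doc0_hash$text column, the "#" characters were probably removed in an earlier cleaning step, and the extraction would need to run on the raw text instead.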
Thank you very much.