Extracting hashtags from a character vector with str_extract_all

Date: 2014-11-28 15:16:30

Tags: r twitter vector character stringr

I have a problem with the str_extract_all function from the "stringr" package. I want to extract a pattern (hashtags, in my case) from a character vector. My data is:

'data.frame':   2858732 obs. of  15 variables:
 $ created_at           : Factor w/ 995761 levels "Fri Sep 12 00:00:00 +0000 2014",..: 164928 164929 164931 164931 164932 164937 164938 164938 164940 164940 ...
 $ favorite_count       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ favorited            : Factor w/ 1 level "false": 1 1 1 1 1 1 1 1 1 1 ...
 $ id                   : num  5.09e+17 5.09e+17 5.09e+17 5.09e+17 5.09e+17 ...
 $ id_str               : num  5.09e+17 5.09e+17 5.09e+17 5.09e+17 5.09e+17 ...
 $ in_reply_to_status_id: Factor w/ 747393 levels "11111569142",..: 747393 747393 7983 747393 747393 7893 747393 7994 747393 747393 ...
 $ in_reply_to_user_id  : Factor w/ 594092 levels "1000001311","1000004054",..: 594092 594092 283452 594092 594092 39362 594092 87635 594092 594092 ...
 $ lang                 : Factor w/ 60 levels "am","ar","bg",..: 13 13 13 13 13 13 13 13 13 13 ...
 $ retweet_count        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ retweeted            : Factor w/ 1 level "false": 1 1 1 1 1 1 1 1 1 1 ...
 $ text                 : chr  "RT directe indirectecat Una nit dencartellada o perqu guanyarem http//tco/Sp09q6MVvq" "RT manelmarquez Grande TeresaForcadesF PConstituentLos poderosos no slo estn en Espaa tambin en Catalunya 9N2014 11S2014 htt" "AnastasyHope mas vale suspiro sino todo seria culpa mia aparco y quito las llaves del contacto" "RT VyvyanBasterd No s qu collons em passa avui per m'acabo de llevar amb una trempera sobrenatural vaig a aprofitar per a despe"| __truncated__ ...
 $ user_screen_name     : Factor w/ 3022692 levels "000000000096_",..: 929035 441496 1110467 741648 256996 569276 152104 2716367 2755620 2657050 ...
 $ created              : POSIXct, format: "2014-09-08 07:59:40" "2014-09-08 07:59:41" ...
 $ created.day          : POSIXct, format: "2014-09-08" "2014-09-08" ...
 $ created.hours        : POSIXct, format: "2014-09-08 08:00:00" "2014-09-08 08:00:00" ...

And my script:

doc0_hashtags = str_extract_all(doc0_hash$text, "#\\w+")

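The function runs, but the result is not what I expect: every element of the output is character(0). How can I fix this? I also tried extracting the hashtags another way, with this function: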

extract.hashes = function(vec){

  hash.pattern = "#[[:alpha:]]+"
  have.hash = grep(x = vec, pattern = hash.pattern)

  hash.matches = gregexpr(pattern = hash.pattern,
                          text = vec[have.hash])
  extracted.hash = regmatches(x = vec[have.hash], m = hash.matches)

  df = data.frame(table(tolower(unlist(extracted.hash))))
  colnames(df) = c("tag","freq")
  df = df[order(df$freq,decreasing = TRUE),]
  return(df)
}

But when I use it, I get the following error:

dat = head(extract.hashes(as.matrix(doc0_hash$text)),50)


Error in `colnames<-`(`*tmp*`, value = c("tag", "freq")) :
  'names' attribute [2] must be the same length as the vector [1]
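
This error is consistent with no hashtag being matched at all. A likely cause, which I have not verified on every R version, is that an empty table() turns into a one-column data frame, so assigning two column names fails. Below is a minimal sketch of a guarded variant using base R only. The name extract.hashes.safe and the example strings are hypothetical, made up for illustration.

```r
# Hypothetical guarded variant of extract.hashes (base R only).
extract.hashes.safe = function(vec){
  hash.pattern = "#[[:alpha:]]+"

  # regmatches() yields character(0) for elements without a match,
  # so no grep() pre-filter is needed.
  hash.matches = gregexpr(pattern = hash.pattern, text = vec)
  extracted.hash = tolower(unlist(regmatches(x = vec, m = hash.matches)))

  # No hashtag anywhere: return an empty data frame with the expected columns
  # instead of trying to rename the columns of an empty table.
  if (length(extracted.hash) == 0) {
    return(data.frame(tag = character(0), freq = integer(0),
                      stringsAsFactors = FALSE))
  }

  counts = table(extracted.hash)
  df = data.frame(tag = names(counts), freq = as.integer(counts),
                  stringsAsFactors = FALSE)
  df = df[order(df$freq, decreasing = TRUE), ]
  rownames(df) = NULL
  df
}

# Made-up input that contains hashtags
with.tags = extract.hashes.safe(c("vote #Catalunya #catalunya", "no tags here", "#vote"))

# Made-up input without any "#" character
without.tags = extract.hashes.safe(c("RT directe indirectecat Una nit", "no tags here"))
```

With this guard, input without hashtags gives a zero-row data frame with columns tag and freq rather than an error, which makes it easier to see that the problem lies in the data and not in the counting step.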

My data looks like this:

dput(head(doc0$text))

c("RT directe indirectecat Una nit dencartellada o perqu guanyarem http//tco/Sp09q6MVvq",
"RT manelmarquez Grande TeresaForcadesF PConstituentLos poderosos no slo estn en Espaa tambin en Catalunya 9N2014 11S2014 htt",
"AnastasyHope mas vale suspiro sino todo seria culpa mia aparco y quito las llaves del contacto",
"RT VyvyanBasterd No s qu collons em passa avui per m'acabo de llevar amb una trempera sobrenatural vaig a aprofitar per a despertar",
"RT FIL0S0FIA El amor no se manifiesta en el deseo de acostarse con alguien sino en el deseo de dormir junto a alguien",
"MMjencarlos no es   sino bulgaro Pero las dos idiomas se parecen"
)
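
Looking at this sample, none of the strings seems to contain a literal "#" (nor "@" or ":"), so the text may already have had its punctuation stripped before this step. That is an assumption based on these six rows only, not something confirmed for the full data. A rough sketch of how this could be checked, where raw.tweet is a hypothetical tweet invented for comparison:

```r
# Three of the tweets copied from the dput() output above.
sample.text = c(
  "RT directe indirectecat Una nit dencartellada o perqu guanyarem http//tco/Sp09q6MVvq",
  "AnastasyHope mas vale suspiro sino todo seria culpa mia aparco y quito las llaves del contacto",
  "MMjencarlos no es   sino bulgaro Pero las dos idiomas se parecen"
)

# Does any element contain a literal "#"?
has.hash = grepl("#", sample.text, fixed = TRUE)
any(has.hash)

# Hypothetical tweet that still has its punctuation, for comparison.
raw.tweet = "RT @directe: Una nit #9N2014 #Catalunya"
base.tags = regmatches(raw.tweet, gregexpr("#\\w+", raw.tweet))[[1]]
base.tags

# The same comparison with stringr, if it is installed.
if (requireNamespace("stringr", quietly = TRUE)) {
  print(stringr::str_extract_all(sample.text, "#\\w+"))
  print(stringr::str_extract_all(raw.tweet, "#\\w+"))
}
```

If any(has.hash) is FALSE for the whole column, the pattern has nothing to match, and the extraction would need to run on the text as it was before cleaning.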

Thank you very much.

0 answers:

No answers yet.