Counting words and word stems in a large data frame (RStudio)

时间:2019-01-08 10:35:02

标签: r dictionary text text-mining stringr

I have a large data frame consisting of tweets, and a keyword dictionary loaded as a list of emotion-related words and word stems (kw_Emo). I need to find a way to count how many times each word/word stem from kw_Emo occurs in each tweet. In kw_Emo, word stems are marked with an asterisk (*). For example, one stem is ador*, which means I need to account for adorable, adore, adoring, and anything else that begins with ador…


In a previous Stack Overflow discussion (see the questions on my profile), the following solution helped me a great deal, but it only counts exact character matches (e.g., only ador, not adorable):

  1. Load the relevant packages.

    library(stringr)

  2. Identify and remove the * from the word stems in kw_Emo.

    for (x in 1:length(kw_Emo)) {
      if (grepl("[*]", kw_Emo[x]) == TRUE) {
        kw_Emo[x] <- substr(kw_Emo[x], 1, nchar(kw_Emo[x]) - 1)
      }
    }

  3. Create a new column for each word/word stem from kw_Emo, with a default value of 0.

    for (x in 1:length(keywords)) { dataframe[, keywords[x]] <- 0}

  4. Split each tweet into a vector of words, check whether each word equals any of the keywords, and add +1 to the corresponding word/word-stem column.

    for (x in 1:nrow(dataframe)) {
      partials <- data.frame(str_split(dataframe[x, 2], " "), stringsAsFactors = FALSE)
      partials <- partials[partials[] != ""]
      for (y in 1:length(partials)) {
        for (z in 1:length(keywords)) {
          if (keywords[z] == partials[y]) {
            dataframe[x, keywords[z]] <- dataframe[x, keywords[z]] + 1
          }
        }
      }
    }

Is there a way to adapt this solution to account for word stems? I am wondering whether it would be possible to first use string patterns to replace every occurrence of a stem with its exact characters, and only then apply the exact-match solution. For example, something like stringr::str_replace_all(x, "ador[a-z]+", "ador"). But I am not sure how to do this with a large dictionary and so many stems. Perhaps the loop that removes the [*], which effectively identifies all the stems, could somehow be adapted.
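In other words, maybe something along these lines - a rough sketch only (stems, stem_patterns and stem_repl are just placeholder names I made up, and overlapping dictionary entries such as accept / accepta* / accepted would probably need extra care):

library(stringr)

# Strip the asterisks and build one pattern per stem,
# e.g. "ador*" becomes "\\bador[a-z]*\\b"
stems         <- gsub("[*]", "", kw_Emo[grepl("[*]", kw_Emo)])
stem_patterns <- paste0("\\b", stems, "[a-z]*\\b")

# str_replace_all() accepts a named vector (names = patterns, values = replacements),
# so "adorable", "adoring", ... would all be collapsed back to "ador",
# after which the exact-match counting loop above could be reused
stem_repl        <- stems
names(stem_repl) <- stem_patterns

TestTweets$clean_text <- str_replace_all(TestTweets$clean_text, stem_repl)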


Here is a reproducible example of my data frame, called TestTweets, where the text to be analysed is in a column named clean_text:

dput(droplevels(head(TestTweets, 20)))

structure(list(Time = c("24/06/2016 10:55:04", "24/06/2016 10:55:04", 
"24/06/2016 10:55:04", "24/06/2016 10:55:04", "24/06/2016 10:55:04", 
"24/06/2016 10:55:04", "24/06/2016 10:55:04", "24/06/2016 10:55:04", 
"24/06/2016 10:55:04", "24/06/2016 10:55:04", "24/06/2016 10:55:04", 
"24/06/2016 10:55:04", "24/06/2016 10:55:04", "24/06/2016 10:55:04", 
"24/06/2016 10:55:04", "24/06/2016 10:55:04", "24/06/2016 10:55:04", 
"24/06/2016 10:55:04", "24/06/2016 10:55:03", "24/06/2016 10:55:03"
), clean_text = c("mayagoodfellow as always making sense of it all for us ive never felt less welcome in this country brexit  httpstcoiai5xa9ywv", 
"never underestimate power of stupid people in a democracy brexit", 
"a quick guide to brexit and beyond after britain votes to quit eu httpstcos1xkzrumvg httpstcocniutojkt0", 
"this selfinflicted wound will be his legacy cameron falls on sword after brexit euref httpstcoegph3qonbj httpstcohbyhxodeda", 
"so the uk is out cameron resigned scotland wants to leave great britain sinn fein plans to unify ireland and its o", 
"this is a very good summary no biasspinagenda of the legal ramifications of the leave result brexit httpstcolobtyo48ng", 
"you cant make this up cornwall votes out immediately pleads to keep eu cash this was never a rehearsal httpstco", 
"no matter the outcome brexit polls demonstrate how quickly half of any population can be convinced to vote against itself q", 
"i wouldnt mind so much but the result is based on a pack of lies and unaccountable promises democracy didnt win brexit pro", 
"so the uk is out cameron resigned scotland wants to leave great britain sinn fein plans to unify ireland and its o", 
"absolutely brilliant poll on brexit by yougov httpstcoepevg1moaw", 
"retweeted mikhail golub golub\r\n\r\nbrexit to be followed by grexit departugal italeave fruckoff czechout httpstcoavkpfesddz", 
"think the brexit campaign relies on the same sort of logic that drpepper does whats the worst that can happen thingsthatarewellbrexit", 
"am baffled by nigel farages claim that brexit is a victory for real people as if the 47 voting remain are fucking smu", 
"not one of the uks problems has been solved by brexit vote migration inequality the uks centurylong decline as", 
"scotland should never leave eu  calls for new independence vote grow httpstcorudiyvthia brexit", 
"the most articulate take on brexit is actually this ft reader comment today httpstco98b4dwsrtv", 
"65 million refugees half of them are children  maybe instead of fighting each other we should be working hand in hand ", 
"im laughing at people who voted for brexit but are complaining about the exchange rate affecting their holiday\r\nremain", 
"life is too short to wear boring shoes  brexit")), .Names = c("Time", 
"clean_text"), row.names = c(NA, 20L), class = c("tbl_df", "tbl", 
"data.frame"))

And here is kw_Emo:

kw_Emo <- c("abusi*", "accept", "accepta*", "accepted", "accepting", "accepts", "ache*", "aching", "active*", "admir*", "ador*", "advantag*", "adventur*", "advers*", "affection*", "afraid", "aggravat*", "aggress*", "agoniz*", "agony", "agree", "agreeab*", "agreed", "agreeing", "agreement*", "agrees", "alarm*", "alone", "alright*", "amaz*", "amor*", "amus*", "anger*", "angr*", "anguish*", "annoy*", "antagoni*", "anxi*", "aok", "apath*", "appall*", "appreciat*", "apprehens*", "argh*", "argu*", "arrogan*", "asham*", "assault*", "asshole*", "assur*", "attachment*", "attract*", "aversi*", "avoid*", "award*", "awesome", "awful", "awkward*", "bashful*", "bastard*", "battl*", "beaten", "beaut*", "beloved", "benefic*", "benevolen*", "benign*", "best", "better", "bitch*", "bitter*", "blam*", "bless*", "bold*", "bonus*", "bore*", "boring", "bother*", "brave*", "bright*", "brillian*", "broke", "burden*", "calm*", "cared", "carefree", "careful*", "careless*", "cares", "casual", "casually", "certain*", "challeng*", "champ*", "charit*", "charm*", "cheer*", "cherish*", "chuckl*", "clever*", "comed*", "comfort*", "commitment*", "complain*", "compliment*", "concerned", "confidence", "confident", "confidently", "confront*", "confus*", "considerate", "contempt*", "contented*", "contentment", "contradic*", "convinc*", "cool", "courag*", "crap", "crappy", "craz*", "create*", "creati*", "credit*", "cried", "cries", "critical", "critici*", "crude*", "cry", "crying", "cunt*", "cut", "cute*", "cutie*", "cynic", "danger*", "daring", "darlin*", "daze*", "dear*", "decay*", "defeat*", "defect*", "definite", "definitely", "degrad*", "delectabl*", "delicate*", "delicious*", "deligh*", "depress*", "depriv*", "despair*", "desperat*", "despis*", "destruct*", "determina*", "determined", "devastat*", "difficult*", "digni*", "disadvantage*", "disagree*", "disappoint*", "disaster*", "discomfort*", "discourag*", "dishearten*", "disillusion*", "dislike", "disliked", "dislikes", "disliking", "dismay*", "dissatisf*", "distract*", "distraught", "distress*", "distrust*", "disturb*", "divin*", "domina*", "doom*", "dork*", "doubt*", "dread*", "dull*", "dumb*", "dump*", "dwell*", "dynam*", "eager*", "ease*", "easie*", "easily", "easiness", "easing", "easy*", "ecsta*", "efficien*", "egotis*", "elegan*", "embarrass*", "emotion", "emotional", "empt*", "encourag*", "energ*", "engag*", "enjoy*", "enrag*", "entertain*", "enthus*", "envie*", "envious", "excel*", "excit*", "excruciat*", "exhaust*", "fab", "fabulous*", "fail*", "fake", "fantastic*", "fatal*", "fatigu*", "favor*", "favour*", "fear", "feared", "fearful*", "fearing", "fearless*", "fears", "feroc*", "festiv*", "feud*", "fiery", "fiesta*", "fine", "fired", "flatter*", "flawless*", "flexib*", "flirt*", "flunk*", "foe*", "fond", "fondly", "fondness", "fool*", "forgave", "forgiv*", "fought", "frantic*", "freak*", "free", "freeb*", "freed*", "freeing", "freely", "freeness", "freer", "frees*", "friend*", "fright*", "frustrat*", "fuck", "fucked*", "fucker*", "fuckin*", "fucks", "fume*", "fuming", "fun", "funn*", "furious*", "fury", "geek*", "genero*", "gentle", "gentler", "gentlest", "gently", "giggl*", "giver*", "giving", "glad", "gladly", "glamor*", "glamour*", "gloom*", "glori*", "glory", "goddam*", "gorgeous*", "gossip*", "grace", "graced", "graceful*", "graces", "graci*", "grand", "grande*", "gratef*", "grati*", "grave*", "great", "grief", "griev*", "grim*", "grin", "grinn*", "grins", "grouch*", "grr*", "guilt*", "ha", "haha*", "handsom*", "happi*", "happy", "harass*", "hated", 
"hateful*", "hater*", "hates", "hating", "hatred", "hazy", "heartbreak*", "heartbroke*", "heartfelt", "heartless*", "heartwarm*", "heh*", "hellish", "helper*", "helpful*", "helping", "helpless*", "helps", "hesita*", "hilarious", "hoho*", "homesick*", "honour*", "hope", "hoped", "hopeful", "hopefully", "hopefulness", "hopeless*", "hopes", "hoping", "horr*", "hostil*", "hug", "hugg*", "hugs", "humiliat*", "humor*", "humour*", "hurra*", "idiot", "ignor*", "impatien*", "impersonal", "impolite*", "importan*", "impress*", "improve*", "improving", "inadequa*", "incentive*", "indecis*", "ineffect*", "inferior*", "inhib*", "innocen*", "insecur*", "insincer*", "inspir*", "insult*", "intell*", "interest*", "interrup*", "intimidat*", "invigor*", "irrational*", "irrita*", "isolat*", "jaded", "jealous*", "jerk", "jerked", "jerks", "joke*", "joking", "joll*", "joy*", "keen*", "kidding", "kind", "kindly", "kindn*", "kiss*", "laidback", "lame*", "laugh*", "lazie*", "lazy", "liabilit*", "libert*", "lied", "lies", "like", "likeab*", "liked", "likes", "liking", "livel*", "LMAO", "LOL", "lone*", "longing*", "lose", "loser*", "loses", "losing", "loss*", "lost", "lous*", "love", "loved", "lovely", "lover*", "loves", "loving*", "low*", "luck", "lucked", "lucki*", "luckless*", "lucks", "lucky", "ludicrous*", "lying", "mad", "maddening", "madder", "maddest", "madly", "magnific*", "maniac*", "masochis*", "melanchol*", "merit*", "merr*", "mess", "messy", "miser*", "miss", "missed", "misses", "missing", "mistak*", "mock", "mocked", "mocker*", "mocking", "mocks", "molest*", "mooch*", "mood", "moodi*", "moods", "moody", "moron*", "mourn*", "nag*", "nast*", "neat*", "needy", "neglect*", "nerd*", "nervous*", "neurotic*", "nice*", "numb*", "nurtur*", "obnoxious*", "obsess*", "offence*", "offens*", "ok", "okay", "okays", "oks", "openminded*", "openness", "opportun*", "optimal*", "optimi*", "original", "outgoing", "outrag*", "overwhelm*", "pained", "painf*", "paining", "painl*", "pains", "palatabl*", "panic*", "paradise", "paranoi*", "partie*", "party*", "passion*", "pathetic*", "peculiar*", "perfect*", "personal", "perver*", "pessimis*", "petrif*", "pettie*", "petty*", "phobi*", "piss*", "piti*", "pity*", "play", "played", "playful*", "playing", "plays", "pleasant*", "please*", "pleasing", "pleasur*", "poison*", "popular*", "positiv*", "prais*", "precious*", "pressur*", "prettie*", "pretty", "prick*", "pride", "privileg*", "prize*", "problem*", "profit*", "promis*", "protested", "protesting", "proud*", "puk*", "radian*", "rage*", "raging", "rancid*", "rape*", "raping", "rapist*", "readiness", "ready", "reassur*", "reek*", "regret*", "reject*", "relax*", "relief", "reliev*", "reluctan*", "remorse*", "repress*", "resent*", "resign*", "resolv*", "restless*", "revigor*", "reward*", "rich*", "ridicul*", "rigid*", "risk*", "ROFL", "romanc*", "romantic*", "rotten", "rude*", "sad", "sadde*", "sadly", "sadness", "sarcas*", "satisf*", "savage*", "scare*", "scaring", "scary", "sceptic*", "scream*", "screw*", "selfish*", "sentimental*", "serious", "seriously", "seriousness", "severe*", "shake*", "shaki*", "shaky", "share", "shared", "shares", "sharing", "shit*", "shock*", "shook", "shy*", "sigh", "sighed", "sighing", "sighs", "silli*", "silly", "sincer*", "skeptic*", "smart*", "smil*", "smother*", "smug*", "snob*", "sob", "sobbed", "sobbing", "sobs", "sociab*", "solemn*", "sorrow*", "sorry", "soulmate*", "special", "splend*", "stammer*", "stank", "startl*", "stink*", "strain*", "strange", "strength*", "stress*", "strong*", "struggl*", 
"stubborn*", "stunk", "stunned", "stuns", "stupid*", "stutter*", "succeed*", "success*", "suck", "sucked", "sucker*", "sucks", "sucky", "sunnier", "sunniest", "sunny", "sunshin*", "super", "superior*", "support", "supported", "supporter*", "supporting", "supportive*", "supports", "suprem*", "sure*", "surpris*", "suspicio*", "sweet", "sweetheart*", "sweetie*", "sweetly", "sweetness*", "sweets", "talent*", "tantrum*", "tears", "teas*", "tehe", "temper", "tempers", "tender*", "tense*", "tensing", "tension*", "terribl*", "terrific*", "terrified", "terrifies", "terrify", "terrifying", "terror*", "thank", "thanked", "thankf*", "thanks", "thief", "thieve*", "thoughtful*", "threat*", "thrill*", "ticked", "timid*", "toleran*", "tortur*", "tough*", "traged*", "tragic*", "tranquil*", "trauma*", "treasur*", "treat", "trembl*", "trick*", "trite", "triumph*", "trivi*", "troubl*", "TRUE", "trueness", "truer", "truest", "truly", "trust*", "truth*", "turmoil", "ugh", "ugl*", "unattractive", "uncertain*", "uncomfortabl*", "uncontrol*", "uneas*", "unfortunate*", "unfriendly", "ungrateful*", "unhapp*", "unimportant", "unimpress*", "unkind", "unlov*", "unpleasant", "unprotected", "unsavo*", "unsuccessful*", "unsure*", "unwelcom*", "upset*", "uptight*", "useful*", "useless*", "vain", "valuabl*", "valuing", "vanity", "vicious*", "vigor*", "vigour*", "villain*", "violat*", "virtuo*", "vital*", "vulnerab*", "vulture*", "warfare*", "warm*", "warred", "weak*", "wealth*", "weapon*", "weep*", "weird*", "welcom*", "well*", "wept", "whine*", "whining", "willing", "wimp*", "win", "winn*", "wins", "wisdom", "wise*", "witch", "woe*", "won", "wonderf*", "worr*", "worse*", "worship*", "worst", "wow*", "yay", "yays","yearn*","stench*")


The code from MRau's answer that did not work for me:

ind_stem <- grep("[*]", kw_Emo)
kw_stem  <- gsub("[*]", "", kw_Emo[ind_stem])
kw_word  <- kw_Emo[-ind_stem]
tweets <- strsplit(TestTweets[, "clean_text"], "\\s+")

for (kws in kw_stem) {
  count_i <- unlist(lapply(tweets, function(x) length(grep(kws, x))))
  TestTweets <- cbind(TestTweets, count_i)
  colnames(TestTweets)[ncol(TestTweets)] <- paste0(kws, "*")
}
for (kww in kw_word) {
  count_i <- unlist(lapply(tweets, function(x) length(grep(paste0("^", kww, "$"), x))))
  TestTweets <- cbind(TestTweets, count_i)
  colnames(TestTweets)[ncol(TestTweets)] <- kww
}

1 answer:

Answer 0 (score: 1)

So first of all I would get rid of some of the for loops:

ind_stem <- grep("[*]", kw_Emo)
kw_stem  <- gsub("[*]", "", kw_Emo[ind_stem])
kw_word  <- kw_Emo[-ind_stem]
tweets <- strsplit(TestTweets[, "clean_text"], "\\s+")

I generate separate vectors for the word stems and the plain words. tweets is a list of word vectors - strsplit splits each string using whitespace (\s+) as the separator.
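As a small illustration (not from the original answer), splitting the last tweet from the sample data shows the effect - the \s+ pattern also swallows the double space before "brexit", so no empty strings end up in the result:

> strsplit("life is too short to wear boring shoes  brexit", "\\s+")[[1]]
[1] "life"   "is"     "too"    "short"  "to"     "wear"   "boring" "shoes" 
[9] "brexit"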

For the word/stem matching you can use grep in both cases. By default it finds every element that contains the given pattern:

> grep("Abc", c("Abc", "Abcdef"))
[1] 1 2

But if you use ^ and $ you get an "exact match":

> grep("^Abc$", c("Abc", "Abcdef"))
[1] 1

In your code you then want the length of the grep output, e.g. to append it to your data.frame:

for (kws in kw_stem) {
    count_i <- unlist(lapply(tweets, function(x) length(grep(kws, x))))
    TestTweets <- cbind(TestTweets, count_i)
    colnames(TestTweets)[ncol(TestTweets)] <- paste0(kws, "*")
}
for (kww in kw_word) {
    count_i <- unlist(lapply(tweets, function(x) length(grep(paste0("^", kww, "$"), x))))
    TestTweets <- cbind(TestTweets, count_i)
    colnames(TestTweets)[ncol(TestTweets)] <- kww
}

A snippet of the output:

> TestTweets[19:20, c("clean_text", "boring")]
                                                                                                                    clean_text boring
19 im laughing at people who voted for brexit but are complaining about the exchange rate affecting their holiday\r\nremain      0
20                                                                           life is too short to wear boring shoes  brexit      1

Of course you can optimise this code further with your particular problem in mind, e.g. decide whether to use grep(paste0("^", kws), x) instead of grep(kws, x) in the first loop.
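Just to illustrate, here is a minimal sketch of that anchored variant (not part of the code above, and untested on the full dictionary): it counts a stem only when a word starts with it rather than merely contains it. Note also that since TestTweets in the question is a tbl_df, TestTweets[, "clean_text"] returns a one-column tibble rather than a character vector, so the sketch uses [["clean_text"]] instead - that difference may also be why the loops did not run for you, though that is only a guess.

ind_stem <- grep("[*]", kw_Emo)
kw_stem  <- gsub("[*]", "", kw_Emo[ind_stem])

# [["clean_text"]] gives a plain character vector even from a tibble
tweets <- strsplit(TestTweets[["clean_text"]], "\\s+")

for (kws in kw_stem) {
  # "^" anchors the stem at the start of each word, so e.g. "ador" matches
  # "adorable" and "adoring" but not a word that merely contains "ador"
  count_i <- vapply(tweets, function(x) length(grep(paste0("^", kws), x)), integer(1))
  TestTweets[[paste0(kws, "*")]] <- count_i
}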