更快的替代方法,提取文本匹配另一个向量中的单词中的所有单词

时间:2016-04-04 01:00:23

标签: r stringr

我有一个带有一列文本的300,000行数据框。我需要提取每行中另一个单词矢量中的所有单词。

dput(df)
structure(list(DateReceived = structure(c(16800, 16800, 16800, 
16800, 16800, 16800), class = "Date"), CleanText = c("deposit check account 2800 00 available balance 4300 00 spent 2500 00 2800 00 check spent xxxx 1900 00 money checking account still saved school received looked available balance xxxx called bank told check deposited fraudulent check wait seven days clear within seven days check still n clear negative balanced remained told check returned pay return fee plus negative balance want know n bank teller tcf let deposit check without telling fake tell cash check give money back put account never cleared went pay back", 
"xxxx xxxx owner xxxx xxxx xxxx xxxx xxxx texas xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx texas purchased property obtaining secured deed trust citimortgage inc loan xxxx deed trust secured bank provided foreclosure sale property satisfy unpaid balance due deed trust since alleged breach agreement made monthly mortgage payments property pursuant deed trust approximately five years though eventually balance accrued bank informed intention foreclose property xxxx xxxx 2015 principal balance deed trust approximately 100000 00 bank foreclose property property sold xxxx xxxx 2015 150000 00 understanding bank received full sales price purchaser property sales price exceeded balance deed trust approximately 49000 00 bank given none 49000 00 excess proceeds sale property despite requests amount rather bank informed would provide money xxxx xxxx corporation lien property notified bank going foreclose house fact called notice sent address mine upon inquiry also found called notice sent one week prior sale property one would expected adequate time notice sent correct address foreclosure", 
"xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx nj xxxx xxxx xxxx 2015 name xxxx xxxx borrower real estate closing xxxx xxxx xxxx xxxx xxxx xxxx xxxx nj xxxx closing took place xxxx xxxx 2015 closing delayed almost 2 months due bank america loan officer underwriter numerous mistakes although boa loan officer reassured penalties rescheduling closing 2800 00 fees closing never agreed see attached hud page xxxx item xxxx one informed adjusted origination charges discrepancies good faith estimates provided see attached gfe dated xxxx xxxx xxxx 1 gfe dated xxxx xxxx 2015 charge 660 00 2 75 estimated settlement charge summary dated xxxx xxxx 2015 showed rate chosen discount points 660 00 respa credit worksheet dated xxxx xxxx 2015 showed loan lock extend fee 660 00 2 gfe dated xxxx xxxx 2015 charge 1000 00 2 75 estimated settlement charge summary dated xxxx xxxx 2015 showed rate chosen discount points 1000 00 respa credit worksheet dated xxxx xxxx 2015 showed discount point fee 660 00 3 gfe dated xxxx xxxx 2015 charge 1500 00 2 75 estimated settlement charge summary dated xxxx xxxx 2015 showed rate chosen discount points 1500 00 respa credit worksheet dated xxxx xxxx 2015 showed discount point fee 1500 00 4 xxxx xxxx 2015 closing actual fee 2800 00 although boa loan officer repeatedly advised waive penalties fees paid 2800 00 closing today received refund check explanation anyone hereby demand bank america give full refund 2800 00 sincerely xxxx xxxx", 
"sears mastercard issued citibank held since 2010 paperless online statements email notification make payments online access online account statements xxxx billing cycles make payments online though receive email notification statement available talked ir customer service recently today offer solution timeline fixing problem explained result software upgrade offered send paper statement would arrive 3 days payment noted violation law insisted legal requirement send statement card holder without online access access cardholder agreement", 
"xxxx 2014 xxxx gift cards purchased home depo 500 00 total 1000 00 ended value 0 00 store told cards used california purchased always possession believe big time fraud going gift cards research says people maybe even mgmt company gets numbers back cards re inks cards innocent people like get ripped buying gifts tv news recently decided pursue year later cards locked etcc selling mgmt involved wont help purchased cards different home depot fine xxxx store commiting fraud act stupid brought attention see letter", 
"received dunning letter xxxx part debt collector attempted get validation debt sending xxxx letters receive satisfactory response sent estoppel letter see attached debt collector unbeknownst filed action xxxx xxxx xxxx courthouse acting representing bank america fraud court discovered default judgement entered xxxx xxxx xxxx xxxx debt discharged wife bankruptcy 2013 evidenced credit report xxxx party debt collectors posing attorney representing bank america fact nothing interlopers purchased old debt pennies dollar attempting collect entire amount properly served sheriff required result default judgement entered"
)), row.names = c(NA, 6L), class = "data.frame", .Names = c("DateReceived", 
"CleanText"))

还有另一个称为“负面”的词语向量,代表负面情绪词。它总共有3500个单词。此处显示的head输出:

head(negative)
[1] "abandon"      "abandoned"    "abandoning"   "abandonment"  "abandonments" "abandons"

我需要提取粘贴的负面词并按如下方式返回:

df$negativeWords
[1] "fraudulent, negative"                                    
[2] "foreclosure, unpaid, alleged, breach, foreclose, inquiry"
[3] "closing, delayed, mistakes, penalties, discrepancies"    
[4] "problem, violation"                                      
[5] "fraud"                                                   
[6] "fraud, default, bankruptcy, posing"  

我提出了这个有效的代码,但是sapply慢了。是否有更有效的替代方案可以避免循环?

library(stringr)
df$negativeWords <- sapply(str_extract_all(df$CleanText, '\\S+'), function(x) paste(unique(x[x %in% negative]), collapse = ', '))

此方法在完整的300,000行数据帧上花费以下时间:

   user  system elapsed 
 45.661   4.795  52.619 

1 个答案:

答案 0 :(得分:1)

将它拆分成字符串后,它从字符串处理转移到匹配/连接问题,data.table做得很好

enum State {Cat, Noise, Food};
string StateStrings[3][2];

...

StateStrings[Cat][0] = "Meow";
StateStrings[Cat][1] = "Ignore";

StateStrings[Noise][0] = "Boing";
StateStrings[Noise][1] = "Thud";

StateStrings[Food][0] = "Lemons";
StateStrings[Food][1] = "Cinnamon";

State mystate = Cat;

...

void change_state(char c)
{
  // on taking character c, it changes current state
  // If state is Cat and I get 1 , change to Food
  // If state is Cat and I get 2 , change to Noise
  // If state is Food and I get 1 , change to Noise
  // If state is Food and I get 2 , change to Cat
  // If state is Noise and I get 1 , change to Cat
  // If state is Noise and I get 2 , change to Food

  switch (mystate)
  {
    case Cat: {
      switch (c) {
        case '1': mystate = Food; break;
        case '2': mystate = Noise; break;
      }
      break;
    }
    case Noise: {
      switch (c) {
        case '1': mystate = Cat; break;
        case '2': mystate = Food; break;
      }
      break;
    }
    case Food: {
      switch (c) {
        case '1': mystate = Noise; break;
        case '2': mystate = Cat; break;
      }
      break;
    }
  }
}