In R / RStudio, find the row ID of the string in one column that matches each row of another column

Asked: 2015-10-18 20:14:49

Tags: r dataframe

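If a row of my data frame has text in the "ReferenceText" column, the corresponding text in the "Text" column is a reply comment. If "ReferenceText" is NA, the corresponding text in the "Text" column is an original post.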

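I would like to use regular expressions if possible (I was thinking of gregexpr and regmatches), but any other form of pattern matching in R / RStudio that does the following would be fine: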

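For each "ReferenceText" value, I want to find the matching text in the "Text" column and write the "ID" of that "Text" observation into the "PostID" column of the "ReferenceText" row. I also want to index the order ("Sequence") of each reply comment within its "PostID".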

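For example, if the original post with ID = 5 is replied to (its "Text" is found in "ReferenceText" rows 6 and 9), then the "PostID" observation in row 6 would be set to 5 and its "Sequence" observation to 1 (PostID = 5, Sequence = 1). If the original post is replied to again, or its text is repeated in "ReferenceText" (row 9), that row's "Sequence" would be 2 (PostID = 5, Sequence = 2). My data set is fairly large (over 160,000 observations), so a function that can handle this would be greatly appreciated. Any ideas?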
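The procedure described above could be sketched roughly as follows. This is only an illustration under assumptions, not a tested solution for the full data: it uses literal prefix matching with base R's `startsWith()` instead of `gregexpr`, because some reference strings appear to be truncated copies of the original post and may contain regex operators. The toy data frame `df` and the helper name `link_replies` are hypothetical.

```r
# Hypothetical toy data mirroring the relevant columns of sampleDF.
df <- data.frame(
  ID = 1:6,
  Text = c("How do I sign up?", "Here is the link.", "When is it out?",
           "Thanks, but it fails.", "In April.", "Hope I win!"),
  ReferenceText = c("NA", "How do I sign up?", "NA",
                    "How do I sign", "When is it out? ", "NA"),
  stringsAsFactors = FALSE
)

# Assumed helper: set PostID to the ID of the first Text that starts with
# the trimmed ReferenceText, then number the replies within each PostID.
link_replies <- function(d) {
  ref <- trimws(d$ReferenceText)
  # The sample data stores missing references as the string "NA".
  is_reply <- !is.na(ref) & ref != "NA" & nzchar(ref)
  d$PostID <- NA_integer_
  d$Sequence <- NA_integer_
  # Look up each distinct reference once; repeated references reuse the result.
  keys <- unique(ref[is_reply])
  hit <- vapply(keys, function(k) {
    i <- which(startsWith(d$Text, k))
    if (length(i)) as.integer(d$ID[i[1]]) else NA_integer_
  }, integer(1))
  d$PostID[is_reply] <- unname(hit[ref[is_reply]])
  # Sequence is the running count of replies to the same post, in row order.
  ok <- !is.na(d$PostID)
  d$Sequence[ok] <- ave(d$PostID[ok], d$PostID[ok], FUN = seq_along)
  d
}

res <- link_replies(df)
res[, c("ID", "PostID", "Sequence")]
```

On this toy input, rows 2 and 4 get PostID 1 with Sequence 1 and 2, and row 5 gets PostID 3 with Sequence 1. The prefix scan runs once per distinct reference, so it may be slow on 160,000 rows; if references were always complete copies of the post, `match(ref, d$Text)` would likely be much faster.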

I hope this makes sense. Note that any R / RStudio solution would be fine. [image of my sampleDF]

I have pasted the dput output of the data.frame object below.

## dput output assigned to my sampleDF data frame
> dput(sampleDF)
structure(list(ID = 1:30, Screen.Name = c("User 1", "User 2", 
"User 3", "User 4", "User 5", "User 6", "User 7", "User 8", "User 5", 
"User 9", "User 9", "User 1", "User 1", "User 10", "User 8", 
"User 11", "Company", "User 12", "User 13", "User 14", "User 15", 
"User 16", "User 17", "User 18", "User 19", "User 13", "User 20", 
"User 21", "Uer 21", "User 21"), Text = c("Can anyone tell me where in the bloody world this TROLL came from.  Is he the troll of the week at the national trolling academy?  https://www.facebook.com/joseph.barnhorst", 
"company's \"You're Kinda a Big Deal\" promotion is kinda lame and insulting.  How about a service that actually is up to speed as advertised?  Now THATwould be a big deal for company.", 
"Hope I win sumthing!", "Im paying 90 dollars for a reason, so fix whatever is broken so I can actually use my phone!", 
"How do you sign up for the Your A Big Deal Sweepstakes?", "http://company.promo.eprize.com/sweepstakes/:b=chrome/?INTCID=TSC:MyS:MyA:Skn:013113:EngagementSweeps#", 
"Thanks for your giant mess up. I'm down 370 dollars.", "When will the blackberry 10 be available ?", 
"Thank you for the link but should the email get you there also?  I clicked on the mobile ad in and e-bill options to finish the sign up for those and received an error message for both.  Perhaps a link isn't working properly there either.", 
"Wait, Could it be Joseph is upset cause his Milkshake didn't bring the boys to his yard? Must see, look at this > http://www.youtube.com/watch?v=gFK8zYYoMtQ", 
"What a putz.... lol", "LMBO!", "The Blackberry Q10 will be available to US carriers in April.", 
"My mobile hot spot just shut off Randomly and now It tell me to set it up again I already have it on my plan", 
"I'm really interested in seeing this phone, I hope it's as good if not better than iPhones cause nothing new has challenged apple really", 
"Turn off the LTE in Carlisle it does work like at all. Or please fix. Won't load anything under LTE", 
"Need a hand? Check out this redesigned umbrella handle that lets you keep texting even during a downpour. http://bit.ly/12fniVl", 
"why we are still having service issues,  ", "unlimited data isn't worth anything when you can't get service with or without a femtocell and tech support has been next to useless over the past few months.", 
"Because everyone should be texting while walking in the rain... face palm.", 
"When is the LTE going to be available in NYC?? I was told end of last year... but it's Feb now.....", 
"Get us 4g already", "So silly. But people will buy it I am sure", 
"rubbish", "Free umbrella with company phones?! I bet it helps with the sewage internet connection you guys have.", 
"oh and more loveliness.. just had a company rep hang up on me..this is twice...nice job.", 
"Oh wow", "The pressure is getting to them. The CEO has put them in a no win situation.", 
"Never", "LTE means Lying To Everyone."), ReferenceText = c("NA", 
"NA", "NA", "NA", "NA", "How do you sign up for the Your A Big Deal Sweepstakes?", 
"NA", "NA", "How do you sign up for the Your A Big Deal Sweepstakes?", 
"Can anyone tell me where in the bloody world this TROLL came from.  Is he the troll of the week at the national trolling academy?  https://www.facebook.com/joseph", 
"Can anyone tell me where in the bloody world this TROLL came from.  Is he the troll of the week at the national trolling academy?  https://www.facebook.com/joseph", 
"Can anyone tell me where in the bloody world this TROLL came from.  Is he the troll of the week at the national trolling academy?  https://www.facebook.com/joseph", 
"When will the blackberry 10 be available ?", "NA", "When will the blackberry 10 be available ?", 
"NA", "NA", "NA", "NA", "Need a hand? Check out this redesigned umbrella handle that lets you keep texting even during a downpour. http://bit.ly/12fniVl", 
"NA", "Need a hand? Check out this redesigned umbrella handle that lets you keep texting even during a downpour. http://bit.ly/12fniVl", 
"Need a hand? Check out this redesigned umbrella handle that lets you keep texting even during a downpour. http://bit.ly/12fniVl", 
"Need a hand? Check out this redesigned umbrella handle that lets you keep texting even during a downpour. http://bit.ly/12fniVl", 
"Need a hand? Check out this redesigned umbrella handle that lets you keep texting even during a downpour. http://bit.ly/12fniVl", 
"NA", "Need a hand? Check out this redesigned umbrella handle that lets you keep texting even during a downpour. http://bit.ly/12fniVl", 
"oh and more loveliness.. just had a company rep hang up on me..this is twice...nice job. ", 
"When is the LTE going to be available in NYC?? I was told end of last year... but it's Feb now.....", 
"NA"), PostID = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA), Sequence = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA), DATE = c("2/1/2013", "2/1/2013", "2/1/2013", 
"2/1/2013", "2/1/2013", "2/1/2013", "2/1/2013", "2/1/2013", "2/1/2013", 
"2/1/2013", "2/1/2013", "2/1/2013", "2/1/2013", "2/1/2013", "2/1/2013", 
"2/1/2013", "2/1/2013", "2/1/2013", "2/1/2013", "2/1/2013", "2/1/2013", 
"2/1/2013", "2/1/2013", "2/1/2013", "2/1/2013", "2/1/2013", "2/1/2013", 
"2/1/2013", "2/1/2013", "2/1/2013"), X_M__millitary_time_ = c("16:46:20", 
"16:52:07", "16:55:54", "17:08:41", "17:10:08", "17:13:01", "17:13:17", 
"17:15:17", "17:19:01", "17:36:39", "17:41:08", "17:42:44", "17:45:42", 
"17:50:08", "17:50:53", "17:53:25", "18:00:01", "18:01:18", "18:03:37", 
"18:04:26", "18:05:41", "18:10:58", "18:11:17", "18:11:20", "18:11:41", 
"18:11:58", "18:13:19", "18:17:13", "18:18:34", "18:19:53"), 
timestampM = c("2/1/2013 16:46", "2/1/2013 16:52", "2/1/2013 16:55", 
"2/1/2013 17:08", "2/1/2013 17:10", "2/1/2013 17:13", "2/1/2013 17:13", 
"2/1/2013 17:15", "2/1/2013 17:19", "2/1/2013 17:36", "2/1/2013 17:41", 
"2/1/2013 17:42", "2/1/2013 17:45", "2/1/2013 17:50", "2/1/2013 17:50", 
"2/1/2013 17:53", "2/1/2013 18:00", "2/1/2013 18:01", "2/1/2013 18:03", 
"2/1/2013 18:04", "2/1/2013 18:05", "2/1/2013 18:10", "2/1/2013 18:11", 
"2/1/2013 18:11", "2/1/2013 18:11", "2/1/2013 18:11", "2/1/2013 18:13", 
"2/1/2013 18:17", "2/1/2013 18:18", "2/1/2013 18:19"), timestampN = c("2/1/2013 16:46", 
"2/1/2013 16:52", "2/1/2013 16:55", "2/1/2013 17:08", "2/1/2013 17:10", 
"2/1/2013 17:13", "2/1/2013 17:13", "2/1/2013 17:15", "2/1/2013 17:19", 
"2/1/2013 17:36", "2/1/2013 17:41", "2/1/2013 17:42", "2/1/2013 17:45", 
"2/1/2013 17:50", "2/1/2013 17:50", "2/1/2013 17:53", "2/1/2013 18:00", 
"2/1/2013 18:01", "2/1/2013 18:03", "2/1/2013 18:04", "2/1/2013 18:05", 
"2/1/2013 18:10", "2/1/2013 18:11", "2/1/2013 18:11", "2/1/2013 18:11", 
"2/1/2013 18:11", "2/1/2013 18:13", "2/1/2013 18:17", "2/1/2013 18:18", 
"2/1/2013 18:19")), .Names = c("ID", "Screen.Name", "Text", 
"ReferenceText", "PostID", "Sequence", "DATE", "X_M__millitary_time_", 
"timestampM", "timestampN"), class = "data.frame", row.names = c(NA, 
-30L))
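
Several ReferenceText values in the output above contain characters that are regex operators (such as `?` and `.`), so passing them to `grepl()` or `gregexpr()` as patterns can match more than intended. A small illustration, with `fixed = TRUE` as one way to force literal matching; the shortened `candidate` string is a made-up example, not a row from sampleDF:

```r
# A reference string taken from the data; as a regex, the trailing "s?" means
# "an optional s", not a literal question mark.
pattern <- "How do you sign up for the Your A Big Deal Sweepstakes?"

# Hypothetical text that lacks both the final "s" and the "?".
candidate <- "How do you sign up for the Your A Big Deal Sweepstake"

as_regex <- grepl(pattern, candidate)                # regex interpretation
as_fixed <- grepl(pattern, candidate, fixed = TRUE)  # literal interpretation

as_regex  # TRUE: the regex matches although the strings differ
as_fixed  # FALSE: the literal string is not contained in candidate
```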

0 answers:

No answers yet.