Extract words starting with @ in an R data frame and save them as a new column

Asked: 2018-06-15 09:16:17

Tags: r regex twitter

My data frame column looks like this:

head(tweets_date$Tweet)
[1] b"It is @DineshKarthik's birthday and here's a rare image of the captain of @KKRiders. Have you seen him do this before? Happy birthday, DK\\xf0\\x9f\\x98\\xac                                                                                                                             
[2] b'The awesome @IPL officials do a wide range of duties to ensure smooth execution of work! Here\\xe2\\x80\\x99s @prabhakaran285 engaging with the @ChennaiIPL kid-squad that wanted to meet their daddies while the presentation was on :) #cutenessoverload #lineofduty \\xf0\\x9f\\x98\\x81 
[3] b'\\xf0\\x9f\\x8e\\x89\\xf0\\x9f\\x8e\\x89\\n\\nCHAMPIONS!!                                                                                                                                                                                                                                  
[4] b'CHAMPIONS - 2018 #IPLFinal                                                                                                                                                                                                                                                                 
[5] b'Chennai are Super Kings. A fairytale comeback as @ChennaiIPL beat #SRH by 8 wickets to seal their third #VIVOIPL Trophy \\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86.  This is their moment to cherish, a moment to savour.                                          
[6] b"Final. It's all over! Chennai Super Kings won by 8 wickets                                                                                                                                                                                                                       

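These tweets contain mentions that start with "@". I need to extract all of them and save the mentions from each tweet as a single string of the form "@mention1 @mention2". At the moment my code only extracts them as a list.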

My code:

tweets_date$Mentions<-str_extract_all(tweets_date$Tweet, "@\\w+")

How can I collapse the list in each row into a single space-separated string, as described above?

Thanks in advance.

3 Answers:

Answer 0: (score: 4)

I believe an AsIs list column (created with I()) works best in this case:

Extract the words:

library(stringr)
Mentions <- str_extract_all(lis, "@\\w+")

Some data frame:

df <- data.frame(col = 1:6, lett = LETTERS[1:6])

Create a list column:

df$Mentions <- I(Mentions)
df
#output
  col lett     Mentions
1   1    A @DineshK....
2   2    B @IPL, @p....
3   3    C             
4   4    D             
5   5    E  @ChennaiIPL
6   6    F             

I think this is better because it allows very simple subsetting:

df$Mentions[[1]]
#output
[1] "@DineshKarthik" "@KKRiders"  

df$Mentions[[1]][1]
#output
[1] "@DineshKarthik"

It also shows the contents of the column concisely when df is printed.
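As a minimal sketch of what the list column makes easy, the snippet below counts the mentions in each row and pulls out a single element. The sample strings and the `n_mentions` column name are made up for illustration, and base R's `regmatches()`/`gregexpr()` stand in for `str_extract_all()` so that the snippet runs without extra packages.

```r
# Hypothetical sample tweets (not the original data)
lis <- c("Happy birthday @DineshKarthik of @KKRiders",
         "CHAMPIONS - 2018 #IPLFinal",
         "@ChennaiIPL beat #SRH by 8 wickets")

# Base R stand-in for str_extract_all(lis, "@\\w+"):
# returns a list with one character vector per tweet
Mentions <- regmatches(lis, gregexpr("@\\w+", lis))

df <- data.frame(col = 1:3, lett = LETTERS[1:3])
df$Mentions <- I(Mentions)   # I() keeps the list as-is in one column

# Number of mentions in each row (n_mentions is an illustrative name)
df$n_mentions <- vapply(df$Mentions, length, integer(1))

# Second mention of the first tweet
second <- df$Mentions[[1]][2]
```

Because every row keeps its own character vector, per-row operations such as counting or indexing need no string splitting.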

Data:

lis <- c("b'It is @DineshKarthik's birthday and here's a rare image of the captain of @KKRiders. Have you seen him do this before? Happy birthday, DK\\xf0\\x9f\\x98\\xac",                                                                                                                             
"b'The awesome @IPL officials do a wide range of duties to ensure smooth execution of work! Here\\xe2\\x80\\x99s @prabhakaran285 engaging with the @ChennaiIPL kid-squad that wanted to meet their daddies while the presentation was on :) #cutenessoverload #lineofduty \\xf0\\x9f\\x98\\x81",
"b'\\xf0\\x9f\\x8e\\x89\\xf0\\x9f\\x8e\\x89\\n\\nCHAMPIONS!!",                                                                                                                                                                                                                                  
"b'CHAMPIONS - 2018 #IPLFinal",                                                                                                                                                                                                                                                           
"b'Chennai are Super Kings. A fairytale comeback as @ChennaiIPL beat #SRH by 8 wickets to seal their third #VIVOIPL Trophy \\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86.  This is their moment to cherish, a moment to savour.",                                          
"b'Final. It's all over! Chennai Super Kings won by 8 wickets")

Answer 1: (score: 3)

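The str_extract_all function from the stringr package returns a list of character vectors. So if you want a single comma-separated string for each row, you could try sapply as a base R option: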

tweets <- str_extract_all(tweets_date$Tweet, "@\\w+")
tweets_date$Mentions <- sapply(tweets, function(x) paste(x, collapse=", "))

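The question asks for a space-separated string rather than a comma-separated one, so a minimal sketch of that variant follows. It assumes a small made-up `tweets_date` data frame, and it uses base R's `regmatches()`/`gregexpr()` in place of `str_extract_all()` so that it runs without stringr. With stringr loaded, the extraction line from the answer above could be used instead.

```r
# Hypothetical sample data standing in for tweets_date$Tweet
tweets_date <- data.frame(
  Tweet = c(
    "Happy birthday @DineshKarthik, captain of @KKRiders",
    "CHAMPIONS - 2018 #IPLFinal",
    "A fairytale comeback as @ChennaiIPL beat #SRH"
  ),
  stringsAsFactors = FALSE
)

# Base R stand-in for str_extract_all(): a list of character vectors
mentions <- regmatches(tweets_date$Tweet,
                       gregexpr("@\\w+", tweets_date$Tweet))

# Collapse each vector into one space-separated string.
# vapply() always returns a character vector; tweets without
# mentions become the empty string "".
tweets_date$Mentions <- vapply(mentions, paste, character(1), collapse = " ")
```

Using `vapply()` with `character(1)` instead of `sapply()` is a design choice, not a requirement: it fails loudly if any element does not collapse to exactly one string.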

Answer 2: (score: 1)

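From Twitter's help site: "Your username cannot be longer than 15 characters. Your real name can be longer (20 characters), but usernames are kept shorter for the sake of ease. A username can only contain alphanumeric characters (letters A-Z, numbers 0-9) with the exception of underscores, as noted above. Check to make sure your desired username doesn't contain any symbols, dashes, or spaces."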

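Note that tweets can contain email addresses, as well as URLs with an @ in them (and not just the silly URLs with a username/password in the host component). So something like: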

(^|[^[[:alnum:]_]@/\\!?=&])@([[:alnum:]_]{1,15})\\b

is probably a better, safer choice.
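The pitfall described above can be sketched with a few made-up strings. Note the assumptions: the `strict` pattern below is not the exact regex from this answer but a base R variant that uses a PCRE lookbehind (`perl = TRUE`) to express the same idea, and the sample tweets are hypothetical.

```r
# Hypothetical sample strings: a real mention, an email address,
# and a URL that contains an @
x <- c("Congrats @ChennaiIPL!",
       "mail user@example.com",
       "see https://medium.com/@someone/post")

# The simple pattern also picks up the email domain and the URL part
simple <- regmatches(x, gregexpr("@\\w+", x))

# Assumed variant: the @ must not follow a word character or one of
# @ / ! ? = &, and the name is limited to 1-15 word characters
strict_pattern <- "(?<![[:alnum:]_@/!?=&])@[[:alnum:]_]{1,15}\\b"
strict <- regmatches(x, gregexpr(strict_pattern, x, perl = TRUE))

simple_counts <- lengths(simple)   # matches per string, simple pattern
strict_counts <- lengths(strict)   # matches per string, stricter pattern
```

Here the simple pattern finds one match in every string, while the stricter variant keeps only the mention in the first one.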