Question

My data frame column looks like this:

head(tweets_date$Tweet)
[1] b"It is @DineshKarthik's birthday and here's a rare image of the captain of @KKRiders. Have you seen him do this before? Happy birthday, DK\\xf0\\x9f\\x98\\xac
[2] b'The awesome @IPL officials do a wide range of duties to ensure smooth execution of work! Here\\xe2\\x80\\x99s @prabhakaran285 engaging with the @ChennaiIPL kid-squad that wanted to meet their daddies while the presentation was on :) #cutenessoverload #lineofduty \\xf0\\x9f\\x98\\x81
[3] b'\\xf0\\x9f\\x8e\\x89\\xf0\\x9f\\x8e\\x89\\n\\nCHAMPIONS!!
[4] b'CHAMPIONS - 2018 #IPLFinal
[5] b'Chennai are Super Kings. A fairytale comeback as @ChennaiIPL beat #SRH by 8 wickets to seal their third #VIVOIPL Trophy \\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86.  This is their moment to cherish, a moment to savour.
[6] b"Final. It's all over! Chennai Super Kings won by 8 wickets

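These tweets contain mentions that start with "@". I need to extract all of them and save the mentions from each tweet as a single string of the form "@mention1 @mention2". Currently my code only extracts them as lists.

My code: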

tweets_date$Mentions<-str_extract_all(tweets_date$Tweet, "@\\w+")

How can I collapse the list in each row into a single space-separated string, as described above?

Thanks in advance.

Answer 1

I believe an AsIs list column (created with I()) works best in this case:

Extract the mentions:

library(stringr)
Mentions <- str_extract_all(lis, "@\\w+")

An example data frame:

df <- data.frame(col = 1:6, lett = LETTERS[1:6])

Create a list column:

df$Mentions <- I(Mentions)
df
#output
  col lett     Mentions
1   1    A @DineshK....
2   2    B @IPL, @p....
3   3    C             
4   4    D             
5   5    E  @ChennaiIPL
6   6    F

I think this is better because it allows very simple subsetting:

df$Mentions[[1]]
#output
[1] "@DineshKarthik" "@KKRiders"  

df$Mentions[[1]][1]
#output
[1] "@DineshKarthik"

It also shows the contents of the column concisely when df is printed.
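
If the list column ever needs to be filtered or flattened, a few base R calls are enough. The following is only a minimal sketch: the three-row `mentions` list and the names `with_mentions` and `collapsed` are made up for illustration and are not part of the answer above.

```r
# Hypothetical stand-in for the extracted mentions (one element per tweet).
mentions <- list(c("@DineshKarthik", "@KKRiders"), character(0), "@ChennaiIPL")

df <- data.frame(col = 1:3)
df$Mentions <- I(mentions)   # I() keeps the list as a single AsIs column

# Number of mentions per row.
n_mentions <- lengths(df$Mentions)

# Keep only the rows that mention at least one account.
with_mentions <- df[n_mentions > 0, ]

# Collapse each row's mentions into one space-separated string when needed.
collapsed <- vapply(df$Mentions, paste, character(1), collapse = " ")
```

Keeping the list column and collapsing only on demand means the individual mentions stay accessible, while a plain string is still one call away.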

Data (define lis before running the code above):

lis <- c("b'It is @DineshKarthik's birthday and here's a rare image of the captain of @KKRiders. Have you seen him do this before? Happy birthday, DK\\xf0\\x9f\\x98\\xac",
"b'The awesome @IPL officials do a wide range of duties to ensure smooth execution of work! Here\\xe2\\x80\\x99s @prabhakaran285 engaging with the @ChennaiIPL kid-squad that wanted to meet their daddies while the presentation was on :) #cutenessoverload #lineofduty \\xf0\\x9f\\x98\\x81",
"b'\\xf0\\x9f\\x8e\\x89\\xf0\\x9f\\x8e\\x89\\n\\nCHAMPIONS!!",
"b'CHAMPIONS - 2018 #IPLFinal",
"b'Chennai are Super Kings. A fairytale comeback as @ChennaiIPL beat #SRH by 8 wickets to seal their third #VIVOIPL Trophy \\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86.  This is their moment to cherish, a moment to savour.",
"b'Final. It's all over! Chennai Super Kings won by 8 wickets")

Answer 2

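The str_extract_all function from the stringr package returns a list of character vectors. So, if you want a single comma-separated string per row instead, you can try sapply as a base R option (use collapse = " " for the space-separated form the question asks for):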

tweets <- str_extract_all(tweets_date$Tweet, "@\\w+")
tweets_date$Mentions <- sapply(tweets, function(x) paste(x, collapse=", "))
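
For the space-separated form asked for in the question, the same idea can be sketched as below. This is a small self-contained illustration, not the asker's real data: the two-row `tweets_date` is invented, and `vapply` is swapped in for `sapply` only so that the result is guaranteed to be a character vector.

```r
library(stringr)

# Hypothetical two-row stand-in for the asker's data frame.
tweets_date <- data.frame(
  Tweet = c("It is @DineshKarthik's birthday, captain of @KKRiders.",
            "CHAMPIONS - 2018 #IPLFinal"),
  stringsAsFactors = FALSE
)

# One character vector of mentions per tweet.
mentions <- str_extract_all(tweets_date$Tweet, "@\\w+")

# Collapse each vector into a single space-separated string.
# Tweets without mentions become the empty string "".
tweets_date$Mentions <- vapply(mentions, paste, character(1), collapse = " ")
```

Rows with no mentions end up as "" rather than NA; replace them afterwards if NA is preferred.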


Answer 3

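Via Twitter's help site: "Your username cannot be longer than 15 characters. Your real name can be longer (20 characters), but usernames are kept shorter for the sake of ease. A username can only contain alphanumeric characters (letters A-Z, numbers 0-9) with the exception of underscores, as noted above. Check to make sure your desired username doesn't contain any symbols, dashes, or spaces."

Note that email addresses can appear in tweets, as can URLs containing an @ (and not just the silly URLs with a username/password in the host component). Thus, something like: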

(^|[^[[:alnum:]_]@/\\!?=&])@([[:alnum:]_]{1,15})\\b

might be a better, safer choice.
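
Note that the first group of that pattern is part of the match, so the character in front of the @ is returned along with the mention. As a rough base R sketch of the same idea, the variant below moves that check into a PCRE negative lookbehind so only the mention itself is returned. It is an adaptation, not the exact pattern above, and the sample strings in `x` are invented for illustration.

```r
# Variant of the pattern: the "not preceded by" check is a lookbehind,
# so the preceding character is not included in the match.
pattern <- "(?<![[:alnum:]_@/!?=&])@[[:alnum:]_]{1,15}\\b"

# Hypothetical test strings: an email address, a normal mention,
# a URL containing "@", and a handle longer than 15 characters.
x <- c("mail me at user@example.com",
       "thanks @ChennaiIPL!",
       "see http://host/@path and @IPL",
       "@averyveryverylongusername")

# perl = TRUE is required for lookbehind support in base R.
found <- regmatches(x, gregexpr(pattern, x, perl = TRUE))

# Collapse to one space-separated string per element.
mentions <- vapply(found, paste, character(1), collapse = " ")
```

Here only "@ChennaiIPL" and "@IPL" are picked up; the email address, the "@path" URL segment, and the over-long handle are skipped.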

Extract words starting with @ in an R data frame and save them as a new column
