head(tweets_date$Tweet)
[1] b"It is @DineshKarthik's birthday and here's a rare image of the captain of @KKRiders. Have you seen him do this before? Happy birthday, DK\\xf0\\x9f\\x98\\xac
[2] b'The awesome @IPL officials do a wide range of duties to ensure smooth execution of work! Here\\xe2\\x80\\x99s @prabhakaran285 engaging with the @ChennaiIPL kid-squad that wanted to meet their daddies while the presentation was on :) #cutenessoverload #lineofduty \\xf0\\x9f\\x98\\x81
[3] b'\\xf0\\x9f\\x8e\\x89\\xf0\\x9f\\x8e\\x89\\n\\nCHAMPIONS!!
[4] b'CHAMPIONS - 2018 #IPLFinal
[5] b'Chennai are Super Kings. A fairytale comeback as @ChennaiIPL beat #SRH by 8 wickets to seal their third #VIVOIPL Trophy \\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86. This is their moment to cherish, a moment to savour.
[6] b"Final. It's all over! Chennai Super Kings won by 8 wickets
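These tweets contain mentions that start with "@". I need to extract all of them and save the mentions for each tweet as a single string of the form "@mention1 @mention2". At the moment my code only extracts them as a list.
My code: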
tweets_date$Mentions<-str_extract_all(tweets_date$Tweet, "@\\w+")
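How can I collapse the list in each row into a single space-separated string, as described above?
Thanks in advance.
Answer 0 (score: 4)
I believe an AsIs column works best in this case:
Extract the mentions: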
library(stringr)
Mentions <- str_extract_all(lis, "@\\w+")
Some data frame:
df <- data.frame(col = 1:6, lett = LETTERS[1:6])
Create a list column:
df$Mentions <- I(Mentions)
df
#output
  col lett     Mentions
1   1    A @DineshK....
2   2    B @IPL, @p....
3   3    C
4   4    D
5   5    E  @ChennaiIPL
6   6    F
I think this is preferable because it allows very simple subsetting:
df$Mentions[[1]]
#output
[1] "@DineshKarthik" "@KKRiders"
df$Mentions[[1]][1]
#output
[1] "@DineshKarthik"
It also shows the contents of the column concisely when df is printed.
Data:
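Because a list column keeps each tweet's mentions as a separate character vector, per-row summaries stay simple. The following is a minimal sketch, assuming a small made-up `Mentions` list rather than the real data; the names `n_mentions` and `with_mentions` are illustrative only.

```r
# Hypothetical mentions, standing in for the output of str_extract_all()
Mentions <- list(c("@DineshKarthik", "@KKRiders"), character(0), "@ChennaiIPL")

df <- data.frame(col = 1:3, lett = LETTERS[1:3])
df$Mentions <- I(Mentions)   # store the list as an AsIs list column

# Number of mentions in each row
n_mentions <- lengths(df$Mentions)

# Keep only the rows that mention at least one account
with_mentions <- df[n_mentions > 0, ]
```

Rows without any mention hold a zero-length vector, which is why `lengths()` can be used directly as a filter.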
lis <- c("b'It is @DineshKarthik's birthday and here's a rare image of the captain of @KKRiders. Have you seen him do this before? Happy birthday, DK\\xf0\\x9f\\x98\\xac",
"b'The awesome @IPL officials do a wide range of duties to ensure smooth execution of work! Here\\xe2\\x80\\x99s @prabhakaran285 engaging with the @ChennaiIPL kid-squad that wanted to meet their daddies while the presentation was on :) #cutenessoverload #lineofduty \\xf0\\x9f\\x98\\x81",
"b'\\xf0\\x9f\\x8e\\x89\\xf0\\x9f\\x8e\\x89\\n\\nCHAMPIONS!!",
"b'CHAMPIONS - 2018 #IPLFinal",
"b'Chennai are Super Kings. A fairytale comeback as @ChennaiIPL beat #SRH by 8 wickets to seal their third #VIVOIPL Trophy \\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86. This is their moment to cherish, a moment to savour.",
"b'Final. It's all over! Chennai Super Kings won by 8 wickets")
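Answer 1 (score: 3)
The str_extract_all function from the stringr package returns a list of character vectors. So if you want a single comma-separated string for each row, you can try sapply as a base R option: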
tweets <- str_extract_all(tweets_date$Tweet, "@\\w+")
tweets_date$Mentions <- sapply(tweets, function(x) paste(x, collapse=", "))
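The question asked for space-separated mentions, so a variant with `collapse = " "` may be closer to what was requested. The sketch below assumes a small made-up `tweets_date` data frame in place of the real one. It uses base R's `regmatches()` and `gregexpr()` instead of `str_extract_all()` so that it runs without extra packages, and `vapply()` so that the result is always a character vector.

```r
# Hypothetical sample data standing in for the real tweets_date
tweets_date <- data.frame(
  Tweet = c("It is @DineshKarthik's birthday, captain of @KKRiders.",
            "CHAMPIONS - 2018 #IPLFinal",
            "@ChennaiIPL beat #SRH by 8 wickets"),
  stringsAsFactors = FALSE
)

# Base R counterpart of str_extract_all(): a list of character vectors
mentions <- regmatches(tweets_date$Tweet,
                       gregexpr("@\\w+", tweets_date$Tweet))

# Collapse each vector into one space-separated string;
# tweets without mentions yield an empty string
tweets_date$Mentions <- vapply(mentions, paste, character(1), collapse = " ")
```

Swapping `" "` for `", "` reproduces the comma-separated form shown above.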
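Answer 2 (score: 1)
From Twitter's help site: "Your username cannot be longer than 15 characters. Your real name can be longer (20 characters), but usernames are kept shorter for the sake of ease. A username can only contain alphanumeric characters (letters A-Z, numbers 0-9) with the exception of underscores, as noted above. Check to make sure your desired username doesn't contain any symbols, dashes, or spaces."
Note that email addresses can appear in tweets, and so can URLs that contain an @ (and not just the silly URLs with a username/password in the host component). So something like: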
(^|[^[[:alnum:]_]@/\\!?=&])@([[:alnum:]_]{1,15})\\b
may be a better, safer choice.
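To illustrate the difference, here is a rough sketch that compares the simple `@\\w+` pattern with a stricter one on a made-up tweet containing an email address. Note the assumptions: the stricter pattern is an adaptation of the one above for base R with `perl = TRUE` (the nested bracket is flattened into a single negated class), not the answer's exact expression, and the sample text and variable names are hypothetical.

```r
# Hypothetical tweet containing an email address and two real mentions
tweet <- "Mail foo@example.com or ping @KKRiders and @ChennaiIPL!"

# Simple pattern: also picks up the domain part of the email address
naive <- regmatches(tweet, gregexpr("@\\w+", tweet))[[1]]

# Stricter pattern (adapted for PCRE): the "@" must be at the start of the
# string or follow a character that is not alphanumeric, "_", "@", "/",
# "!", "?", "=" or "&"; the name itself is 1 to 15 word characters
strict_pattern <- "(^|[^[:alnum:]_@/\\!?=&])@([[:alnum:]_]{1,15})\\b"
hits <- regmatches(tweet, gregexpr(strict_pattern, tweet, perl = TRUE))[[1]]

# Each hit may include the single preceding character; drop everything
# before the "@"
strict <- sub("^[^@]*", "", hits)
```

Here `naive` includes `"@example"` from the email address, while `strict` keeps only the two account mentions.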