I have written code to extract the first three suspected security codes from comment data:
flag_stock_codes <- function(df) {
  # NYSE has 3 digit codes, NASDAQ has 2-5.
  # There aren't a lot of 2 digit codes though, so we will use 3-5 to avoid excess false positives.
  df <- df %>%
    mutate(sec_code_1 = unlist(str_extract_all(title, "\\b[A-Z]{3,5}+\\b")[[1]][1]),
           sec_code_2 = unlist(str_extract_all(title, "\\b[A-Z]{3,5}+\\b")[[1]][2]),
           sec_code_3 = unlist(str_extract_all(title, "\\b[A-Z]{3,5}+\\b")[[1]][3]))
  df
}
# test 1
test %>% filter(id %in% c("l98qhb","l98ppp")) %>% flag_stock_codes()
Output:
id title score author author_flair_text removed_by
1 l98qhb IF NOK HITS $500/SHARE, I'LL TATTOO DIAMONDS ON MY HANDS 1 Money_trees_planted moderator
2 l98ppp AMC GME TO THE MARS 1 tehspiekguy moderator
total_awards_received awarders created_utc
1 0 [] 1612084105
2 0 [] 1612084011
full_link num_comments over_18
1 https://www.reddit.com/r/wallstreetbets/comments/l98qhb/if_nok_hits_500share_ill_tattoo_diamonds_on_my/ 0 False
2 https://www.reddit.com/r/wallstreetbets/comments/l98ppp/amc_gme_to_the_mars/ 0 False
sec_code_1 sec_code_2 sec_code_3
1 NOK HITS SHARE
2 NOK HITS SHARE
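However, I noticed that my logic only extracts the codes from the first row and then fills every other row with those values. I would like it to operate row by row on each comment,
i.e. the last three columns should be:
sec_code_1 | sec_code_2 | sec_code_3 |
---|---|---|
NOK | HITS | SHARE |
AMC | GME | THE |
Does anyone know how I can modify my logic to achieve this?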
Answer 0 (score: 0)
It only extracts from the first row because that is what you tell it to do when you write [[1]] in your code. You can use map_chr together with pluck instead.
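library(tidyverse)
flag_stock_codes <- function(df) {
  df %>%
    # map_chr() visits each title's matches; pluck() takes the i-th one.
    # .default = NA_character_ covers titles with fewer than i matches.
    mutate(sec_code_1 = map_chr(str_extract_all(title, "\\b[A-Z]{3,5}+\\b"), pluck, 1, .default = NA_character_),
           sec_code_2 = map_chr(str_extract_all(title, "\\b[A-Z]{3,5}+\\b"), pluck, 2, .default = NA_character_),
           sec_code_3 = map_chr(str_extract_all(title, "\\b[A-Z]{3,5}+\\b"), pluck, 3, .default = NA_character_))
}
test %>% filter(id %in% c("l98qhb","l98ppp")) %>% flag_stock_codes()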
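
To see why [[1]] fills every row with the same values, it may help to look at what str_extract_all returns for the two titles from the question. The snippet below is only a small sketch using stringr on its own; the variable names are made up for illustration.

```r
library(stringr)

# The two titles from the question's sample output.
titles <- c(
  "IF NOK HITS $500/SHARE, I'LL TATTOO DIAMONDS ON MY HANDS",
  "AMC GME TO THE MARS"
)

# str_extract_all() returns a list with one character vector per title.
matches <- str_extract_all(titles, "\\b[A-Z]{3,5}\\b")

# [[1]] selects the matches of the FIRST title only, so this is a single
# string. Inside mutate() a length-one value is recycled to every row.
first_title_only <- matches[[1]][1]

# Iterating over the list instead takes the first match of EACH title.
first_of_each <- vapply(matches, function(m) m[1], character(1))

print(first_title_only)
print(first_of_each)
```

Here first_title_only is a single value, while first_of_each has one entry per title, which is the shape mutate() needs for a row-wise result.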
Then again, had you followed my answer on your previous question, you would not have run into this problem in the first place.
flag_stock_codes <- function(df) {
  df %>%
    mutate(code = str_extract_all(title,"\\b[A-Z]{3,5}\\b")) %>%
    unnest_wider(code) %>%
    rename_with(~paste0('sec_code_', seq_along(.)), starts_with('..'))
}
test %>%
  filter(id %in% c("l98qhb","l98ppp")) %>%
  flag_stock_codes()
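
If exactly three columns are wanted even when a title has fewer than three matches, one option is to pad with NA by hand. The following is a minimal base R sketch, not part of the original answer; first_n_codes is a hypothetical helper name, and it uses regmatches() with gregexpr() in place of str_extract_all().

```r
# Hypothetical helper (an assumption for illustration): return the first n
# matches per title as an n-column character matrix, padded with NA.
first_n_codes <- function(titles, n = 3, pattern = "\\b[A-Z]{3,5}\\b") {
  # One character vector of matches per title.
  matches <- regmatches(titles, gregexpr(pattern, titles, perl = TRUE))
  # Indexing past the end of a vector yields NA, so short results are padded.
  rows <- lapply(matches, function(m) m[seq_len(n)])
  out <- do.call(rbind, rows)
  colnames(out) <- paste0("sec_code_", seq_len(n))
  out
}

codes <- first_n_codes(c(
  "IF NOK HITS $500/SHARE, I'LL TATTOO DIAMONDS ON MY HANDS",
  "AMC GME TO THE MARS",
  "TO THE MOON"
))
print(codes)
```

The result could then be attached to the data with something like cbind(df, first_n_codes(df$title)), assuming df has a title column as in the question.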