Data frame and text mining

Time: 2020-05-29 15:51:41

Tags: r text stringr

My attempt:

library(stringr)
data<-data.frame(id=c(1,2,3), 
          text=c("This is (2020) text; mining exercise (1999)","Text analysis (1975) is; bit confusing (2012)","Hint (1998) on; this text (2007) analysis?"))

a <- b <- list()
mm <- data.frame(a=NA,b=NA)
for(i in 1:length(data$text)){
   a[[i]] <- lengths(strsplit(as.character(data$text[i]),";"))
   b[[i]] <- str_count(data$text[i], "\\(19[0-9]{2}\\)|\\(20[0-9]{2}\\)")
}

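Why don't I get the corresponding values for each row of the data frame mm? The code runs without errors, yet mm still contains only the initial NA row:

# mm
#    a  b
# 1 NA NA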

Expected output:

mm

2 answers:

Answer 0 (score: 2)

After the loop finishes you have two lists, a and b, and they hold the expected values. For example:

a
[[1]]
[1] 2

[[2]]
[1] 2

[[3]]
[1] 2

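But you never assign these values to the data.frame, so mm keeps its initial NA row. Unlist both lists and build the data.frame from them: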

mm <- data.frame(a=unlist(a),b=unlist(b))
mm
  a b
1 2 2
2 2 2
3 2 2
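
Because strsplit() and gregexpr() are vectorized over their input, the loop and the intermediate lists are not strictly needed. The following is a minimal sketch in base R, assuming the same data as in the question (recreated here so the snippet runs on its own). The name year_pattern is introduced only for illustration.

```r
# Recreate the question's data so the snippet is self-contained.
data <- data.frame(
  id = c(1, 2, 3),
  text = c("This is (2020) text; mining exercise (1999)",
           "Text analysis (1975) is; bit confusing (2012)",
           "Hint (1998) on; this text (2007) analysis?"),
  stringsAsFactors = FALSE
)

# Illustrative name: a four-digit year starting with 19 or 20, in parentheses.
year_pattern <- "\\((19|20)[0-9]{2}\\)"

# Number of ";"-separated pieces in each row.
a <- lengths(strsplit(data$text, ";", fixed = TRUE))

# Number of parenthesised years in each row (0 when there is no match).
b <- lengths(regmatches(data$text, gregexpr(year_pattern, data$text)))

# Build the result directly instead of filling a pre-allocated NA frame.
mm <- data.frame(a = a, b = b)
print(mm)
```

Building mm from the finished vectors avoids the situation in the question, where the loop filled a and b but mm was never updated.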

Answer 1 (score: 1)

An option with tidyverse:
library(dplyr)
library(stringr)
library(purrr)
data %>% 
   transmute(out = str_split(text, ";")) %>% 
   transmute(a = lengths(out),
       b = lengths(map(out, ~ str_extract(.x, "(?<=(19|20))[0-9]{2}\\b"))))
#  a b
#1 2 2
#2 2 2
#3 2 2
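
Note that str_extract() returns one element (possibly NA) per input string, so in the pipeline above b is the number of ";"-separated pieces rather than the number of matched years. The two happen to coincide for this data. A hedged variant that counts the years directly with str_count() might look like the sketch below. It assumes dplyr and stringr are installed and recreates the question's data so that it is self-contained.

```r
library(dplyr)
library(stringr)

# Recreate the question's data so the snippet runs on its own.
data <- data.frame(
  id = c(1, 2, 3),
  text = c("This is (2020) text; mining exercise (1999)",
           "Text analysis (1975) is; bit confusing (2012)",
           "Hint (1998) on; this text (2007) analysis?"),
  stringsAsFactors = FALSE
)

mm <- data %>%
  transmute(
    # Number of ";"-separated pieces per row.
    a = lengths(str_split(text, ";")),
    # Number of parenthesised years (19xx or 20xx) per row.
    b = str_count(text, "\\((19|20)[0-9]{2}\\)")
  )
print(mm)
```

With this version, a row containing a piece without a year would be counted correctly in b instead of inheriting the piece count.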