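Thanks in advance! I've been at this for a few days and I'm a bit stuck. I'm trying to loop through a text file (imported as a list) and build a data frame from it. If an item in the list contains a day of the week, it should start a new row in the data frame and go in the first column (V1). I want to put the rest of the comments in the second column (V2), which probably means concatenating strings. I'm trying to use a condition with grepl(), but I'm lost on the logic once the initial data frame is set up.
Here is sample text that I brought into R (it is Facebook data from a text file). The [] indicates the list number. It's a lengthy file (50K+ lines), but I have the date column set up.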
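[1] Thursday, August 25, 2016 at 3:57pm EDT
[2] Football time!! We need to make plans!!!! I texted my guy, though haven't been in touch sense last year. So we'll see on my end!!! What do you have cooking???
[3] Sunday, August 14, 2016 at 9:17am EDT
[4]Michael shared Jason post.
[5]This bird is a lot smarter than the majority of political posts I have read recently here
[6] Sunday, August 14, 2016 at 8:44am EDT
[7] Michael and Kurt are now friends.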
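The end result should be a data frame in which each day of the week starts a new row and the rest of the list items are concatenated into the second column. So the final data frame would be:
Row 1 ([1] in V1, [2] in V2)
Row 2 ([3] in V1, and [4], [5] in V2)
Row 3 ([6] in V1, [7] in V2)
Here is the start of my code. I can populate V1 correctly, but not the second column of the data frame.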
### Read in the text file
temp <- readLines("C:/Program Files/R/Text Mining/testa.txt")
### Remove empty lines from the text file
temp <- temp[temp!=""]
### Create the temp char file as a list file
tmp <- as.list(temp)
### A days vector for searching through the list of days.
days <- c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday","Friday", "Saturday")
df <- {}
### Loop through the list
for (n in 1:length(tmp)){
    ### Search to see if there is a day in the list item
    for(i in 1:length(days)){
        if(grepl(days[i], tmp[n])==1){
            ### Bind the row to the df if there is a day in the list item
            df<- rbind(df, tmp[n])
        }
    }
    ### I know this is wrong, I am trying to create a vector to concatenate and add to the data frame, but I am struggling here.
    d <- c(d, tmp[n])
}
Answer 0 (score: 1)
Here is an option using the tidyverse:
library(tidyverse)
text <- "[1] Thursday, August 25, 2016 at 3:57pm EDT
[2] Football time!! We need to make plans!!!! I texted my guy, though haven't been in touch sense last year. So we'll see on my end!!! What do you have cooking???
[3]Sunday, August 14, 2016 at 9:17am EDT
[4]Michael shared Jason post.
[5]This bird is a lot smarter than the majority of political posts I have read recently here
[6]Sunday, August 14, 2016 at 8:44am EDT
[7]Michael and Kurt are now friends."
# weekday names to search for, spelled out so the match does not depend on the locale
days <- c('Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday')
df <- tibble(lines = read_lines(text)) %>%    # read data, set up tibble
    filter(lines != '') %>%    # filter out empty lines
    # set grouping by cumulative number of rows with weekdays in them
    group_by(grp = cumsum(grepl(paste(days, collapse = '|'), lines))) %>%
    # collapse each group to two columns
    summarise(V1 = lines[1], V2 = list(lines[-1]))
df
## # A tibble: 3 × 3
##     grp V1                                          V2
##   <int> <chr>                                       <list>
## 1     1 [1] Thursday, August 25, 2016 at 3:57pm EDT <chr [1]>
## 2     2 [3]Sunday, August 14, 2016 at 9:17am EDT    <chr [2]>
## 3     3 [6]Sunday, August 14, 2016 at 8:44am EDT    <chr [1]>
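This approach uses a list column for V2, which is probably the best approach in terms of preserving your data, but use paste or toString if you need a single string instead.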
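If one string per row is preferred over a list column, the collapse could look something like the sketch below. The small `posts` data frame is made up for illustration and only mimics the shape of the result above; `V2_text` and `V2_space` are hypothetical column names.

```r
# Toy data shaped like the result above: V2 is a list column of character vectors.
# (Made-up stand-in for illustration, not the asker's real file.)
posts <- data.frame(V1 = c("[1] Thursday, August 25, 2016 at 3:57pm EDT",
                           "[3]Sunday, August 14, 2016 at 9:17am EDT"),
                    stringsAsFactors = FALSE)
posts$V2 <- list("[2] Football time!!",
                 c("[4]Michael shared Jason post.", "[5]This bird is a lot smarter"))

# toString() joins the pieces of each element with ", "
posts$V2_text <- vapply(posts$V2, toString, character(1))

# paste(collapse = ) does the same but lets you choose the separator
posts$V2_space <- vapply(posts$V2, paste, character(1), collapse = " ")
```

Either way the list column stays in place, so the original pieces are still available if the joined string turns out not to be enough.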
A roughly equivalent approach in base R:
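days <- c('Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday')
df <- data.frame(V2 = readLines(textConnection(text)), stringsAsFactors = FALSE)
df <- df[df$V2 != '', , drop = FALSE]    # drop empty lines
# group id: cumulative count of lines that contain a weekday
df$grp <- cumsum(grepl(paste(days, collapse = '|'), df$V2))
# the first line of each group (the date line) goes in V1
df$V1 <- ave(df$V2, df$grp, FUN = function(x){x[1]})
# the remaining lines of each group are gathered in V2
df <- aggregate(V2 ~ grp + V1, df, FUN = function(x){x[-1]})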
df
## grp V1
## 1 1 [1] Thursday, August 25, 2016 at 3:57pm EDT
## 2 2 [3]Sunday, August 14, 2016 at 9:17am EDT
## 3 3 [6]Sunday, August 14, 2016 at 8:44am EDT
## V2
## 1 [2] Football time!! We need to make plans!!!! I texted my guy, though haven't been in touch sense last year. So we'll see on my end!!! What do you have cooking???
## 2 [4]Michael shared Jason post., [5]This bird is a lot smarter than the majority of political posts I have read recently here
## 3 [7]Michael and Kurt are now friends.
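Both versions hinge on the same idea: `grepl()` flags the lines that contain a weekday, and `cumsum()` turns those flags into a running group number, so every line inherits the group of the most recent date line. A minimal base R sketch of that idea, which also pastes the remaining lines into a single string as the question asks, might look like this. The names `lines`, `is_date`, `grp` and `result` are illustrative, the sample lines are abbreviated from the question, and the sketch assumes the first line is a date line and that day names are in English.

```r
# Sample lines, abbreviated from the question's data.
lines <- c("[1] Thursday, August 25, 2016 at 3:57pm EDT",
           "[2] Football time!!",
           "[3]Sunday, August 14, 2016 at 9:17am EDT",
           "[4]Michael shared Jason post.",
           "[5]This bird is a lot smarter",
           "[6]Sunday, August 14, 2016 at 8:44am EDT",
           "[7]Michael and Kurt are now friends.")

days <- c("Sunday", "Monday", "Tuesday", "Wednesday",
          "Thursday", "Friday", "Saturday")

# TRUE where a line contains a weekday name, i.e. where a new row starts.
is_date <- grepl(paste(days, collapse = "|"), lines)

# Running count of date lines, used as the group id of every line.
grp <- cumsum(is_date)

# Non-date lines split by group; the factor levels keep groups that have
# a date line but no following text (they become empty strings in V2).
rest <- split(lines[!is_date],
              factor(grp[!is_date], levels = seq_len(sum(is_date))))

# One row per group: the date line in V1, everything else pasted into V2.
result <- data.frame(
  V1 = lines[is_date],
  V2 = unname(vapply(rest, paste, character(1), collapse = " ")),
  stringsAsFactors = FALSE
)
```

One caveat with any of these approaches: a comment that happens to mention a weekday would also start a new row, so for the full 50K-line file a stricter pattern (for example, one anchored to the date format) may be worth considering.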