Creating a data frame by looping through text

Time: 2016-11-22 04:58:00

Tags: r list loops text dataframe

Thanks in advance! I have been trying for a few days and I am stuck. I am trying to loop through a text file (imported as a list) and build a data frame from it. If an item in the list contains a day of the week, the data frame should start a new row, with that item filling the first column (V1). I want the rest of the comments to go into the second column (V2), where I will probably need to concatenate the strings together. I have tried using grepl() in a condition, but beyond setting up the initial data frame I am lost on the logic.

Here is a sample of the text I bring into R (it is Facebook data from a text file). The [] indicate the list numbering. It is a lengthy file (50K+ lines), but I have the date column set up.

[1] Thursday, August 25, 2016 at 3:57pm EDT

[2] Football time!! We need to make plans!!!! I texted my guy, though haven't been in touch sense last year. So we'll see on my end!!! What do you have cooking???

[3] Sunday, August 14, 2016 at 9:17am EDT

[4] Michael shared Jason post.

[5] This bird is a lot smarter than the majority of political posts I have read recently here

[6] Sunday, August 14, 2016 at 8:44am EDT

[7] Michael and Kurt are now friends.

The end result will be a data frame in which each day-of-week line starts a new row and the remaining list items are concatenated into the second column. So the final data frame would be:

Row 1 ([1] in V1, [2] in V2)

Row 2 ([3] in V1, and [4] and [5] in V2)

Row 3 ([6] in V1, [7] in V2)
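
Mocked up literally as an R object, with the longer comment strings abbreviated for illustration, the target would look like:

data.frame(
    V1 = c("[1] Thursday, August 25, 2016 at 3:57pm EDT",
           "[3] Sunday, August 14, 2016 at 9:17am EDT",
           "[6] Sunday, August 14, 2016 at 8:44am EDT"),
    V2 = c("[2] Football time!! ...",
           "[4] Michael shared Jason post. [5] This bird is ...",
           "[7] Michael and Kurt are now friends."),
    stringsAsFactors = FALSE
)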

Here is the start of my code. I can populate V1 correctly, but not the second column of the data frame.

### Read in the text file
temp <- readLines("C:/Program Files/R/Text Mining/testa.txt")

### Remove empty lines from the text file
temp <- temp[temp!=""]

### Convert the character vector to a list
tmp <- as.list(temp)

### A vector of day names to search for in the list items
days <- c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday","Friday", "Saturday")
### Initialize the data frame and the helper vector used below
df <- NULL
d <- c()

### Loop through the list
for (n in 1:length(tmp)){

    ### Search to see if there is a day in the list item
    for(i in 1:length(days)){
            if (grepl(days[i], tmp[n])) {
    ### Bind the row to the df if there is a day in the list item
                    df<- rbind(df, tmp[n])
            }
    }
### I know this is wrong; I am trying to build a vector to concatenate and add to the data frame, but I am struggling here.
d <- c(d, tmp[n])
}
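
For reference, here is a minimal sketch of one way the loop itself could be made to work, reusing the temp and days objects above (day_pattern, V1, and V2 are illustrative names, not from the original code):

### Collapse the day names into one regular expression
day_pattern <- paste(days, collapse = "|")

V1 <- character(0)  # the weekday/date lines
V2 <- character(0)  # the concatenated comment lines
for (n in seq_along(temp)) {
    if (grepl(day_pattern, temp[n])) {
        ### A weekday line starts a new record
        V1 <- c(V1, temp[n])
        V2 <- c(V2, "")
    } else if (length(V2) > 0) {
        ### Otherwise append this line to the current record's comments
        V2[length(V2)] <- trimws(paste(V2[length(V2)], temp[n]))
    }
}
df <- data.frame(V1, V2, stringsAsFactors = FALSE)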

1 Answer:

Answer 0 (score: 1)

Here is an option using the tidyverse:

library(tidyverse)

text <- "[1] Thursday, August 25, 2016 at 3:57pm EDT

[2] Football time!! We need to make plans!!!! I texted my guy, though haven't been in touch sense last year. So we'll see on my end!!! What do you have cooking???

[3]Sunday, August 14, 2016 at 9:17am EDT

[4]Michael shared Jason post.

[5]This bird is a lot smarter than the majority of political posts I have read recently here

[6]Sunday, August 14, 2016 at 8:44am EDT

[7]Michael and Kurt are now friends."

df <- data_frame(lines = read_lines(text)) %>%    # read data, set up data.frame
    filter(lines != '') %>%    # filter out empty lines
    # set grouping by cumulative number of rows with weekdays in them
    group_by(grp = cumsum(grepl(paste(weekdays(as.Date(1:7, origin = '1970-01-01'), abbreviate = FALSE), collapse = '|'), lines))) %>%
    # collapse each group to two columns
    summarise(V1 = lines[1], V2 = list(lines[-1]))

df
## # A tibble: 3 × 3
##     grp                                          V1        V2
##   <int>                                       <chr>    <list>
## 1     1 [1] Thursday, August 25, 2016 at 3:57pm EDT <chr [1]>
## 2     2    [3]Sunday, August 14, 2016 at 9:17am EDT <chr [2]>
## 3     3    [6]Sunday, August 14, 2016 at 8:44am EDT <chr [1]>

This uses a list column for V2, which is probably the best way to store this data, but if you need a flat character column, collapse it with paste or toString.
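
For example, assuming the pipeline above, something along these lines (using purrr's map_chr, loaded with the tidyverse) would flatten the list column:

df <- df %>% mutate(V2 = map_chr(V2, toString))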

Roughly equivalent base R:

df <- data.frame(V2 = readLines(textConnection(text)), stringsAsFactors = FALSE)

df <- df[df$V2 != '', , drop = FALSE]

df$grp <- cumsum(grepl(paste(weekdays(as.Date(1:7, origin = '1970-01-01'), abbreviate = FALSE), collapse = '|'), df$V2))

df$V1 <- ave(df$V2, df$grp, FUN = function(x){x[1]})

df <- aggregate(V2 ~ grp + V1, df, FUN = function(x){x[-1]})

df
##   grp                                          V1
## 1   1 [1] Thursday, August 25, 2016 at 3:57pm EDT
## 2   2    [3]Sunday, August 14, 2016 at 9:17am EDT
## 3   3    [6]Sunday, August 14, 2016 at 8:44am EDT
##                                                                                                                                                                   V2
## 1 [2] Football time!! We need to make plans!!!! I texted my guy, though haven't been in touch sense last year. So we'll see on my end!!! What do you have cooking???
## 2                                        [4]Michael shared Jason post., [5]This bird is a lot smarter than the majority of political posts I have read recently here
## 3                                                                                                                               [7]Michael and Kurt are now friends.
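
In both versions the key step is the grouping: grepl() flags the lines that contain a weekday name, and cumsum() turns those flags into a running group id that increments at each weekday line and stays constant until the next one. A toy illustration:

x <- c("Sunday ...", "comment a", "comment b", "Monday ...", "comment c")
grepl("Sunday|Monday", x)           # TRUE FALSE FALSE TRUE FALSE
cumsum(grepl("Sunday|Monday", x))   # 1 1 1 2 2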