Creating a data frame by looping through text

Time: 2016-11-22 04:58:00

Tags: r list loops text dataframe

Thanks in advance! I have been trying for a few days and I am stuck. I am trying to loop through a text file (imported as a list) and build a data frame from it. If an item in the list contains a day of the week, the data frame should start a new row, with that item filling the first column (V1). I want the rest of the comments to go into the second column (V2), where I will probably need to concatenate the strings together. I have tried using grepl() in a condition, but beyond setting up the initial data frame I am lost on the logic.

Here is a sample of the text I bring into R (it is Facebook data from a text file). The [] indicate the list numbering. It is a lengthy file (50K+ lines), but I have the date column set up.

[1] Thursday, August 25, 2016 at 3:57pm EDT

[2] Football time!! We need to make plans!!!! I texted my guy, though haven't been in touch sense last year. So we'll see on my end!!! What do you have cooking???

[3] Sunday, August 14, 2016 at 9:17am EDT

[4] Michael shared Jason post.

[5] This bird is a lot smarter than the majority of political posts I have read recently here

[6] Sunday, August 14, 2016 at 8:44am EDT

[7] Michael and Kurt are now friends.

The end result will be a data frame in which each day-of-week line starts a new row and the remaining list items are concatenated into the second column. So the final data frame would be:

Row 1 ([1] in V1, [2] in V2)

Row 2 ([3] in V1, and [4] and [5] in V2)

Row 3 ([6] in V1, [7] in V2)
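
Mocked up literally as an R object, with the longer comment strings abbreviated for illustration, the target would look like:

data.frame(
    V1 = c("[1] Thursday, August 25, 2016 at 3:57pm EDT",
           "[3] Sunday, August 14, 2016 at 9:17am EDT",
           "[6] Sunday, August 14, 2016 at 8:44am EDT"),
    V2 = c("[2] Football time!! ...",
           "[4] Michael shared Jason post. [5] This bird is ...",
           "[7] Michael and Kurt are now friends."),
    stringsAsFactors = FALSE
)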

Here is the start of my code. I can populate V1 correctly, but not the second column of the data frame.

### Read in the text file
temp <- readLines("C:/Program Files/R/Text Mining/testa.txt")

### Remove empty lines from the text file
temp <- temp[temp!=""]

### Convert the character vector to a list
tmp <- as.list(temp)

### A vector of day names to search for in the list items
days <- c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday","Friday", "Saturday")
### Initialize the data frame and the helper vector used below
df <- NULL
d <- c()

### Loop through the list
for (n in 1:length(tmp)){

    ### Search to see if there is a day in the list item
    for(i in 1:length(days)){
            if (grepl(days[i], tmp[n])) {
    ### Bind the row to the df if there is a day in the list item
                    df<- rbind(df, tmp[n])
            }
    }
### I know this is wrong; I am trying to build a vector to concatenate and add to the data frame, but I am struggling here.
d <- c(d, tmp[n])
}
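
For reference, here is a minimal sketch of one way the loop itself could be made to work, reusing the temp and days objects above (day_pattern, V1, and V2 are illustrative names, not from the original code):

### Collapse the day names into one regular expression
day_pattern <- paste(days, collapse = "|")

V1 <- character(0)  # the weekday/date lines
V2 <- character(0)  # the concatenated comment lines
for (n in seq_along(temp)) {
    if (grepl(day_pattern, temp[n])) {
        ### A weekday line starts a new record
        V1 <- c(V1, temp[n])
        V2 <- c(V2, "")
    } else if (length(V2) > 0) {
        ### Otherwise append this line to the current record's comments
        V2[length(V2)] <- trimws(paste(V2[length(V2)], temp[n]))
    }
}
df <- data.frame(V1, V2, stringsAsFactors = FALSE)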

1 Answer:

Answer 0 (score: 1)

Here is an option using the tidyverse:

library(tidyverse)

text <- "[1] Thursday, August 25, 2016 at 3:57pm EDT

[2] Football time!! We need to make plans!!!! I texted my guy, though haven't been in touch sense last year. So we'll see on my end!!! What do you have cooking???

[3]Sunday, August 14, 2016 at 9:17am EDT

[4]Michael shared Jason post.

[5]This bird is a lot smarter than the majority of political posts I have read recently here

[6]Sunday, August 14, 2016 at 8:44am EDT

[7]Michael and Kurt are now friends."

df <- data_frame(lines = read_lines(text)) %>%    # read data, set up data.frame
    filter(lines != '') %>%    # filter out empty lines
    # set grouping by cumulative number of rows with weekdays in them
    group_by(grp = cumsum(grepl(paste(weekdays(as.Date(1:7, origin = '1970-01-01'), abbreviate = FALSE), collapse = '|'), lines))) %>%
    # collapse each group to two columns
    summarise(V1 = lines[1], V2 = list(lines[-1]))

df
## # A tibble: 3 × 3
##     grp                                          V1        V2
##   <int>                                       <chr>    <list>
## 1     1 [1] Thursday, August 25, 2016 at 3:57pm EDT <chr [1]>
## 2     2    [3]Sunday, August 14, 2016 at 9:17am EDT <chr [2]>
## 3     3    [6]Sunday, August 14, 2016 at 8:44am EDT <chr [1]>

This uses a list column for V2, which is probably the best way to store this data, but if you need a flat character column, collapse it with paste or toString.
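
For example, assuming the pipeline above, something along these lines (using purrr's map_chr, loaded with the tidyverse) would flatten the list column:

df <- df %>% mutate(V2 = map_chr(V2, toString))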

Roughly equivalent base R:

df <- data.frame(V2 = readLines(textConnection(text)), stringsAsFactors = FALSE)

df <- df[df$V2 != '', , drop = FALSE]

df$grp <- cumsum(grepl(paste(weekdays(as.Date(1:7, origin = '1970-01-01'), abbreviate = FALSE), collapse = '|'), df$V2))

df$V1 <- ave(df$V2, df$grp, FUN = function(x){x[1]})

df <- aggregate(V2 ~ grp + V1, df, FUN = function(x){x[-1]})

df
##   grp                                          V1
## 1   1 [1] Thursday, August 25, 2016 at 3:57pm EDT
## 2   2    [3]Sunday, August 14, 2016 at 9:17am EDT
## 3   3    [6]Sunday, August 14, 2016 at 8:44am EDT
##                                                                                                                                                                   V2
## 1 [2] Football time!! We need to make plans!!!! I texted my guy, though haven't been in touch sense last year. So we'll see on my end!!! What do you have cooking???
## 2                                        [4]Michael shared Jason post., [5]This bird is a lot smarter than the majority of political posts I have read recently here
## 3                                                                                                                               [7]Michael and Kurt are now friends.
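
In both versions the key step is the grouping: grepl() flags the lines that contain a weekday name, and cumsum() turns those flags into a running group id that increments at each weekday line and stays constant until the next one. A toy illustration:

x <- c("Sunday ...", "comment a", "comment b", "Monday ...", "comment c")
grepl("Sunday|Monday", x)           # TRUE FALSE FALSE TRUE FALSE
cumsum(grepl("Sunday|Monday", x))   # 1 1 1 2 2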