Question

I know of some R packages for text mining, such as tm, but I could not get them to work for my task. I have a text file with data like this:

 452924301037
    5May2014
       John
    7May2014
       Mark
       Sam
 452924302789
    6May2014
       Bill

I would like to get the above data into a data frame that looks like this:

UserID, Date, Names
452924301037,5May2014,John
452924301037,7May2014,Mark Sam
452924302789,6May2014,Bill

How can I do this in R?

Example 2:

Input text file:

452924301037
    5May2014
       John
           Cricket
           Football
    7May2014
       Mark
           Hockey
452924302789
     6May2014
       Bill
           Billiards

I want to build a data frame like this:

Game, Player, Date, UserID 
Cricket, John, 5May2014, 452924301037
Football, John, 5May2014, 452924301037
Hockey, Mark, 7May2014, 452924301037
Billiards, Bill, 6May2014, 452924302789

Answer 1

A possible solution using data.table and zoo:

# read the textfile
txt <- readLines('textlines.txt')

# load the needed packages
library(zoo)
library(data.table)

# convert the text to a data.table (an enhanced form of a data.frame);
# trimws() strips the indentation so it does not end up in the output
DT <- data.table(txt = trimws(txt))

# extract the info into new columns
DT[grepl('\\d{8,}', txt), User_id := txt
   ][grepl('\\D{3}\\d{4}', txt), Date := txt
     ][, (c('User_id','Date')) := lapply(.SD, na.locf, na.rm = FALSE), .SDcols = 2:3
       ][txt != User_id & txt != Date, .(Names = paste0(txt, collapse = ' ')), by = .(User_id, Date)]

which gives:

        User_id     Date    Names
1: 452924301037 5May2014     John
2: 452924301037 7May2014 Mark Sam
3: 452924302789 6May2014     Bill

To see what each step does, run the following code:

# extract the user ids: lines containing a run of 8 or more digits
DT[grepl('\\d{8,}', txt), User_id := txt][]
# extract the dates: three non-digits (the month) followed by four digits (the year)
DT[grepl('\\D{3}\\d{4}', txt), Date := txt][]
# fill the NA-values of 'User_id' and 'Date' with na.locf from the zoo package
DT[, (c('User_id','Date')) := lapply(.SD, na.locf, na.rm = FALSE), .SDcols = 2:3][]
# filter out the rows where the 'txt'-column has either a 'User_id' or a 'Date'
# collapse the names into one string by 'User_id' and 'Date'
DT[txt != User_id & txt != Date, .(Names = paste0(txt, collapse = ' ')), by = .(User_id, Date)][]
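The forward-fill idea behind these steps can also be sketched in base R, without data.table or zoo. The sketch below is only an illustration under a few assumptions: the sample lines from the question are inlined instead of being read from a file, the anchored regular expressions are stricter variants of the ones above, and the names `is_id`, `is_date`, `fill` and `res` are made up for this example. It uses `cumsum()` on a logical flag to find, for each line, the most recent ID or date line.

```r
# Sample lines from the question, inlined so the sketch runs without a file
raw <- c(" 452924301037", "    5May2014", "       John",
         "    7May2014", "       Mark", "       Sam",
         " 452924302789", "    6May2014", "       Bill")
txt <- trimws(raw)

# Classify each line (assumed formats: 8+ digits for an ID, dMonYYYY for a date)
is_id   <- grepl("^\\d{8,}$", txt)
is_date <- grepl("^\\d{1,2}[A-Za-z]{3}\\d{4}$", txt)

# Forward fill: cumsum(flag) counts how many flagged lines have been seen so far;
# the leading NA covers lines that come before the first flagged line
fill <- function(flag) c(NA, txt[flag])[cumsum(flag) + 1]

keep <- !is_id & !is_date
long <- data.frame(UserID = fill(is_id)[keep],
                   Date   = fill(is_date)[keep],
                   Names  = txt[keep],
                   stringsAsFactors = FALSE)

# Collapse the names per UserID/Date combination
res <- aggregate(Names ~ UserID + Date, data = long, FUN = paste, collapse = " ")
print(res)
```

Note that `aggregate()` sorts the result by the grouping columns, so the row order may differ from the order of the input file.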

For the second example that was added, you can do this:

DT <- data.table(txt = trimws(txt))

# same steps as before, except that the first remaining line of each
# (User_id, Date) group is taken as the player name and the lines after it
# as that player's games
DT[grepl('\\d{8,}', txt), User_id := txt
   ][grepl('\\D{3}\\d{4}', txt), Date := txt
     ][, (c('User_id','Date')) := lapply(.SD, na.locf, na.rm = FALSE), .SDcols = 2:3
       ][txt != User_id & txt != Date
         ][, Name := txt[1], by = .(User_id, Date)
           ][Name != txt]

which gives:

         txt      User_id     Date Name
1:   Cricket 452924301037 5May2014 John
2:  Football 452924301037 5May2014 John
3:    Hockey 452924301037 7May2014 Mark
4: Billiards 452924302789 6May2014 Bill
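The code above takes the first line of each User_id/Date group as the player, so it relies on there being one player per date. If a date could list several players, one option is to tell players and games apart by indentation depth instead. The following base R sketch is only an illustration and rests on assumptions that go beyond the question: among the lines that are neither an ID nor a date, the least-indented ones are taken to be players and the deeper ones games, and the helper names (`indent`, `fill`, `games`) are invented for this example.

```r
# Second sample from the question, inlined so the sketch runs without a file
raw <- c("452924301037", "    5May2014", "       John",
         "           Cricket", "           Football",
         "    7May2014", "       Mark", "           Hockey",
         "452924302789", "     6May2014", "       Bill",
         "           Billiards")

# Number of leading whitespace characters per line
indent <- nchar(raw) - nchar(trimws(raw, which = "left"))
txt <- trimws(raw)

is_id   <- grepl("^\\d{8,}$", txt)
is_date <- grepl("^\\d{1,2}[A-Za-z]{3}\\d{4}$", txt)
rest    <- !is_id & !is_date

# Assumption: players are the least-indented remaining lines, games are deeper
is_player <- rest & indent == min(indent[rest])
is_game   <- rest & !is_player

# Forward fill the most recent flagged line down to every later line
fill <- function(flag) c(NA, txt[flag])[cumsum(flag) + 1]

games <- data.frame(Game   = txt[is_game],
                    Player = fill(is_player)[is_game],
                    Date   = fill(is_date)[is_game],
                    UserID = fill(is_id)[is_game],
                    stringsAsFactors = FALSE)
print(games)
```

The columns are in the order asked for in the question (Game, Player, Date, UserID), and the rows stay in file order because no grouping step reorders them.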
