I know of some R packages for text mining, such as tm, but I have not been able to use them for my task. I have a text file with data like this:
452924301037
5May2014
John
7May2014
Mark
Sam
452924302789
6May2014
Bill
I want the data above in a data frame like this:
UserID, Date, Names
452924301037,5May2014,John
452924301037,7May2014,Mark Sam
452924302789,6May2014,Bill
How can I do this in R?
Example 2:
Input text file:
452924301037
5May2014
John
Cricket
Football
7May2014
Mark
Hockey
452924302789
6May2014
Bill
Billiards
I would like to build a data frame as follows:
Game, Player, Date, UserID
Cricket, John, 5May2014, 452924301037
Football, John, 5May2014, 452924301037
Hockey, Mark, 7May2014, 452924301037
Billiards, Bill, 6May2014, 452924302789
Answer 0 (score: 2)
A possible solution using data.table and zoo:
# load the needed packages
library(zoo)
library(data.table)
# read the text file, one element per line
txt <- readLines('textlines.txt')
# convert the text to a data.table (an enhanced form of a data.frame)
DT <- data.table(txt = trimws(txt))
# mark user IDs (8 or more digits) and dates (e.g. 5May2014), carry both forward,
# then drop the ID/date rows and collapse the remaining names per group
DT[grepl('^\\d{8,}$', txt), User_id := txt
 ][grepl('^\\d{1,2}[A-Za-z]{3}\\d{4}$', txt), Date := txt
 ][, c('User_id','Date') := lapply(.SD, na.locf, na.rm = FALSE), .SDcols = c('User_id','Date')
 ][txt != User_id & txt != Date, .(Names = paste0(txt, collapse = ' ')), by = .(User_id, Date)]
which gives:
        User_id     Date    Names
1: 452924301037 5May2014     John
2: 452924301037 7May2014 Mark Sam
3: 452924302789 6May2014     Bill
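To try the approach without first creating textlines.txt, the sample lines from the question can be read from an in-memory string. This is only a convenience sketch; the name sample_text is made up for illustration, and the rest of the answer is unchanged.

```r
# sample input from the question, held in a string instead of a file
sample_text <- "452924301037
5May2014
John
7May2014
Mark
Sam
452924302789
6May2014
Bill"

# readLines() accepts a connection, so a textConnection can stand in for the file
con <- textConnection(sample_text)
txt <- readLines(con)
close(con)

length(txt)  # 9 lines, one element per input line
```

From here, txt can be passed to data.table(txt = txt) exactly as in the file-based version.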
To see what each step does, run the following code:
# extract the user IDs: lines consisting of 8 or more digits
DT[grepl('^\\d{8,}$', txt), User_id := txt][]
# extract the dates: day, three-letter month, four-digit year (e.g. 5May2014)
DT[grepl('^\\d{1,2}[A-Za-z]{3}\\d{4}$', txt), Date := txt][]
# fill the NA values of 'User_id' and 'Date' with na.locf from the zoo package
DT[, c('User_id','Date') := lapply(.SD, na.locf, na.rm = FALSE), .SDcols = c('User_id','Date')][]
# filter out the rows where the 'txt' column holds a 'User_id' or a 'Date', then
# collapse the remaining names into one string by 'User_id' and 'Date'
DT[txt != User_id & txt != Date, .(Names = paste0(txt, collapse = ' ')), by = .(User_id, Date)][]
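The fill step relies on na.locf ("last observation carried forward"). A minimal sketch on a toy vector, assuming only that zoo is installed, shows what it does in isolation; the vector ids is invented for illustration.

```r
library(zoo)

# toy column: each ID is followed by NA placeholders for the lines that belong to it
ids <- c("452924301037", NA, NA, "452924302789", NA)

# carry the last non-NA value forward; na.rm = FALSE keeps the vector length
# unchanged even if the vector were to start with NA
filled <- na.locf(ids, na.rm = FALSE)
print(filled)
```

Each NA is replaced by the nearest non-NA value above it, which is how every name row ends up tagged with its user ID and date.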
For the second example that was added, you could do:
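# strip surrounding whitespace so the patterns and comparisons below match cleanly
DT <- data.table(txt = trimws(txt))
DT[grepl('^\\d{8,}$', txt), User_id := txt
 ][grepl('^\\d{1,2}[A-Za-z]{3}\\d{4}$', txt), Date := txt
 ][, c('User_id','Date') := lapply(.SD, na.locf, na.rm = FALSE), .SDcols = c('User_id','Date')
 ][txt != User_id & txt != Date            # keep only the name and game rows
 ][, Name := txt[1], by = .(User_id, Date) # the first row of each group is the player name
 ][Name != txt]                            # drop the name row itself, leaving the games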
which gives:
txt User_id Date Name
1: Cricket 452924301037 5May2014 John
2: Football 452924301037 5May2014 John
3: Hockey 452924301037 7May2014 Mark
4: Billiards 452924302789 6May2014 Bill
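The result above has the right content but not the column names and order asked for in the question (Game, Player, Date, UserID). One way to get there, sketched here on a hand-built copy of the result (the object res is a stand-in, not part of the original answer), is data.table's setnames and setcolorder:

```r
library(data.table)

# stand-in for the result printed above
res <- data.table(
  txt     = c("Cricket", "Football", "Hockey", "Billiards"),
  User_id = c("452924301037", "452924301037", "452924301037", "452924302789"),
  Date    = c("5May2014", "5May2014", "7May2014", "6May2014"),
  Name    = c("John", "John", "Mark", "Bill")
)

# rename by reference: old names first, new names second
setnames(res, c("txt", "Name", "User_id"), c("Game", "Player", "UserID"))

# reorder the columns by reference to match the requested layout
setcolorder(res, c("Game", "Player", "Date", "UserID"))

print(res)
```

Both functions modify the table in place, so no reassignment is needed.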