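How do I use ldply to convert a .txt file (about 1 million lines) into a usable data frame?

Asked: 2015-03-23 19:36:35

Tags: r function plyr

I have a very large .txt file that contains string and numeric values with no standard delimiter. It looks like this: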

MIO Data Packet:
Event Node:099123910e373b4a9c59114ee9e6d83c
    TrasducerValue:
        Name: Thermometer Digital
        ID: 0
        Raw Value: 138
        Typed Value: 13.800000
        Timestamp: 2015-03-18T09:22:59.703168-0500
    TrasducerValue:
        Name: Thermometer Analog
        ID: 0
        Raw Value: 550
        Typed Value: 13.350000
        Timestamp: 2015-03-18T09:22:59.703168-0500
    TrasducerValue:
        Name: RSSI
        ID: 0
        Raw Value: 12
        Typed Value: 12.000000
        Timestamp: 2015-03-18T09:22:59.703168-0500
    TrasducerValue:
        Name: Ping
        ID: 0
        Raw Value: 0
        Typed Value: 0.000000
        Timestamp: 2015-03-18T09:22:59.703168-0500
    TrasducerValue:
        Name: Motion Sensor
        ID: 0
        Raw Value: 0
        Typed Value: 0.000000
        Timestamp: 2015-03-18T09:22:59.703168-0500
    TrasducerValue:
        Name: Microphone
        ID: 0
        Raw Value: 82
        Typed Value: 82.000000
        Timestamp: 2015-03-18T09:22:59.703168-0500
    TrasducerValue:
        Name: Light Meter
        ID: 0
        Raw Value: 1023
        Typed Value: 0.000000
        Timestamp: 2015-03-18T09:22:59.703168-0500
    TrasducerValue:
        Name: Humidity Sensor
        ID: 0
        Raw Value: 158
        Typed Value: 46.666668
        Timestamp: 2015-03-18T09:22:59.703168-0500
    TrasducerValue:
        Name: Battery Level
        ID: 0
        Raw Value: 267
        Typed Value: 2.670000
        Timestamp: 2015-03-18T09:22:59.703168-0500
    TrasducerValue:
        Name: Barometer
        ID: 0
        Raw Value: 99103
        Typed Value: 99103.000000
        Timestamp: 2015-03-18T09:22:59.703168-0500
    TrasducerValue:
        Name: Accelerometer Z
        ID: 0
        Raw Value: 563
        Typed Value: 0.396364
        Timestamp: 2015-03-18T09:22:59.703168-0500
    TrasducerValue:
        Name: Accelerometer Y
        ID: 0
        Raw Value: 606
        Typed Value: 8.269162
        Timestamp: 2015-03-18T09:22:59.703168-0500
    TrasducerValue:
        Name: Accelerometer X
        ID: 0
        Raw Value: 507
        Typed Value: 1.181309
        Timestamp: 2015-03-18T09:22:59.703168-0500

I have started with:

library("stringr")
library("plyr")
dat = readLines("03181023.txt")

I have a feeling that the command I need is

x = ldply(dat, .fun)

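but I know very little about writing functions, so I am a bit lost on how to use ldply() correctly.

This is what I would like the data to look like when I am done (with the remaining values filled in, of course):

Name                 ID  Raw Value  Typed Value  Timestamp
Thermometer Digital  0   138        13.80000     2015-03-18T09:22:59.703168-0500
Thermometer Analog
RSSI
Ping
Motion Sensor
Microphone
Light Meter
Humidity Sensor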

Thanks for any suggestions!

1 answer:

Answer 0: (score: 0)

I drafted the function below using information from "Extracting decimal numbers from a string" and "Extracting Data from Text Files".

txtconvert <- function(file)
{
  tmp <- readLines(file) # use readLines to read in the .txt file
  # keep only the lines that start with one of the five field names
  tmp <- grep("^\\s*(Name|ID|Raw Value|Typed Value|Timestamp): ", tmp,
              value = TRUE)
  tmp <- sub("^\\s+", "", tmp) # remove the spaces at the beginning
  tmp <- sub(": ", "\t", tmp)  # replace the first ": " with a tab to make
                               # tmp readable by read.table

  # Name
  name <- grep("^Name\t", tmp, value = TRUE) # collect all Name lines together
  name <- read.table(text = name, sep = "\t", quote = "", comment.char = "",
                     stringsAsFactors = FALSE) # read the lines as a table
  names(name)[2] <- "Name" # change the column name
  name[1] <- NULL # remove the 1st column

  # ID
  ID <- grep("^ID\t", tmp, value = TRUE) # collect all ID lines together
  ID <- read.table(text = ID, sep = "\t", quote = "", comment.char = "",
                   stringsAsFactors = FALSE) # read the lines as a table
  names(ID)[2] <- "ID" # change the column name
  ID[1] <- NULL # remove the 1st column

  # Raw Value
  raw <- grep("^Raw Value\t", tmp, value = TRUE) # collect all Raw Value lines
  raw <- read.table(text = raw, sep = "\t", quote = "", comment.char = "",
                    stringsAsFactors = FALSE) # read the lines as a table
  names(raw)[2] <- "Raw Value" # change the column name
  raw[1] <- NULL # remove the 1st column

  # Typed Value
  type <- grep("^Typed Value\t", tmp, value = TRUE) # collect all Typed Value lines
  type <- read.table(text = type, sep = "\t", quote = "", comment.char = "",
                     stringsAsFactors = FALSE) # read the lines as a table
  names(type)[2] <- "Typed Value" # change the column name
  type[1] <- NULL # remove the 1st column

  # Timestamp
  time <- grep("^Timestamp\t", tmp, value = TRUE) # collect all Timestamp lines
  time <- read.table(text = time, sep = "\t", quote = "", comment.char = "",
                     stringsAsFactors = FALSE) # read the lines as a table
  names(time)[2] <- "Timestamp" # change the column name
  time[1] <- NULL # remove the 1st column

  tmp <- data.frame(name, ID, raw, type, time) # combine into a single data.frame
  # data.frame() turns the spaces into dots, so restore the two affected names
  names(tmp)[3:4] <- c("Raw Value", "Typed Value")
  return(tmp)
}

This function does not use ldply, but it still gives you the data.frame you want.
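For comparison, here is a rough sketch of how an ldply-style approach could look. ldply applies a function to each element of a list and row-binds the results, so the lines first have to be grouped into one element per record. The helper name parse_record, the sample vector dat, and the grouping on "TrasducerValue:" lines are assumptions made for illustration and are not part of the function above. The sketch falls back to base R when plyr is not installed.

```r
# Sample input: two records in the format shown in the question.
dat <- c(
  "MIO Data Packet:",
  "Event Node:099123910e373b4a9c59114ee9e6d83c",
  "    TrasducerValue:",
  "        Name: Thermometer Digital",
  "        ID: 0",
  "        Raw Value: 138",
  "        Typed Value: 13.800000",
  "        Timestamp: 2015-03-18T09:22:59.703168-0500",
  "    TrasducerValue:",
  "        Name: Thermometer Analog",
  "        ID: 0",
  "        Raw Value: 550",
  "        Typed Value: 13.350000",
  "        Timestamp: 2015-03-18T09:22:59.703168-0500"
)

# Hypothetical helper: turn the lines of one record into a one-row data.frame.
parse_record <- function(rec) {
  get_field <- function(field) {
    # first line of the record that starts with "<field>: "
    hit <- grep(paste0("^\\s*", field, ": "), rec, value = TRUE)
    sub("^[^:]*: ", "", hit[1]) # drop everything up to the first ": "
  }
  data.frame(
    Name          = get_field("Name"),
    ID            = as.integer(get_field("ID")),
    `Raw Value`   = as.integer(get_field("Raw Value")),
    `Typed Value` = as.numeric(get_field("Typed Value")),
    Timestamp     = get_field("Timestamp"),
    check.names = FALSE,
    stringsAsFactors = FALSE
  )
}

# Start a new group at every "TrasducerValue:" line. Group 0 holds the lines
# before the first record (the packet header) and is dropped.
group <- cumsum(grepl("TrasducerValue:", dat, fixed = TRUE))
records <- unname(split(dat[group > 0], group[group > 0]))

if (requireNamespace("plyr", quietly = TRUE)) {
  x <- plyr::ldply(records, parse_record)
} else {
  # base R equivalent: apply to each record, then row-bind
  x <- do.call(rbind, lapply(records, parse_record))
}
x
```

Building one small data.frame per record is easy to follow but is likely to be slow on a file with about a million lines, which is one reason the function above works on whole columns instead.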


dataout <- txtconvert("data.txt") # data.txt contains all of the data 
# that you provided in your initial question
dataout

Here is dataout:

#                   Name ID Raw Value  Typed Value                       Timestamp
# 1  Thermometer Digital  0       138    13.800000 2015-03-18T09:22:59.703168-0500
# 2   Thermometer Analog  0       550    13.350000 2015-03-18T09:22:59.703168-0500
# 3                 RSSI  0        12    12.000000 2015-03-18T09:22:59.703168-0500
# 4                 Ping  0         0     0.000000 2015-03-18T09:22:59.703168-0500
# 5        Motion Sensor  0         0     0.000000 2015-03-18T09:22:59.703168-0500
# 6           Microphone  0        82    82.000000 2015-03-18T09:22:59.703168-0500
# 7          Light Meter  0      1023     0.000000 2015-03-18T09:22:59.703168-0500
# 8      Humidity Sensor  0       158    46.666668 2015-03-18T09:22:59.703168-0500
# 9        Battery Level  0       267     2.670000 2015-03-18T09:22:59.703168-0500
# 10           Barometer  0     99103 99103.000000 2015-03-18T09:22:59.703168-0500
# 11     Accelerometer Z  0       563     0.396364 2015-03-18T09:22:59.703168-0500
# 12     Accelerometer Y  0       606     8.269162 2015-03-18T09:22:59.703168-0500
# 13     Accelerometer X  0       507     1.181309 2015-03-18T09:22:59.703168-0500


dataout <- structure(list(Name = c("Thermometer Digital", "Thermometer Analog",
"RSSI", "Ping", "Motion Sensor", "Microphone", "Light Meter",
"Humidity Sensor", "Battery Level", "Barometer", "Accelerometer Z",
"Accelerometer Y", "Accelerometer X"), ID = c(0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), `Raw Value` = c(138L, 550L,
12L, 0L, 0L, 82L, 1023L, 158L, 267L, 99103L, 563L, 606L, 507L),
`Typed Value` = c(13.8, 13.35, 12, 0, 0, 82, 0, 46.666668, 2.67, 99103,
0.396364, 8.269162, 1.181309), Timestamp = c(
"2015-03-18T09:22:59.703168-0500", "2015-03-18T09:22:59.703168-0500",
"2015-03-18T09:22:59.703168-0500", "2015-03-18T09:22:59.703168-0500",
"2015-03-18T09:22:59.703168-0500", "2015-03-18T09:22:59.703168-0500",
"2015-03-18T09:22:59.703168-0500", "2015-03-18T09:22:59.703168-0500",
"2015-03-18T09:22:59.703168-0500", "2015-03-18T09:22:59.703168-0500",
"2015-03-18T09:22:59.703168-0500", "2015-03-18T09:22:59.703168-0500",
"2015-03-18T09:22:59.703168-0500")), .Names = c("Name", "ID", "Raw Value",
"Typed Value", "Timestamp"), row.names = c(NA, -13L), class = "data.frame")
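
Since the question mentions about a million lines, a more compact, fully vectorised variant may be worth trying. The sketch below is only one possible approach and rests on an assumption the sample data appears to satisfy: every record contains the same five fields in the same order. The function name parse_fast and the sample_lines vector are hypothetical and used only for illustration.

```r
# Hypothetical helper: parse all records at once, without per-field tables.
# Assumes every record lists Name, ID, Raw Value, Typed Value, Timestamp
# in exactly this order.
parse_fast <- function(lines) {
  fields <- c("Name", "ID", "Raw Value", "Typed Value", "Timestamp")
  pattern <- paste0("^\\s*(", paste(fields, collapse = "|"), "): ")

  keep <- grep(pattern, lines, value = TRUE)          # field lines only
  found <- sub(paste0(pattern, ".*$"), "\\1", keep)   # field name on each line

  # Stop early if the fixed-order assumption does not hold.
  stopifnot(length(keep) %% length(fields) == 0,
            all(found == rep(fields, length.out = length(keep))))

  values <- sub(pattern, "", keep)                    # strip "<field>: "
  m <- matrix(values, ncol = length(fields), byrow = TRUE) # one row per record

  data.frame(
    Name          = m[, 1],
    ID            = as.integer(m[, 2]),
    `Raw Value`   = as.integer(m[, 3]),
    `Typed Value` = as.numeric(m[, 4]),
    Timestamp     = m[, 5],
    check.names = FALSE,
    stringsAsFactors = FALSE
  )
}

# Sample input: two records in the format shown in the question.
sample_lines <- c(
  "MIO Data Packet:",
  "Event Node:099123910e373b4a9c59114ee9e6d83c",
  "    TrasducerValue:",
  "        Name: Thermometer Digital",
  "        ID: 0",
  "        Raw Value: 138",
  "        Typed Value: 13.800000",
  "        Timestamp: 2015-03-18T09:22:59.703168-0500",
  "    TrasducerValue:",
  "        Name: Thermometer Analog",
  "        ID: 0",
  "        Raw Value: 550",
  "        Typed Value: 13.350000",
  "        Timestamp: 2015-03-18T09:22:59.703168-0500"
)

parsed <- parse_fast(sample_lines)
parsed
```

On the real file this would be called as parse_fast(readLines("03181023.txt")). If some records are missing a field, the stopifnot() check fails instead of silently misaligning the columns.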