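I have data stored in a matrix, but it still contains a lot of unnecessary information (a side effect of extracting the data from mhtml files). I would like to "filter" this out and "collapse" the matrix (so that there are no empty cells between the data), so that after saving it to a spreadsheet I don't need to clean it up any further (which is very handy when you have to do this for 400+ files).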
However, the only way I know is to use gsub and remove the unwanted content before the matrix is generated.
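But since I only need specific blocks of the matrix and know where they are (I can use which to find a specific cell that sits one row before the block I need), I was wondering whether it is possible to copy out specific blocks of data when you know where they start and end (the blocks have a fixed size).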
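So, does anybody know a way to copy multiple specific regions of a matrix into a single, different matrix when you know the cell where each block starts and the blocks have a fixed size (in columns and rows)?
I have a feeling that I'm overlooking something, because it sounds so simple.
Edit: silly me, I forgot a data example (hope it works):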
dput(var_table[1:20,1:6])
structure(c("coration:none", "", "Zeit", "kV", "-------------------------------------------------------",
"1", "2", "3", "4", "5", "6", "7", "8", "", "Phase", "Datum/Zeit",
"Stufe", "tan-delta-Mittelwert", "Standardabweichung", "Anzahl",
"color:000000\">Details:", NA, "Spannung", "mA", NA, "12:54:09",
"12:54:19", "12:54:30", "12:54:39", "12:54:49", "12:55:00", "12:55:10",
"12:55:20", NA, ".......................", "..................",
".......................", "........", "..........", "der", NA,
NA, "Strom", "E-3", NA, "5.8", "5.8", "5.8", "5.8", "5.8", "5.8",
"5.8", "5.8", NA, ":", ":", ":", ":", ":", "Messungen", NA, NA,
"tan", NA, NA, "3.07", "3.07", "3.07", "3.07", "3.07", "3.07",
"3.07", "3.07", NA, "L1", "29-09-2015", "1", "0.343", "0.001",
"........", NA, NA, "delta", NA, NA, "0.34", "0.34", "0.34",
"0.34", "0.34", "0.34", "0.34", "0.34", NA, NA, "12:55:20", NA,
"E-3", "E-3", ":", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, "8"), .Dim = c(20L, 6L))
Only the data block in [6:13,1:5] is needed.
Second data segment, same file:
structure(c("Phase", "Datum/Zeit", "Stufe", "tan-delta-Mittelwert",
"Standardabweichung", "Anzahl", "Last", "Prfcfobjekt", "Generator",
"", "", "Zeit", "kV", "-------------------------------------------------------",
"1", "2", "3", "4", "5", "6", "7", "8", "", "Phase", "Datum/Zeit",
"Stufe", "tan-delta-Mittelwert", "Standardabweichung", "Anzahl",
"Last", "Prfcfobjekt", "Generator", ".......................",
"..................", ".......................", "........",
"..........", "der", "........................", "VSE-Strom",
"VSE-Strom", NA, NA, "Spannung", "mA", NA, "12:56:40", "12:56:50",
"12:57:00", "12:57:10", "12:57:21", "12:57:31", "12:57:41", "12:57:51",
NA, ".......................", "..................", ".......................",
"........", "..........", "der", "........................",
"VSE-Strom", "VSE-Strom", ":", ":", ":", ":", ":", "Messungen",
":", "........", ".........", NA, NA, "Strom", "E-3", NA, "11.7",
"11.7", "11.7", "11.7", "11.7", "11.7", "11.7", "11.7", NA, ":",
":", ":", ":", ":", "Messungen", ":", "........", ".........",
"L1", "29-09-2015", "1", "0.343", "0.001", "........", "847.6",
":", ":", NA, NA, "tan", NA, NA, "6.18", "6.18", "6.18", "6.18",
"6.18", "6.18", "6.18", "6.19", NA, "L1", "29-09-2015", "2",
"0.355", "0.001", "........", "843.2", ":", ":", NA, "12:55:20",
NA, "E-3", "E-3", ":", "nF", "32", "2", NA, NA, "delta", NA,
NA, "0.35", "0.35", "0.35", "0.36", "0.36", "0.36", "0.36", "0.36",
NA, NA, "12:57:52", NA, "E-3", "E-3", ":", "nF", "66", "6", NA,
NA, NA, NA, NA, "8", NA, "b5A", "b5A", NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "8", NA,
"b5A", "b5A"), .Dim = c(32L, 6L))
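Here, I only need the "Phase" parts (aka [15:4] and [38:4]). Does anybody have an idea?
Answer 0 (score: 0)
Based on your sample data, an approach like this would work.
Code: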
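# df1 is assumed to hold the sample data from the question
# Keep rows whose cells are all either NA or made up only of digits, ':' and '.'
keepRows <- apply(df1, 1, function(x) {all(grepl("^(\\d|[:.])+$", x) | is.na(x))})
df2 <- df1[keepRows, ]
# Drop columns that are entirely NA after the row filter
keepCols <- apply(df2, 2, function(x) {!all(is.na(x))})
df2[, keepCols]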
Result:
#     [,1] [,2]       [,3]  [,4]   [,5]
#[1,] "1"  "12:54:09" "5.8" "3.07" "0.34"
#[2,] "2"  "12:54:19" "5.8" "3.07" "0.34"
#[3,] "3"  "12:54:30" "5.8" "3.07" "0.34"
#[4,] "4"  "12:54:39" "5.8" "3.07" "0.34"
#[5,] "5"  "12:54:49" "5.8" "3.07" "0.34"
#[6,] "6"  "12:55:00" "5.8" "3.07" "0.34"
#[7,] "7"  "12:55:10" "5.8" "3.07" "0.34"
#[8,] "8"  "12:55:20" "5.8" "3.07" "0.34"
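Please note:
Your matrix stores everything as strings, including values that are really integer or numeric. These are 2/3 different data types, and a matrix can hold only one. So, as a first step, I would convert it with as.data.frame().
The code above keeps only rows made up of [NA, NUMBERS, : , . ]. This may not be generic enough for your real data.
Look at ?all, ?apply, ?grepl, ... and read!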
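If the blocks really do have a fixed size and a known anchor, copying them out by position may be simpler than filtering by content. The following is only a minimal sketch: the helper name extract_blocks, its parameters, and the toy matrix are made up for illustration, and it assumes each block starts one row below a cell that matches a known pattern (in the sample data, the dashed separator line in column 1).

```r
# Hypothetical helper (not from the original answer): copy fixed-size blocks
# out of a character matrix and stack them into one matrix.
# Assumption: each block starts one row below a cell in `anchor_col`
# that matches `anchor_pattern`.
extract_blocks <- function(m, anchor_pattern, n_rows, n_cols, anchor_col = 1) {
  # Row indices of the anchor cells (grepl returns FALSE for NA)
  starts <- which(grepl(anchor_pattern, m[, anchor_col]))
  # Cut one n_rows x n_cols block below each anchor
  blocks <- lapply(starts, function(s) {
    rows <- (s + 1):(s + n_rows)
    m[rows, seq_len(n_cols), drop = FALSE]
  })
  # Stack all blocks on top of each other
  do.call(rbind, blocks)
}

# Made-up toy matrix: two 2x2 blocks, each preceded by a "---" row
m <- matrix(c("junk", "---", "1", "2", "",  "---", "3", "4",
              NA,     NA,    "a", "b", NA,  NA,    "c", "d",
              "x",    NA,    NA,  NA,  "y", NA,    NA,  NA),
            ncol = 3)

res <- extract_blocks(m, "^-+$", n_rows = 2, n_cols = 2)
print(res)
```

For the sample data in the question this would presumably be called with n_rows = 8 and n_cols = 5. A block that runs past the end of the matrix raises a subscript error, so real files may need a bounds check first.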