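I am trying to write a program in R that creates a table from a .csv file; the table will be 1856 x 9 entries. This part works. I then want to loop through every cell of the table, starting at the top-right corner and working along the row, then dropping down to the next row and doing the same.
If a row is all zeros, or looks like 1 1 1 0 0 0 or similar, I want to delete it. In other words, if the row has all of its non-zero values first and only zeros to the right of them, delete it.
If there is a non-zero value in a cell to the right of a cell holding a zero, I want the row kept in the table.
Example:
After my code runs, I want to keep only rows 1, 2, 3, and 7.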
Answer 0: (score: 2)
You can use apply instead of a loop:
# recreate your example
DF <-
read.csv(
text="Company.Name,Seed,Series.A,Series.B,Series.C,Series.D,Series.E,Series.F,Series.G,Series.H
Aetion,0,1,0,0,0,0,0,0,0
Aspier Healt,1,0,1,0,0,0,0,0,0
Evariant,0,1,1,2,0,0,0,0,0
iHealth,0,0,0,0,0,0,0,0,0
Inuition Robotics,0,0,0,0,0,0,0,0,0
Kali Care,0,0,0,0,0,0,0,0,0
Network Locum,0,0,1,0,0,0,0,0,0
"
)
# What the next line does:
# - for each row of DF, excluding the first column (DF[,-1]),
# - take the row without its last value, x[-length(x)], and the
#   row without its first value, x[-1]
# - build a logical vector that is TRUE where x[-length(x)] == 0 AND x[-1] != 0,
#   i.e. wherever a zero is immediately followed by a non-zero value
# - if any() of those is TRUE, the condition is met for that row
# rowCondition holds TRUE for rows that meet the condition and FALSE otherwise
rowCondition <- apply(DF[,-1],1,function(x) any(x[-length(x)] == 0 & x[-1] != 0))
# we use the condition to filter the necessary rows
subsetDF <- DF[rowCondition,]
> subsetDF
   Company.Name Seed Series.A Series.B Series.C Series.D Series.E Series.F Series.G Series.H
1        Aetion    0        1        0        0        0        0        0        0        0
2  Aspier Healt    1        0        1        0        0        0        0        0        0
3      Evariant    0        1        1        2        0        0        0        0        0
7 Network Locum    0        0        1        0        0        0        0        0        0
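To see why the shifted comparison picks out these rows, here is a minimal sketch on a single row (the values of the Aspier Healt row, typed in by hand). The variable names left, right and zero_then_nonzero are only illustrative.

```r
# One row from the example (Aspier Healt), without the company name
x <- c(1, 0, 1, 0, 0, 0, 0, 0, 0)

left  <- x[-length(x)]  # every value except the last
right <- x[-1]          # every value except the first

# Position i is TRUE when x[i] is zero and x[i + 1] is non-zero
zero_then_nonzero <- left == 0 & right != 0

which(zero_then_nonzero)  # position(s) where a zero is followed by a non-zero
any(zero_then_nonzero)    # the per-row condition that apply() evaluates
```

Here the only hit is at position 2 (the 0 followed by the second 1), so any() returns TRUE and the row is kept.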
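Answer 1: (score: 1)
Since you are looking for any row in which a 0 is followed by a non-zero character, you can do this with a regular expression. The grepl function returns a TRUE/FALSE vector according to whether the specified pattern matches: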
examples <- c("100", "000", "001")
grepl(pattern = "0[1-9]", x = examples)
## [1] FALSE FALSE TRUE
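This regular expression explicitly looks for a digit 1-9 after a zero; if you want to accept any character other than zero after the zero, use pattern = "0[^0]" instead.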
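As a quick comparison of the two patterns, the sketch below runs both on the same strings. The fourth string, which contains a letter, is an illustrative value added for this comparison.

```r
examples <- c("100", "000", "001", "0000m0000")

# Zero followed by a digit 1-9 only
strict <- grepl(pattern = "0[1-9]", x = examples)

# Zero followed by any character that is not a zero (digits or letters)
lenient <- grepl(pattern = "0[^0]", x = examples)

data.frame(examples, strict, lenient)
```

The two patterns agree on the purely numeric strings and differ only on the last one, where "0m" satisfies "0[^0]" but not "0[1-9]".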
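Using the dplyr library, which is loaded by calling library("tidyverse"), it is very simple to concatenate the columns of interest and then apply our regular expression to this new column.
First, save the following as example_data.csv: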
Company.Name,Seed,Series.A,Series.B,Series.C,Series.D,Series.E,Series.F,Series.G,Series.H
Aetion,0,1,0,0,0,0,0,0,0
Aspier Healt,1,0,1,0,0,0,0,0,0
Evariant,0,1,1,2,0,0,0,0,0
iHealth,0,0,0,0,0,0,0,0,0
Inuition Robotics,0,0,0,0,0,0,0,0,0
Kali Care,0,0,0,0,0,0,0,0,0
Network Locum,0,0,1,0,0,0,0,0,0
Martin Company,0,0,0,0,0,0,0,0,1
Other Company,1,1,1,2,1,3,6,7,9
Weird Company,0,0,0,0,m,0,0,0,0
Then import the data with read_csv:
library("tidyverse")
example_data <- read_csv("example_data.csv")
Now let's create a new column that holds the concatenation of each row's Seed:Series.H values:
example_data <- example_data %>%
  mutate(test_col = paste0(Seed,
                           Series.A,
                           Series.B,
                           Series.C,
                           Series.D,
                           Series.E,
                           Series.F,
                           Series.G,
                           Series.H))
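One caveat worth checking against your own data: pasting the values together with no separator loses the cell boundaries, so a multi-digit count such as 10 contributes a "0" that can look like a zero cell. The sketch below uses made-up rows (not the example data) to show the effect, along with one possible workaround: paste with a separator and anchor the pattern on it. The separator and the perl = TRUE lookahead pattern are assumptions of this sketch, not part of the approach described here.

```r
# Hypothetical rows, not taken from the example data
rows <- data.frame(a = c(10, 0, 0), b = c(5, 10, 0), c = c(3, 2, 0))

# No separator: row 1 becomes "1053", and the "05" inside it looks like
# a zero cell followed by a non-zero cell even though no cell is zero
no_sep <- do.call(paste0, rows)
no_sep_match <- grepl("0[1-9]", no_sep)

# With a separator, a zero cell is exactly "0" at the start or after a comma;
# the lookahead rejects a following cell that is exactly "0"
with_sep <- do.call(paste, c(rows, sep = ","))
with_sep_match <- grepl("(^|,)0,(?!0(,|$))", with_sep, perl = TRUE)

data.frame(no_sep, no_sep_match, with_sep, with_sep_match)
```

If every value in your table is a single digit, as in the example rows, the separator-free version gives the same answer and is simpler.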
Let's look at the new column's value for the first row:
example_data %>%
  select(test_col) %>%
  slice(1)
## 010000000
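Good! There is a non-zero character to the right of a zero, so this row should be included in the output.
We can use the mutate verb to apply the grepl test to every row, storing the result in a new column named include. Let's print the whole column to see which rows meet your condition:
example_data %>%
  mutate(include = grepl("0[1-9]", test_col)) %>%
  select(include)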
## output
# A tibble: 10 x 1
include
<lgl>
1 TRUE
2 TRUE
3 TRUE
4 FALSE
5 FALSE
6 FALSE
7 TRUE
8 TRUE
9 FALSE
10 FALSE
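To keep only the rows where the condition is true, we use the filter verb:
example_data %>%
  mutate(include = grepl("0[1-9]", test_col)) %>%
  filter(include)
Of course, we now have two columns in the data that you don't want! So let's write the whole thing concisely:
example_data %>%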
  mutate(test_col = paste0(Seed,
                           Series.A,
                           Series.B,
                           Series.C,
                           Series.D,
                           Series.E,
                           Series.F,
                           Series.G,
                           Series.H),
         include = grepl("0[1-9]", test_col)) %>%
  filter(include) %>%
  select(-include, -test_col)
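If you would rather not load the tidyverse, the same idea can be sketched in base R. This is only an illustration under a few assumptions: the ten example rows are read inline with read.csv(text = ...) instead of from example_data.csv, and the helper names value_cols, test_col and kept are made up for the sketch.

```r
# The ten example rows, read inline so the sketch is self-contained
example_data <- read.csv(text = "Company.Name,Seed,Series.A,Series.B,Series.C,Series.D,Series.E,Series.F,Series.G,Series.H
Aetion,0,1,0,0,0,0,0,0,0
Aspier Healt,1,0,1,0,0,0,0,0,0
Evariant,0,1,1,2,0,0,0,0,0
iHealth,0,0,0,0,0,0,0,0,0
Inuition Robotics,0,0,0,0,0,0,0,0,0
Kali Care,0,0,0,0,0,0,0,0,0
Network Locum,0,0,1,0,0,0,0,0,0
Martin Company,0,0,0,0,0,0,0,0,1
Other Company,1,1,1,2,1,3,6,7,9
Weird Company,0,0,0,0,m,0,0,0,0", stringsAsFactors = FALSE)

# Every column except the company name
value_cols <- setdiff(names(example_data), "Company.Name")

# Concatenate the value columns row by row, e.g. "010000000" for row 1
test_col <- do.call(paste0, example_data[value_cols])

# Keep rows where a zero is followed by a digit 1-9
kept <- example_data[grepl("0[1-9]", test_col), ]
kept$Company.Name
```

The temporary test_col vector never becomes a column of the data frame, so there is nothing to drop afterwards.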