I have an Excel sheet that looks like this:
Col1    Col2
IJ-123  A2B1
IJ-123  A2B1
IJ-456  C2C2
IJ-456  c2c2
IJ-456  D1e2
IJ-789  LJ87
IJ-456
IJ-789  LJ98
x = data.frame(
  Col1 = c("IJ-123", "IJ-123", "IJ-456", "IJ-456",
           "IJ-456", "IJ-789", "IJ-456", "IJ-789"),
  Col2 = c("A2B1", "A2B1", "C2C2", "c2c2",
           "D1e2", "LJ87", NA, "LJ98")
)
I want to add one more column and check (for each unique Col2 value) whether the mapping to the value specified in Col1 is TRUE or FALSE.
Output:
Col1    Col2   Result
IJ-123  A2B1   TRUE
IJ-123  A2B1   TRUE
IJ-456  C2C2   TRUE
IJ-456  c2c2   TRUE
IJ-456  D1e2   FALSE
IJ-789  LJ87   TRUE    (because Col2 count = 1 for this value)
IJ-456         C2C2
IJ-789  LJ98   TRUE    (because Col2 count = 1 for this value)
Logic:
Col2 values must be compared case-insensitively (upper- and lowercase spellings count as the same value). Also, some fields in Col2 are blank; for those rows, Result shows the Col2 value mapped to that Col1 (see row 7).
For this I have an Excel formula:
=IF(COUNTIF($B$2:$B$8,B2)=1,SUMPRODUCT(--(($A$2:$A$8=A2)*(COUNTIF($B$2:$B$8,$B$2:$B$8))>1))=0,COUNTIFS($B$2:$B$8,B2,$A$2:$A$8,"<>"&A2)=0)
but it works very slowly: after waiting ~4 hours it had completed only 28% of the ~200k records. I have loaded the file into R and would like to do the same exercise there to speed up the processing.
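For reference, the TRUE/FALSE part of this logic can be sketched in base R without any packages. This is a minimal sketch of my reading of the rules above; it does not fill in the blank-Col2 row, and the helper names `cnt`, `n_col1`, and `multi_col1` are my own:

```r
# Sketch of the TRUE/FALSE logic in base R (helper names are my own).
x <- data.frame(
  Col1 = c("IJ-123", "IJ-123", "IJ-456", "IJ-456",
           "IJ-456", "IJ-789", "IJ-456", "IJ-789"),
  Col2 = c("A2B1", "A2B1", "C2C2", "c2c2",
           "D1e2", "LJ87", NA, "LJ98"),
  stringsAsFactors = FALSE
)
col2 <- tolower(x$Col2)                            # case-insensitive comparison
cnt  <- ave(seq_along(col2), col2, FUN = length)   # like COUNTIF($B:$B, B2)
# number of distinct Col1 values per Col2 group (like the COUNTIFS check)
n_col1 <- ave(as.integer(factor(x$Col1)), col2,
              FUN = function(v) length(unique(v)))
# Col1 values that occur alongside a repeated Col2 (like the SUMPRODUCT part)
multi_col1 <- unique(x$Col1[!is.na(col2) & cnt > 1])
x$Result <- ifelse(cnt > 1, n_col1 == 1L, !(x$Col1 %in% multi_col1))
x$Result[is.na(col2)] <- NA                        # blank Col2 handled separately
x$Result
# TRUE TRUE TRUE TRUE FALSE TRUE NA TRUE
```

Because everything here is vectorized, this should be far faster than the per-row Excel formula on ~200k records.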
Answer 0 (score: 1)
As usual, I suggest using data.table:
library(data.table)
setDT(x) # convert your data.frame to data.table to unlock syntax
# convert to lowercase
x[ , Col2 := tolower(Col2)]
# count how many observations are associated with each Col2 value
x[ , col2_count := .N, by = Col2]
# first deal with rows where Col2 is non-missing
x[!is.na(Col2), Result := {
  # when there's more than one value in Col2,
  #   TRUE if and only if there's exactly one unique value in Col1
  if (.N > 1) uniqueN(Col1) == 1L
  # otherwise, TRUE if and only if Col1 is _not_ found among the
  #   Col1 values associated with the Col2 rows for which there are
  #   multiple observations of that Col2 (i.e., col2_count > 1)
  else !Col1 %in% x[col2_count > 1, unique(Col1)]
}, by = Col2]
# now, deal with the missing-rows case, adding a flag to
#   record that we've done so
x[is.na(Col2), c('Col2', 'col2_flag') :=
    # use the rows of the subset data.table to look up
    #   the non-missing rows from x with the same Col1,
    #   and take the _first_ observed value of Col2
    x[!is.na(Col1)][copy(.SD), .(Col2, TRUE), on = 'Col1', mult = 'first']
]
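Running those steps end-to-end on the example data (a self-contained re-run, just to show the output one should expect):

```r
# Self-contained re-run of the steps above on the example data.
library(data.table)
x <- data.table(
  Col1 = c("IJ-123", "IJ-123", "IJ-456", "IJ-456",
           "IJ-456", "IJ-789", "IJ-456", "IJ-789"),
  Col2 = c("A2B1", "A2B1", "C2C2", "c2c2",
           "D1e2", "LJ87", NA, "LJ98")
)
x[ , Col2 := tolower(Col2)]
x[ , col2_count := .N, by = Col2]
x[!is.na(Col2), Result := if (.N > 1) uniqueN(Col1) == 1L
                          else !Col1 %in% x[col2_count > 1, unique(Col1)],
  by = Col2]
x[is.na(Col2), c('Col2', 'col2_flag') :=
    x[!is.na(Col1)][copy(.SD), .(Col2, TRUE), on = 'Col1', mult = 'first']]
x$Result
# TRUE TRUE TRUE TRUE FALSE TRUE NA TRUE
```

Note that for the blank row this approach fills Col2 itself (with the lowercased `"c2c2"`) and marks it via `col2_flag`, rather than putting the mapped value into Result.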
Answer 1 (score: 1)
Try dplyr:
require(dplyr)
x$Col2 <- toupper(x$Col2) # make all letters the same case
x_assigned <- x %>% group_by(Col2, Col1) %>%
  summarise(n = n()) %>% # counts the number of occurrences
  group_by(Col1) %>% arrange(desc(n)) %>% # arranges so that the highest count per Col1 comes first
  mutate(assigned = if (first(n) == 1) { # this conditional assigns the 'correct' Col2 value to your Col1 value
    Col2
  } else if (first(n) > 1) {
    first(Col2)
  },
  test = assigned == Col2)
x_assigned
# A tibble: 6 x 5
# Groups:   Col1 [3]
  Col2  Col1       n assigned test
  <chr> <chr>  <int> <chr>    <lgl>
1 A2B1  IJ-123     2 A2B1     TRUE
2 C2C2  IJ-456     2 C2C2     TRUE
3 D1E2  IJ-456     1 C2C2     FALSE
4 LJ87  IJ-789     1 LJ87     TRUE
5 LJ98  IJ-789     1 LJ98     TRUE
6 <NA>  IJ-456     1 C2C2     NA
To get your desired result, you can then do a simple left join of x and x_assigned:
left_join(x, x_assigned, by = c('Col1', 'Col2'))
This way you can see where values are missing and easily assign your 'correct' Col2 values. Sorry if I misunderstood your question; I'm still not sure how you assign the 'correct' Col2 value to a Col1 value.
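If it helps, one way to turn that join into the exact Result column from the question is sketched below. The `Result` construction via `ifelse` is my addition, not part of the answer above, and I have collapsed the answer's if/else-if into a single if/else:

```r
# Sketch: derive the question's Result column from the dplyr approach.
library(dplyr)
x <- data.frame(
  Col1 = c("IJ-123", "IJ-123", "IJ-456", "IJ-456",
           "IJ-456", "IJ-789", "IJ-456", "IJ-789"),
  Col2 = c("A2B1", "A2B1", "C2C2", "c2c2",
           "D1e2", "LJ87", NA, "LJ98"),
  stringsAsFactors = FALSE
)
x$Col2 <- toupper(x$Col2)
x_assigned <- x %>%
  group_by(Col2, Col1) %>%
  summarise(n = n(), .groups = "drop") %>%
  group_by(Col1) %>%
  arrange(desc(n), .by_group = TRUE) %>%
  mutate(assigned = if (first(n) == 1) Col2 else first(Col2),
         test = assigned == Col2) %>%
  ungroup()
out <- left_join(x, x_assigned, by = c("Col1", "Col2")) %>%
  # blank Col2: show the assigned value; otherwise the TRUE/FALSE test
  mutate(Result = ifelse(is.na(Col2), assigned, test))
out$Result
# "TRUE" "TRUE" "TRUE" "TRUE" "FALSE" "TRUE" "C2C2" "TRUE"
```

Because `ifelse` mixes a character column with a logical one, Result comes out as character, which happens to match the question's display (TRUE/FALSE strings plus the mapped value for the blank row).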
Answer 2 (score: 0)
I would start by aggregating the data and adding the aggregates as two extra columns:
library(dplyr)
# Create dummy dataframe
Col1 <- c("IJ-123", "IJ-123", "IJ-456", "IJ-456", "IJ-456", "IJ-789", "IJ-456", "IJ-789")
Col2 <- c("A2B1", "A2B1", "C2C2", "c2c2", "D1e2", "LJ87", "C2C2", "LJ98")
df <- data.frame(Col1, Col2)
# Aggregate data - Col2 Vs Col1 and Col1 Vs Col2
Col2vsCol1 <- aggregate(Col1 ~ Col2, data = df, paste, collapse = ",")
colnames(Col2vsCol1)[2] <- "Col2vsCol1"
Col1vsCol2 <- aggregate(Col2 ~ Col1, data = df, paste, collapse = ",")
colnames(Col1vsCol2)[2] <- "Col1vsCol2"
# Outer join these as two extra columns to original df:
df <- merge(x = df, y = Col2vsCol1, by = "Col2", all = TRUE)
df <- merge(x = df, y = Col1vsCol2, by = "Col1", all = TRUE)
Then you can use these columns to perform logical checks on the following:
+----------------------------------------------------------+
| Col1 Col2 Col2vsCol1 Col1vsCol2 |
+----------------------------------------------------------+
| 1 IJ-123 A2B1 IJ-123,IJ-123 A2B1,A2B1 |
| 2 IJ-123 A2B1 IJ-123,IJ-123 A2B1,A2B1 |
| 3 IJ-456 c2c2 IJ-456 C2C2,c2c2,D1e2,C2C2 |
| 4 IJ-456 C2C2 IJ-456,IJ-456 C2C2,c2c2,D1e2,C2C2 |
| 5 IJ-456 C2C2 IJ-456,IJ-456 C2C2,c2c2,D1e2,C2C2 |
| 6 IJ-456 D1e2 IJ-456 C2C2,c2c2,D1e2,C2C2 |
| 7 IJ-789 LJ87 IJ-789 LJ87,LJ98 |
| 8 IJ-789 LJ98 IJ-789 LJ87,LJ98 |
+----------------------------------------------------------+
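The answer stops at building the helper columns, so here is one possible shape for the logical check on top of them. This is my sketch (the `Result` name and the `mapply` helper are my additions); the case-insensitive counting rule comes from the question:

```r
# Recreate the aggregated frame, then derive Result from Col1vsCol2:
# a Col2 value is TRUE if it repeats (case-insensitively) within its
# Col1 group, or if no Col2 value in that group repeats at all.
Col1 <- c("IJ-123", "IJ-123", "IJ-456", "IJ-456",
          "IJ-456", "IJ-789", "IJ-456", "IJ-789")
Col2 <- c("A2B1", "A2B1", "C2C2", "c2c2", "D1e2", "LJ87", "C2C2", "LJ98")
df <- data.frame(Col1, Col2, stringsAsFactors = FALSE)
Col1vsCol2 <- aggregate(Col2 ~ Col1, data = df, paste, collapse = ",")
colnames(Col1vsCol2)[2] <- "Col1vsCol2"
df <- merge(df, Col1vsCol2, by = "Col1", all = TRUE)
df$Result <- mapply(function(c2, grp) {
  tab <- table(toupper(strsplit(grp, ",")[[1]]))  # counts per Col2 value
  if (tab[[toupper(c2)]] > 1) TRUE else all(tab == 1)
}, df$Col2, df$Col1vsCol2)
```

Note that this dummy data follows the answer above and uses "C2C2" in place of the blank cell, so the blank-Col2 case is not exercised here; every row except the D1e2 one comes out TRUE, matching the question's expected output.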