For each row of a data frame, check whether a duplicate value exists

Asked: 2016-09-07 15:52:07

Tags: r dataframe

My data frame contains the following values:

URL                  Response.Code Count
www.site.com/page1   200             4
www.site.com/page1   301             1
www.site.com/page2   200             5
www.site.com/page3   301             4
www.site.com/page4   200             4
www.site.com/page4   403             1

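For each unique value of URL, I want to know whether more than one Response.Code value exists. If only one URL / Response.Code combination exists, the URL is consistent. The desired output is a data frame like this: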

  URL                  Consistent
  www.site.com/page1   FALSE
  www.site.com/page2   TRUE
  www.site.com/page3   TRUE
  www.site.com/page4   FALSE  

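I could loop over each unique URL and count the distinct values in Response.Code, but that doesn't seem like the R way to solve this.

Any suggestions on the best way to approach this? I'm new to R and SO, and I checked several questions here about duplicates, but couldn't find a solution for this specific problem.

4 answers:

Answer 0: (score: 3)

You can use base R aggregate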

aggregate(Response.Code~URL, df, length)[2] == 1

#     Response.Code
#[1,]         FALSE
#[2,]         TRUE
#[3,]         TRUE
#[4,]         FALSE

If you want the output in the desired format, you can do

agg <- aggregate(Response.Code~URL, df, length)
new_df <- data.frame(URL = agg$URL, Consistent = agg$Response.Code == 1)
new_df
#    URL               Consistent
#1 www.site.com/page1      FALSE
#2 www.site.com/page2      TRUE
#3 www.site.com/page3      TRUE
#4 www.site.com/page4      FALSE
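
Note that length counts the rows per URL, which matches the number of distinct codes only when each URL / Response.Code pair appears once. If the same pair could appear more than once, counting distinct codes may be safer. A minimal sketch of that variation, assuming a data frame named df like the one in the question; the extra page2 row is hypothetical and added only to show the difference:

```r
# Sample data from the question, plus one hypothetical repeated row
# (page2 / 200 appears twice) to show why distinct counts matter.
df <- data.frame(
  URL = c("www.site.com/page1", "www.site.com/page1", "www.site.com/page2",
          "www.site.com/page2", "www.site.com/page3", "www.site.com/page4",
          "www.site.com/page4"),
  Response.Code = c(200L, 301L, 200L, 200L, 301L, 200L, 403L),
  Count = c(4L, 1L, 5L, 2L, 4L, 4L, 1L),
  stringsAsFactors = FALSE
)

# Count distinct response codes per URL instead of rows per URL.
agg <- aggregate(Response.Code ~ URL, df, function(x) length(unique(x)))

# A URL is consistent when it has exactly one distinct response code.
result <- data.frame(URL = agg$URL, Consistent = agg$Response.Code == 1)
result
```

With plain length, page2 would have a count of 2 here and be reported as inconsistent even though it only ever returned 200.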

Answer 1: (score: 2)

We can use data.table. Convert the 'data.frame' to a 'data.table' (setDT(df1)), group by 'URL', and check whether the number of rows is equal to 1.

library(data.table)
setDT(df1)[, .(Consistent = .N ==1), by = URL]
#                 URL Consistent
#1: www.site.com/page1      FALSE
#2: www.site.com/page2       TRUE
#3: www.site.com/page3       TRUE
#4: www.site.com/page4      FALSE

Or, if we want to check whether the length of the unique elements in 'Response.Code' is 1, we can use uniqueN after grouping by 'URL'

setDT(df1)[, .(Consistent = uniqueN(Response.Code)==1), by = URL]
#                  URL Consistent
#1: www.site.com/page1      FALSE
#2: www.site.com/page2       TRUE
#3: www.site.com/page3       TRUE
#4: www.site.com/page4      FALSE

Answer 2: (score: 1)

We can also complete the hat trick (base, data.table and dplyr)

df1 <- structure(list(URL = c("www.site.com/page1", "www.site.com/page1", 
    "www.site.com/page2", "www.site.com/page3", "www.site.com/page4", 
    "www.site.com/page4"), Response.Code = c(200L, 301L, 200L, 301L, 
    200L, 403L), Count = c(4L, 1L, 5L, 4L, 4L, 1L)), .Names = c("URL", 
    "Response.Code", "Count"), class = "data.frame", row.names = c(NA, 
    -6L))

library(dplyr)

df1 %>%
  group_by(URL) %>%
  summarise(Consistent = n_distinct(Response.Code) == 1)

Answer 3: (score: 0)

Assuming your data frame is named x, one thing you could run is

x$consistent <- duplicated(x[,1:2]) | duplicated(x[,1:2], fromLast = TRUE)

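This checks for duplicates in the first two columns only and writes a TRUE/FALSE value to a new column. By default, duplicated() does not return TRUE for every instance of a duplicated row: the first instance is FALSE and all later instances are TRUE. By running it both with and without fromLast = TRUE, I make sure that all instances end up as TRUE in x$consistent.

If you want the output to look just like what you described, you can run this to remove the duplicate URLs and the extra columns:

y <- x[!duplicated(x$URL), c(1,4)]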

This gets the result you're looking for, but if you're interested in anything else, I'd suggest reading the documentation for duplicated().
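
One thing to watch: in the question's sample data no URL / Response.Code pair repeats, so a check on the first two columns flags no rows at all. A possible variation, not the code from this answer, is to check the URL column alone and negate the result. This is only a sketch, and it assumes each URL / Response.Code pair appears at most once, so that a repeated URL always means more than one response code; the Consistent column name is chosen to match the question's desired output:

```r
# Sample data from the question; `x` is the name this answer assumes.
x <- data.frame(
  URL = c("www.site.com/page1", "www.site.com/page1", "www.site.com/page2",
          "www.site.com/page3", "www.site.com/page4", "www.site.com/page4"),
  Response.Code = c(200L, 301L, 200L, 301L, 200L, 403L),
  Count = c(4L, 1L, 5L, 4L, 4L, 1L),
  stringsAsFactors = FALSE
)

# TRUE for every row whose URL occurs more than once
# (both the first instance and all later instances).
repeated <- duplicated(x$URL) | duplicated(x$URL, fromLast = TRUE)

# A URL that is never repeated has exactly one row, hence one response code.
x$Consistent <- !repeated

# Keep one row per URL and only the URL and Consistent columns.
y <- x[!duplicated(x$URL), c("URL", "Consistent")]
y
```

Selecting the columns by name rather than by position avoids depending on where the new column ends up.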