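Merge based on multiple matching columns

Time: 2017-02-27 21:24:55

Tags: r

Below are the summary and structure of the two datasets I am trying to merge, claimants and unemp. They can be found in claims.csv and unemp.csv.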

 > tbl_df(claimants)
# A tibble: 6,960 × 5
       X       County  Month  Year Claimants
   <int>       <fctr> <fctr> <int>     <int>
1      1      ALAMEDA    Jan  2007     13034
2      2       ALPINE    Jan  2007        12
3      3       AMADOR    Jan  2007       487
4      4        BUTTE    Jan  2007      3496
5      5    CALAVERAS    Jan  2007       644
6      6       COLUSA    Jan  2007      1244
7      7 CONTRA COSTA    Jan  2007      8475
8      8    DEL NORTE    Jan  2007       328
9      9    EL DORADO    Jan  2007      2120
10    10       FRESNO    Jan  2007     19974
# ... with 6,950 more rows


> tbl_df(unemp)
# A tibble: 6,960 × 7
    County  Year Month laborforce emplab unemp unemprate
*    <chr> <int> <chr>      <int>  <int> <int>     <dbl>
1  Alameda  2007   Jan     743100 708300 34800       4.7
2  Alameda  2007   Feb     744800 711000 33800       4.5
3  Alameda  2007   Mar     746600 713200 33300       4.5
4  Alameda  2007   Apr     738200 705800 32400       4.4
5  Alameda  2007   May     739100 707300 31800       4.3
6  Alameda  2007   Jun     744900 709100 35800       4.8
7  Alameda  2007   Jul     749600 710900 38700       5.2
8  Alameda  2007   Aug     746700 709600 37000       5.0
9  Alameda  2007   Sep     748200 712100 36000       4.8
10 Alameda  2007   Oct     749000 713000 36100       4.8
# ... with 6,950 more rows

My first thought was that I should convert all factor columns to character columns.

unemp[sapply(unemp, is.factor)] <- lapply(unemp[sapply(unemp, is.factor)], as.character)

claimants[sapply(claimants, is.factor)] <- lapply(claimants[sapply(claimants, is.factor)], as.character)

m <- merge(unemp, claimants, by = c("County", "Month", "Year"))
dim(m)
[1]  0 10

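As the output of dim(m) shows, the resulting data frame has 0 rows. All 6,960 rows should match uniquely.

To verify that both data frames contain the same unique combinations of the three columns 'County', 'Month' and 'Year', I sort the rows by these columns and rearrange the columns as follows: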

a <- claimants[ order(claimants[,"County"], claimants[,"Month"], claimants[,"Year"]), ]

b <- unemp[ order(unemp[,"County"], unemp[,"Month"], unemp[,"Year"]), ]

b[2:4] <- b[c(2,4,3)]
a[2:4] %in% b[2:4]
[1] TRUE TRUE TRUE

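The last output confirms that the 'County', 'Month' and 'Year' columns match each other across the two data frames.

I have read the documentation for merge but could not work out where I went wrong. I also tried the inner_join function from dplyr: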

> m <- inner_join(unemp[2:8], claimants[2:5])
Joining, by = c("County", "Year", "Month")
> dim(m)
[1] 0 8 

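I am missing something and do not know what it is, so any help understanding this would be much appreciated. I know I should not have to sort the rows by the three columns before running merge: R should identify the matching rows and combine the non-matching columns.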

1 Answer:

Answer 0: (score: 2)

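The claimants df has every county in uppercase (ALAMEDA), while the unemp df does not (Alameda), so none of the County keys match.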
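A minimal sketch of that diagnosis and fix, using two tiny made-up data frames in place of the real claimants and unemp (the values are illustrative only and are not read from the CSV files). It assumes the case of County is the only difference between the keys: setdiff lists the county names that have no counterpart, and toupper normalizes them before merging.

```r
# Toy stand-ins for the real data frames (illustrative values only)
claimants <- data.frame(County = c("ALAMEDA", "ALPINE"),
                        Month = c("Jan", "Jan"),
                        Year = c(2007L, 2007L),
                        Claimants = c(13034L, 12L),
                        stringsAsFactors = FALSE)
unemp <- data.frame(County = c("Alameda", "Alpine"),
                    Year = c(2007L, 2007L),
                    Month = c("Jan", "Jan"),
                    unemprate = c(4.7, 9.9),
                    stringsAsFactors = FALSE)

# County names in unemp that never appear in claimants
unmatched <- setdiff(unique(unemp$County), unique(claimants$County))

# Without normalizing the case, the merge finds no matching rows
m_before <- merge(unemp, claimants, by = c("County", "Month", "Year"))

# Normalize the case of the key column, then merge again
unemp$County <- toupper(unemp$County)
m_after <- merge(unemp, claimants, by = c("County", "Month", "Year"))

print(unmatched)
print(nrow(m_before))
print(nrow(m_after))
```

Running a setdiff like this on each key column before merging is a quick way to see which column is responsible when a join unexpectedly returns zero rows.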

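I used options(stringsAsFactors = FALSE) when reading in the data. One suggestion: drop the X column from both data frames, since it does not seem to be useful.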
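As a rough sketch of those two suggestions, the snippet below reads a small CSV with text columns kept as character and then drops X. It passes stringsAsFactors = FALSE to read.csv directly, which is a variation on the global option mentioned above, and it writes a three-line temporary file (made up for this example) so that it runs without claims.csv. Since R 4.0.0 the default is already FALSE, so the argument mainly matters on older versions.

```r
# Write a tiny CSV to a temporary file so the sketch is self-contained;
# the real input would be claims.csv
tmp <- tempfile(fileext = ".csv")
writeLines(c("X,County,Month,Year,Claimants",
             "1,ALAMEDA,Jan,2007,13034",
             "2,ALPINE,Jan,2007,12"), tmp)

# Keep text columns as character instead of converting them to factors
claims_demo <- read.csv(tmp, stringsAsFactors = FALSE)

# Drop the row-index column X, which carries no information
claims_demo$X <- NULL

str(claims_demo)
unlink(tmp)
```

With the columns read as character from the start, the sapply/lapply conversion step is no longer needed before merging.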

#test lists--li is the original one provided by Button
li = [0, 2, [[2, 3], 8, 100, None, [[None]]], -2]
li1 = [-100, -100, [[[None,None]]]]
li2 = [[[[[None,None,1,2,3]]]], 6, 0, 0, 0]
li3 = [None, [None], 56, 78, None]
li4 = [[[[[None,1,2,3]]]], 6, 0, 0, 0]

#every element has to be visited at least once, so no
#solution can do better than linear time; each slice
#assignment below also shifts the rest of the list, so
#the worst case of this version is higher than O(n)

def flatten(li):
    i = 0
    while i < len(li):

        #only execute if the element is a list; the bounds check
        #guards against an empty nested list shrinking li
        while i < len(li) and isinstance(li[i], list):

            #replace the slice holding the nested list with that
            #list's own elements, so if li[i] contains a list
            #it is unrolled or 'unlisted' in place
            li[i:i + 1] = li[i]

        i += 1

    #calling li.remove() while looping over li skips the element
    #right after each removal, which is why one None always
    #survived; rebuild the contents in place instead
    li[:] = [element for element in li if element is not None]

    return li

def main():
    flatten(li)
    print(li)
    flatten(li1)
    print(li1)
    flatten(li2)
    print(li2)
    flatten(li3)
    print(li3)
    flatten(li4)
    print(li4)

if __name__ == '__main__':
    main()