Below are the summary and structure of the two data sets I am trying to merge, claimants and unemp. They are available as claims.csv and unemp.csv.
> tbl_df(claimants)
# A tibble: 6,960 × 5
X County Month Year Claimants
<int> <fctr> <fctr> <int> <int>
1 1 ALAMEDA Jan 2007 13034
2 2 ALPINE Jan 2007 12
3 3 AMADOR Jan 2007 487
4 4 BUTTE Jan 2007 3496
5 5 CALAVERAS Jan 2007 644
6 6 COLUSA Jan 2007 1244
7 7 CONTRA COSTA Jan 2007 8475
8 8 DEL NORTE Jan 2007 328
9 9 EL DORADO Jan 2007 2120
10 10 FRESNO Jan 2007 19974
# ... with 6,950 more rows
> tbl_df(unemp)
# A tibble: 6,960 × 7
County Year Month laborforce emplab unemp unemprate
* <chr> <int> <chr> <int> <int> <int> <dbl>
1 Alameda 2007 Jan 743100 708300 34800 4.7
2 Alameda 2007 Feb 744800 711000 33800 4.5
3 Alameda 2007 Mar 746600 713200 33300 4.5
4 Alameda 2007 Apr 738200 705800 32400 4.4
5 Alameda 2007 May 739100 707300 31800 4.3
6 Alameda 2007 Jun 744900 709100 35800 4.8
7 Alameda 2007 Jul 749600 710900 38700 5.2
8 Alameda 2007 Aug 746700 709600 37000 5.0
9 Alameda 2007 Sep 748200 712100 36000 4.8
10 Alameda 2007 Oct 749000 713000 36100 4.8
# ... with 6,950 more rows
My first thought was that I should convert all factor columns to character columns.
unemp[sapply(unemp, is.factor)] <- lapply(unemp[sapply(unemp, is.factor)], as.character)
claimants[sapply(claimants, is.factor)] <- lapply(claimants[sapply(claimants, is.factor)], as.character)
m <-merge(unemp, claimants, by = c("County", "Month", "Year"))
dim(m)
[1] 0 10
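As the output of dim(m) shows, the resulting data frame has 0 rows. All 6,960 rows should match uniquely.
To verify that both data frames contain the same unique combinations of the three columns 'County', 'Month' and 'Year', I sorted the rows and rearranged these columns in the data frames as follows: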
a <- claimants[ order(claimants[,"County"], claimants[,"Month"], claimants[,"Year"]), ]
b <- unemp[ order(unemp[,"County"], unemp[,"Month"], unemp[,"Year"]), ]
b[2:4] <- b[c(2,4,3)]
a[2:4] %in% b[2:4]
[1] TRUE TRUE TRUE
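As far as I can tell, the last output confirms that the 'County', 'Month' and 'Year' columns match each other across the two data frames.
I have looked at the documentation for merge but could not work out what is going wrong. I also tried the inner_join function from dplyr: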
> m <- inner_join(unemp[2:8], claimants[2:5])
Joining, by = c("County", "Year", "Month")
> dim(m)
[1] 0 8
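I am missing something and do not know what it is, so I would greatly appreciate help understanding this. I know I should not have to sort the rows by the three columns before running merge. R should identify the matching rows and merge in the non-matching columns.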
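Answer 0 (score: 2)
The claimants df has all of its counties in upper case, while the unemp df does not (ALAMEDA vs. Alameda), so the County keys never match.
I used options(stringsAsFactors = FALSE) when reading in your data. One suggestion: drop the X column in both data frames, since it does not seem useful.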
#test lists--li is the original one provided by Button
li = [0, 2, [[2, 3], 8, 100, None, [[None]]], -2]
li1 = [-100, -100, [[[None,None]]]]
li2 = [[[[[None,None,1,2,3]]]], 6, 0, 0, 0]
li3 = [None, [None], 56, 78, None]
li4 = [[[[[None,1,2,3]]]], 6, 0, 0, 0]

#solution is theta(n) or more specifically O(n)
#which is the best case solution since we must
#loop the entire list
def flatten(li):
    i = 0
    while i < len(li):
        #only execute if the element is a list
        while isinstance(li[i], list):
            #takes the element at index i and sets it as the
            #i'th part of the list. so if li[i] contains a list
            #it is then unrolled or 'unlisted'
            li[i:i + 1] = li[i]
        i += 1
    #for li: for some reason the 2nd None at
    #index 7 is an int, probably because there
    #might've been an int at that index before manipulation?
    #for li1: the 2nd None or element at index 3
    #is of class 'NoneType' but the removal is not
    #occurring..
    for element in li:
        if element is None:
            li.remove(element)
    #conclusion: there is always one None remaining if
    #there is more than one None to begin with..
    return li

def main():
    flatten(li)
    print(li)
    flatten(li1)
    print(li1)
    flatten(li2)
    print(li2)
    flatten(li3)
    print(li3)
    flatten(li4)
    print(li4)

if __name__ == '__main__':
    main()
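The comments in flatten note that a None can survive the removal loop. A likely cause is that li.remove(element) shortens the list while the for loop is still walking it, so the element that slides into the freed position is skipped. The sketch below reproduces that on the flattened form of li and contrasts it with filtering in a single step. The helper name drop_none_in_place is hypothetical and not part of the original code.

```python
def drop_none_in_place(items):
    """Remove every None from items, mutating the same list object."""
    # Build the filtered list first, then overwrite the contents in one
    # slice assignment, so nothing is removed while a loop is iterating.
    items[:] = [x for x in items if x is not None]
    return items


# Flattened form of li from the snippet above.
flat = [0, 2, 2, 3, 8, 100, None, None, -2]

# Original approach: removing during iteration skips the element that
# shifts into the position just vacated.
buggy = list(flat)
for element in buggy:
    if element is None:
        buggy.remove(element)

fixed = drop_none_in_place(list(flat))

print(buggy)  # [0, 2, 2, 3, 8, 100, None, -2]
print(fixed)  # [0, 2, 2, 3, 8, 100, -2]
```

Assigning to items[:] keeps the caller's list object, which matches how main() prints the module-level lists after calling flatten.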