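When I merge two CSV files of the format (date, someValue), I see some duplicate records.
If I cut the number of records in half, the problem goes away. If I double the size of both files, it gets worse. Any help is appreciated!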
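import pandas as pd

# Old pandas API (0.7.x era); later releases replaced from_csv/sort with read_csv/sort_values.
# from_csv uses the first column ('date') as the index, so reset_index moves it back to a column.
i = pd.DataFrame.from_csv('i.csv')
i = i.reset_index()
e = pd.DataFrame.from_csv('e.csv')
e = e.reset_index()
# Left-join the two frames on 'date', then order the result by date.
total_df = pd.merge(i, e, right_index=False, left_index=False,
                    right_on=['date'], left_on=['date'], how='left')
total_df = total_df.sort(column='date')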
(Note the double records for 11/15, 11/16, 12/17, and 12/18.)
In [7]: total_df
Out[7]:
date Cost netCost
25 2012-11-15 00:00:00 1 2
26 2012-11-15 00:00:00 1 2
31 2012-11-16 00:00:00 1 2
32 2012-11-16 00:00:00 1 2
37 2012-11-17 00:00:00 1 2
2 2012-11-18 00:00:00 1 2
5 2012-11-19 00:00:00 1 2
8 2012-11-20 00:00:00 1 2
11 2012-11-21 00:00:00 1 2
14 2012-11-22 00:00:00 1 2
17 2012-11-23 00:00:00 1 2
20 2012-11-24 00:00:00 1 2
23 2012-11-25 00:00:00 1 2
29 2012-11-26 00:00:00 1 2
35 2012-11-27 00:00:00 1 2
0 2012-11-28 00:00:00 1 2
3 2012-11-29 00:00:00 1 2
6 2012-11-30 00:00:00 1 2
9 2012-12-01 00:00:00 1 2
12 2012-12-02 00:00:00 1 2
15 2012-12-03 00:00:00 1 2
18 2012-12-04 00:00:00 1 2
21 2012-12-05 00:00:00 1 2
24 2012-12-06 00:00:00 1 2
30 2012-12-07 00:00:00 1 2
36 2012-12-08 00:00:00 1 2
1 2012-12-09 00:00:00 2 2
4 2012-12-10 00:00:00 2 2
7 2012-12-11 00:00:00 2 2
10 2012-12-12 00:00:00 2 2
13 2012-12-13 00:00:00 1 2
16 2012-12-14 00:00:00 2 2
19 2012-12-15 00:00:00 2 2
22 2012-12-16 00:00:00 2 2
27 2012-12-17 00:00:00 1 2
28 2012-12-17 00:00:00 1 2
33 2012-12-18 00:00:00 1 2
34 2012-12-18 00:00:00 1 2
i.csv:
date,Cost
2012-11-15 00:00:00,1
2012-11-16 00:00:00,1
2012-11-17 00:00:00,1
2012-11-18 00:00:00,1
2012-11-19 00:00:00,1
2012-11-20 00:00:00,1
2012-11-21 00:00:00,1
2012-11-22 00:00:00,1
2012-11-23 00:00:00,1
2012-11-24 00:00:00,1
2012-11-25 00:00:00,1
2012-11-26 00:00:00,1
2012-11-27 00:00:00,1
2012-11-28 00:00:00,1
2012-11-29 00:00:00,1
2012-11-30 00:00:00,1
2012-12-01 00:00:00,1
2012-12-02 00:00:00,1
2012-12-03 00:00:00,1
2012-12-04 00:00:00,1
2012-12-05 00:00:00,1
2012-12-06 00:00:00,1
2012-12-07 00:00:00,1
2012-12-08 00:00:00,1
2012-12-09 00:00:00,2
2012-12-10 00:00:00,2
2012-12-11 00:00:00,2
2012-12-12 00:00:00,2
2012-12-13 00:00:00,1
2012-12-14 00:00:00,2
2012-12-15 00:00:00,2
2012-12-16 00:00:00,2
2012-12-17 00:00:00,1
2012-12-18 00:00:00,1
e.csv:
date,netCost
2012-11-15 00:00:00,2
2012-11-16 00:00:00,2
2012-11-17 00:00:00,2
2012-11-18 00:00:00,2
2012-11-19 00:00:00,2
2012-11-20 00:00:00,2
2012-11-21 00:00:00,2
2012-11-22 00:00:00,2
2012-11-23 00:00:00,2
2012-11-24 00:00:00,2
2012-11-25 00:00:00,2
2012-11-26 00:00:00,2
2012-11-27 00:00:00,2
2012-11-28 00:00:00,2
2012-11-29 00:00:00,2
2012-11-30 00:00:00,2
2012-12-01 00:00:00,2
2012-12-02 00:00:00,2
2012-12-03 00:00:00,2
2012-12-04 00:00:00,2
2012-12-05 00:00:00,2
2012-12-06 00:00:00,2
2012-12-07 00:00:00,2
2012-12-08 00:00:00,2
2012-12-09 00:00:00,2
2012-12-10 00:00:00,2
2012-12-11 00:00:00,2
2012-12-12 00:00:00,2
2012-12-13 00:00:00,2
2012-12-14 00:00:00,2
2012-12-15 00:00:00,2
2012-12-16 00:00:00,2
2012-12-17 00:00:00,2
2012-12-18 00:00:00,2
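
Before blaming the merge itself, it can help to confirm that neither input has repeated dates and to list exactly which merged rows are duplicated. The following is a minimal sketch, assuming a recent pandas release (`read_csv`, `sort_values`) rather than the 0.7.x API used above; the inline CSV strings are shortened stand-ins for i.csv and e.csv, not the full files.

```python
import io
import pandas as pd

# Shortened stand-ins for i.csv and e.csv (same columns, fewer rows).
I_CSV = """date,Cost
2012-11-15 00:00:00,1
2012-11-16 00:00:00,1
2012-11-17 00:00:00,1
"""
E_CSV = """date,netCost
2012-11-15 00:00:00,2
2012-11-16 00:00:00,2
2012-11-17 00:00:00,2
"""

i = pd.read_csv(io.StringIO(I_CSV), parse_dates=["date"])
e = pd.read_csv(io.StringIO(E_CSV), parse_dates=["date"])

# If either key column had repeats, extra rows after a merge would be expected.
inputs_unique = i["date"].is_unique and e["date"].is_unique

total_df = pd.merge(i, e, on="date", how="left").sort_values("date")

# Rows whose date occurs more than once in the merged result.
dupes = total_df[total_df["date"].duplicated(keep=False)]
print(inputs_unique, len(total_df), len(dupes))
```

If `inputs_unique` is true and `dupes` is still non-empty, the extra rows come from the merge step rather than from the data.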
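Answer 0 (score: 1)
This looks like a bug in pandas 0.7.3 or numpy 1.6. It only happens when the merged column is a date (converted internally to numpy.datetime64). My workaround was to convert the dates to strings: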
from datetime import datetime
import pandas as pd

def _DatetimeToString(datetime64):
    # The value holds nanoseconds since the epoch; convert to seconds (Python 2 `long`).
    timestamp = datetime64.astype(long)/1000000000
    return datetime.fromtimestamp(timestamp).strftime('%Y-%m-%d')

# Convert the key column in both frames so the merge compares strings.
i = pd.DataFrame.from_csv('i.csv')
i = i.reset_index()
i['date'] = i['date'].map(_DatetimeToString)
e = pd.DataFrame.from_csv('e.csv')
e = e.reset_index()
e['date'] = e['date'].map(_DatetimeToString)
total_df = pd.merge(i, e, right_index=False, left_index=False,
                    right_on=['date'], left_on=['date'], how='left')
total_df = total_df.sort(column='date')
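
For readers on a current pandas, the same string-key idea can be sketched with today's API. This is an illustration under assumptions, not the original answer's code: `load_with_string_dates` is a hypothetical helper, and the inline CSV text stands in for i.csv and e.csv.

```python
import io
import pandas as pd

def load_with_string_dates(csv_text):
    """Read a (date, value) CSV and replace the date column with 'YYYY-MM-DD' strings."""
    df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])
    df["date"] = df["date"].dt.strftime("%Y-%m-%d")
    return df

# Two-row stand-ins for the real files.
i = load_with_string_dates("date,Cost\n2012-12-17 00:00:00,1\n2012-12-18 00:00:00,1\n")
e = load_with_string_dates("date,netCost\n2012-12-17 00:00:00,2\n2012-12-18 00:00:00,2\n")

# Both key columns are now plain strings, so the merge never compares datetime64 values.
total_df = pd.merge(i, e, on="date", how="left").sort_values("date")
print(total_df)
```

Because ISO-formatted date strings sort lexicographically in date order, `sort_values("date")` still yields chronological output.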
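Answer 1 (score: 1)
I ran into this problem/bug too. I wasn't merging on a datetime series, but I did have a datetime series in the left DataFrame. My solution was to deduplicate: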
len(pophist)
2347
pop_merged = pd.merge(left=pophist, right=df_labels, how='left',
                      left_on='candidate', right_on='Slug', indicator=True)
len(pop_merged)
3303
# Note: deduplication is required due to an issue in how pandas handles datetime dtypes on merge.
pop_merged2 = pop_merged.drop_duplicates()
len(pop_merged2)
2347
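
A left merge also returns more rows than the left frame whenever the right frame repeats a key value, so it may be worth ruling that out before deduplicating the result. The sketch below assumes a pandas version that supports the `validate` argument of `pd.merge`; the two small frames are invented stand-ins for `pophist` and `df_labels`, not the real data.

```python
import pandas as pd

# Invented stand-ins: the right frame repeats the key "b".
pophist = pd.DataFrame({"candidate": ["a", "b", "c"], "votes": [10, 20, 30]})
df_labels = pd.DataFrame({"Slug": ["a", "b", "b", "c"],
                          "label": ["A", "B", "B", "C"]})

# Plain left merge: "b" matches twice, so the result has one extra row.
merged = pd.merge(pophist, df_labels, how="left",
                  left_on="candidate", right_on="Slug")

# validate="many_to_one" raises MergeError if the right-hand keys are not unique.
try:
    pd.merge(pophist, df_labels, how="left",
             left_on="candidate", right_on="Slug", validate="many_to_one")
    keys_ok = True
except pd.errors.MergeError:
    keys_ok = False

# Deduplicating the right-hand keys first keeps the row count equal to the left frame.
deduped_labels = df_labels.drop_duplicates(subset="Slug")
clean = pd.merge(pophist, deduped_labels, how="left",
                 left_on="candidate", right_on="Slug", validate="many_to_one")
print(len(merged), keys_ok, len(clean))
```

If the validated merge passes on the real data and the row count still grows, repeated keys are not the cause and the dtype explanation becomes more plausible.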