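Merging CSV files with Python and pandas (overlapping rows)

Time: 2016-03-13 21:54:45

Tags: python csv pandas

I am trying to update the stock quote data in one CSV file with new rows from another CSV file. Because of the way I retrieve this data, the rows partially overlap. The base stock file contains (simplified example):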

Mar 08, 2016    9692.82     9688.47     9785.05     9617.69     95.75M  -0.88%
Mar 07, 2016    9778.93     9764.08     9803.73     9690.00     78.15M  -0.46%
Mar 04, 2016    9824.17     9800.86     9899.11     9742.76     93.45M  0.74%
Mar 03, 2016    9751.92     9807.06     9808.52     9709.68     85.25M  -0.25%
Mar 02, 2016    9776.62     9780.84     9837.11     9695.98     106.45M     0.61%
Mar 01, 2016    9717.16     9482.66     9719.02     9471.09     99.54M  2.34%
Feb 29, 2016    9495.40     9424.93     9498.57     9332.42     93.79M  -0.19%

This file should be updated with the data from the second file:

Mar 11, 2016    9831.13 9672.05 9833.90 9642.79 118.96M 3.51%
Mar 10, 2016    9498.15 9697.64 9995.84 9498.15 177.50M -2.31%
Mar 09, 2016    9723.09 9700.16 9838.95 9679.19 100.90M 0.31%
Mar 08, 2016    9692.82 9688.47 9785.05 9617.69 95.75M  -0.88%
Mar 07, 2016    9778.93 9764.08 9803.73 9690.00 78.15M  -0.46%

The code I am using to attempt the update is as follows:

existingquotes = pd.read_csv(filenames_quotes[i], parse_dates=[0], infer_datetime_format=True, header=None, delimiter='\t')
newquotes = pd.read_csv(filenames_upd[i], parse_dates=[0], infer_datetime_format=True, header=None, delimiter='\t')
existingquotes.update(newquotes)
mergedquotes=existingquotes
print mergedquotes

The output is as follows:

           0        1        2        3        4        5       6
0 2016-03-11  9831.13  9672.05  9833.90  9642.79  118.96M   3.51%
1 2016-03-10  9498.15  9697.64  9995.84  9498.15  177.50M  -2.31%
2 2016-03-09  9723.09  9700.16  9838.95  9679.19  100.90M   0.31%
3 2016-03-08  9692.82  9688.47  9785.05  9617.69   95.75M  -0.88%
4 2016-03-07  9778.93  9764.08  9803.73  9690.00   78.15M  -0.46%
5 2016-03-01  9717.16  9482.66  9719.02  9471.09   99.54M   2.34%
6 2016-02-29  9495.40  9424.93  9498.57  9332.42   93.79M  -0.19%

There is a gap between 2016-03-01 and 2016-03-07. If I use

existingquotes.update(newquotes, overwrite=False)

the result looks like the original CSV. Thanks for any help!
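A small self-contained sketch of what appears to be happening, using made-up miniature frames rather than the real files (the names `existing` and `new` and the shortened data are assumptions for illustration). `DataFrame.update` matches rows by index label, and with `header=None` and no `index_col` both frames carry the default integer index 0, 1, 2, ..., so the first rows are overwritten in place instead of being matched by date:

```python
import pandas as pd

# Hypothetical miniature frames; column 0 stands in for the date column.
existing = pd.DataFrame({0: ["Mar 08", "Mar 07", "Mar 04", "Mar 03"],
                         1: [9692.82, 9778.93, 9824.17, 9751.92]})
new = pd.DataFrame({0: ["Mar 10", "Mar 09", "Mar 08"],
                    1: [9498.15, 9723.09, 9692.82]})

# update() aligns on the index labels 0, 1, 2 (not on the dates),
# so rows 0..2 of `existing` are replaced and no rows are added.
existing.update(new)
print(existing)
```

In this sketch the rows for Mar 07 and Mar 04 disappear and only Mar 03 survives from the old data, which mirrors the gap seen in the output above.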

2 answers:

Answer 0: (score: 2)

You can first add the parameter index_col=[0] to read_csv so that the first column becomes a DatetimeIndex, then reindex by the union of both indexes, and finally use the function combine_first to fill the NaN values with the values from the DataFrame newquotes:

print existingquotes
                  1        2        3        4        5       6
0
2016-03-08  9692.82  9688.47  9785.05  9617.69   95.75M  -0.88%
2016-03-07  9778.93  9764.08  9803.73  9690.00   78.15M  -0.46%
2016-03-04  9824.17  9800.86  9899.11  9742.76   93.45M   0.74%
2016-03-03  9751.92  9807.06  9808.52  9709.68   85.25M  -0.25%
2016-03-02  9776.62  9780.84  9837.11  9695.98  106.45M   0.61%
2016-03-01  9717.16  9482.66  9719.02  9471.09   99.54M   2.34%
2016-02-29  9495.40  9424.93  9498.57  9332.42   93.79M  -0.19%

print newquotes
                  1        2        3        4        5       6
0
2016-03-11  9831.13  9672.05  9833.90  9642.79  118.96M   3.51%
2016-03-10  9498.15  9697.64  9995.84  9498.15  177.50M  -2.31%
2016-03-09  9723.09  9700.16  9838.95  9679.19  100.90M   0.31%
2016-03-08  9692.82  9688.47  9785.05  9617.69   95.75M  -0.88%
2016-03-07  9778.93  9764.08  9803.73  9690.00   78.15M  -0.46%

existingquotes = existingquotes.reindex(existingquotes.index.union(newquotes.index))
print existingquotes
                  1        2        3        4        5       6
0
2016-02-29  9495.40  9424.93  9498.57  9332.42   93.79M  -0.19%
2016-03-01  9717.16  9482.66  9719.02  9471.09   99.54M   2.34%
2016-03-02  9776.62  9780.84  9837.11  9695.98  106.45M   0.61%
2016-03-03  9751.92  9807.06  9808.52  9709.68   85.25M  -0.25%
2016-03-04  9824.17  9800.86  9899.11  9742.76   93.45M   0.74%
2016-03-07  9778.93  9764.08  9803.73  9690.00   78.15M  -0.46%
2016-03-08  9692.82  9688.47  9785.05  9617.69   95.75M  -0.88%
2016-03-09      NaN      NaN      NaN      NaN      NaN     NaN
2016-03-10      NaN      NaN      NaN      NaN      NaN     NaN
2016-03-11      NaN      NaN      NaN      NaN      NaN     NaN

If the overlapping values in the two DataFrames differ, you can add:

# requires: import numpy as np
existingquotes.loc[existingquotes.index.intersection(newquotes.index),:] = np.nan

But in this sample they are the same, so this step can be omitted.

print existingquotes.combine_first(newquotes)
                  1        2        3        4        5       6
0
2016-02-29  9495.40  9424.93  9498.57  9332.42   93.79M  -0.19%
2016-03-01  9717.16  9482.66  9719.02  9471.09   99.54M   2.34%
2016-03-02  9776.62  9780.84  9837.11  9695.98  106.45M   0.61%
2016-03-03  9751.92  9807.06  9808.52  9709.68   85.25M  -0.25%
2016-03-04  9824.17  9800.86  9899.11  9742.76   93.45M   0.74%
2016-03-07  9778.93  9764.08  9803.73  9690.00   78.15M  -0.46%
2016-03-08  9692.82  9688.47  9785.05  9617.69   95.75M  -0.88%
2016-03-09  9723.09  9700.16  9838.95  9679.19  100.90M   0.31%
2016-03-10  9498.15  9697.64  9995.84  9498.15  177.50M  -2.31%
2016-03-11  9831.13  9672.05  9833.90  9642.79  118.96M   3.51%

Instead of combine_first you can use fillna.
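The steps in this answer can be sketched end to end without touching any files. This is only an illustration under stated assumptions: the two inputs are shortened to three rows and two value columns and read from in-memory strings, and the names `EXISTING_TSV`, `NEW_TSV`, `read_quotes` and `merged` are hypothetical, not part of the original code.

```python
import io

import numpy as np
import pandas as pd

# Shortened, tab-separated excerpts standing in for the two files
# (date, close, change); the real files have more columns and rows.
EXISTING_TSV = (
    "Mar 08, 2016\t9692.82\t-0.88%\n"
    "Mar 07, 2016\t9778.93\t-0.46%\n"
    "Mar 04, 2016\t9824.17\t0.74%\n"
)
NEW_TSV = (
    "Mar 09, 2016\t9723.09\t0.31%\n"
    "Mar 08, 2016\t9692.82\t-0.88%\n"
    "Mar 07, 2016\t9778.93\t-0.46%\n"
)


def read_quotes(text):
    """Parse tab-separated quotes and use the date column as the index."""
    return pd.read_csv(io.StringIO(text), sep="\t", header=None,
                       index_col=0, parse_dates=[0])


existing = read_quotes(EXISTING_TSV)
new = read_quotes(NEW_TSV)

# Add the dates that only occur in `new`; the added rows are all NaN.
merged = existing.reindex(existing.index.union(new.index))

# Optional step: blank out the overlapping dates so that the values
# from `new` are taken there as well.
merged.loc[merged.index.intersection(new.index), :] = np.nan

# Fill every NaN cell from `new`, aligned on the date index.
merged = merged.combine_first(new)
print(merged)
```

The result is sorted by date in ascending order, because the union of the two indexes is sorted. Rows that exist only in the old data (Mar 04 here) are kept unchanged.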

Answer 1: (score: 0)

Thanks everyone, it works like a charm. The final code looks like this:

existingquotes = pd.read_csv(filenames_quotes[i], index_col=[0], parse_dates=[0], infer_datetime_format=True, header=None, delimiter='\t')
newquotes = pd.read_csv(filenames_upd[i], index_col=[0], parse_dates=[0], infer_datetime_format=True, header=None, delimiter='\t')

existingquotes =  existingquotes.reindex(existingquotes.index.union(newquotes.index))
existingquotes = existingquotes.fillna(newquotes)

print existingquotes

and it produces the expected result (the same as the one jezrael posted).
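For reference, a minimal sketch of the `fillna` variant on small hand-built frames (the data is shortened to one value column and the names `existing`, `new` and `merged` are assumptions for illustration). The final `sort_index` call is not part of the code above; it is added here only because the source files list the newest date first.

```python
import pandas as pd

# Hypothetical miniature frames indexed by date, newest first as in the files.
dates_old = pd.to_datetime(["2016-03-08", "2016-03-07", "2016-03-04"])
dates_new = pd.to_datetime(["2016-03-09", "2016-03-08", "2016-03-07"])
existing = pd.DataFrame({1: [9692.82, 9778.93, 9824.17]}, index=dates_old)
new = pd.DataFrame({1: [9723.09, 9692.82, 9778.93]}, index=dates_new)

# Extend the index with the new dates, then fill the NaN rows from `new`.
# fillna() with a DataFrame argument aligns on index and column labels.
merged = existing.reindex(existing.index.union(new.index))
merged = merged.fillna(new)

# Assumption: restore the newest-first order used by the source files.
merged = merged.sort_index(ascending=False)
print(merged)
```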