How to merge multiple database selects into a single dataset by date

Time: 2019-06-27 00:15:07

Tags: python pandas

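I have a database that receives new data on every iteration, and I am trying to merge the individual selections on their datetime column.

I am using this piece of code: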

# Iterate by days

    for row in rows:
        i += 1
        df_name = f"{row[0]}_{row[1]}"
        print(f"Getting {df_name} {i}/{len(rows)}{spaces}", end="\r")

        if predictionPoint == row[0]:
            currentDf = pd.read_sql(f"SELECT updated_at, c as '{df_name}_c', "
                                    f"v as '{df_name}_v', o FROM commons "
                                    f"WHERE cid LIKE '{predictionMeasure}%' AND s = '{row[0]}' AND cid = '{row[1]}' "
                                    "ORDER BY updated_at DESC", con=sqlite)
        else:
            currentDf = pd.read_sql(f"SELECT updated_at, c as '{df_name}_c', "
                                    f"v as '{df_name}_v' FROM commons "
                                    f"WHERE cid LIKE '{predictionMeasure}%' AND s = '{row[0]}' AND cid = '{row[1]}' "
                                    "ORDER BY updated_at DESC", con=sqlite)

        currentDf["updated_at"] = currentDf["updated_at"].apply(convertDatetime)

        if not df.empty:
            df = pd.merge(left=df, right=currentDf, on="updated_at", how="inner")
        else:
            df = currentDf

    if not os.path.exists(f"{dirName}/{datasetFilename}"):
        df.to_csv(f"{dirName}/{datasetFilename}", encoding="utf-8", index=False)
    else:
        tempDf = pd.read_csv(f"{dirName}/{datasetFilename}", parse_dates=["updated_at"])
        df = pd.concat([tempDf, df], axis=0, sort=False)
        df.to_csv(f"{dirName}/{datasetFilename}", encoding="utf-8", index=False)

    print(f"Dataset created {a}/{len(archives)}{spaces}")


df = pd.read_csv(f"{dirName}/{datasetFilename}", parse_dates=["updated_at"])
df = df.set_index("updated_at", drop=False)

print("Sorting, filling N/A, cleaning...")
df = df.sort_index(ascending=False)

df = df.ffill().bfill()

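I get an error at line 27 of this code: it returns an unmerged result that either contains duplicated updated_at entries or is a truncated DataFrame, but I expect: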

updated_at           one  two  three  four
2019-06-02 23:59:45    1    2      3     4
2019-06-02 23:59:30    2    3      4     5
2019-06-02 23:59:15    3    4      5     6
2019-06-02 23:59:00    4    5      6     7
2019-06-02 23:58:45    5    6      7     8
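
One way to arrive at a table of this shape is an outer join on updated_at instead of an inner one, so that a timestamp missing from one selection yields a NaN rather than dropping the whole row. The following is only a minimal sketch under stated assumptions: the helper name merge_on_timestamp and the two sample frames are hypothetical, and each per-series frame is assumed to hold unique updated_at values.

```python
from functools import reduce

import pandas as pd


def merge_on_timestamp(frames):
    """Outer-join frames on 'updated_at', newest rows first (hypothetical helper).

    Assumes every frame has an 'updated_at' column with unique values.
    """
    merged = reduce(
        lambda left, right: pd.merge(left, right, on="updated_at", how="outer"),
        frames,
    )
    merged = merged.sort_values("updated_at", ascending=False)
    # ffill copies the value from the row above (a newer timestamp in this
    # ordering); bfill then covers any gaps left at the very top.
    return merged.ffill().bfill().reset_index(drop=True)


# Hypothetical sample data: 'two' has no reading at 23:59:30.
one = pd.DataFrame({
    "updated_at": pd.to_datetime(
        ["2019-06-02 23:59:45", "2019-06-02 23:59:30", "2019-06-02 23:59:15"]
    ),
    "one": [1, 2, 3],
})
two = pd.DataFrame({
    "updated_at": pd.to_datetime(["2019-06-02 23:59:45", "2019-06-02 23:59:15"]),
    "two": [2, 4],
})

wide = merge_on_timestamp([one, two])
print(wide)
```

With how="inner" the 23:59:30 row would disappear, because that timestamp is absent from the second frame; this may be what produces the truncated DataFrame.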

Because the data is added by merging, there should be no duplicated updated_at values and no gaps. I have already tried joins and other types of merge...
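
Duplicated updated_at rows can also come from the append step, where the new frame is stacked on top of the rows already stored in the CSV. Below is a hedged sketch of deduplicating at that point. It assumes that the newly read row should win when a timestamp occurs in both frames; append_unique is a hypothetical helper, not part of pandas, and the in-memory frame named stored stands in for the result of pd.read_csv.

```python
import pandas as pd


def append_unique(existing, new, key="updated_at"):
    """Stack two frames and keep one row per key (hypothetical helper).

    When the same key appears in both frames, the row from `new` is kept.
    """
    combined = pd.concat([existing, new], axis=0, sort=False)
    combined = combined.drop_duplicates(subset=key, keep="last")
    return combined.sort_values(key, ascending=False).reset_index(drop=True)


# Hypothetical sample data: 23:59:00 occurs in both frames.
stored = pd.DataFrame({
    "updated_at": pd.to_datetime(["2019-06-02 23:59:00", "2019-06-02 23:58:45"]),
    "one": [4, 5],
})
fresh = pd.DataFrame({
    "updated_at": pd.to_datetime(["2019-06-02 23:59:15", "2019-06-02 23:59:00"]),
    "one": [3, 40],
})

result = append_unique(stored, fresh)
print(result)
```

Passing keep="first" instead would preserve the row that was already stored.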

0 answers:

No answers yet