How can I merge multiple (more than 2) CSV files on their common column?

Time: 2019-03-23 14:29:24

Tags: python python-3.x csv

I currently have 50 CSV files whose columns look like this:

gdp1950.csv

id,gdp
a,100
b,200
c,300

gdp1951.csv

id,gdp
a,400
b,500
c,600

...

gdp2000.csv

id,gdp
a,700
b,800
c,900

What I want to do is merge the CSV files above like this:

id,gdp1950,gdp1951,...,gdp2000
a,100,400,...,700
b,200,500,...,800
c,300,600,...,900

The task has to be done with Python in a Jupyter notebook. Any ideas?

2 answers:

Answer 0 (score: 2)

You can use the pandas library, which is well suited to this task:

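from functools import reduce
import pandas as pd
# Rename each file's "gdp" column to "gdp<year>" so the merged columns stay distinct
dfs = [pd.read_csv(f"gdp{i}.csv").rename(columns={"gdp": f"gdp{i}"}) for i in range(1950, 2001)]
df = reduce(lambda df1, df2: pd.merge(left=df1, right=df2, on=["id"], how="inner"), dfs)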
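
As an alternative sketch (not part of the original answer), the frames can be indexed by id and concatenated column-wise with pd.concat instead of being merged pairwise. The in-memory sample data and the helper name load_year are hypothetical stand-ins for the real files; with files on disk you would pass the path to pd.read_csv instead of a StringIO object.

```python
import io
import pandas as pd

# Hypothetical in-memory stand-ins for gdp1950.csv and gdp1951.csv
SAMPLE_FILES = {
    1950: "id,gdp\na,100\nb,200\nc,300\n",
    1951: "id,gdp\na,400\nb,500\nc,600\n",
}

def load_year(year, text):
    """Return one year's gdp column as a Series named gdp<year>, indexed by id."""
    frame = pd.read_csv(io.StringIO(text), index_col="id")
    return frame["gdp"].rename(f"gdp{year}")

# Align every series on the id index; join="inner" keeps only ids present in all files
columns = [load_year(year, text) for year, text in SAMPLE_FILES.items()]
merged = pd.concat(columns, axis=1, join="inner").reset_index()
print(merged)
```

Calling merged.to_csv("gdpTotal.csv", index=False) would then write the wide table without the extra row-index column.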

Answer 1 (score: 0)

You can solve this with vanilla Python, without any third-party libraries or modules:

outputDict = {"id" : []}
for i in range(1950, 2001):
    outputDict["id"].append(f"gdp{i}")
    with open(f"gdp{i}.csv", "r") as file:
        file.readline()    # We don't need that line
        for line in file:
            key, value = line.rstrip("\n").split(",")
            if key in outputDict:
                outputDict[key].append(value)
            else:
                outputDict[key] = [value]

with open("gdpTotal.csv", "w") as output:
     output.write("\n".join(",".join((k, *[i for i in v])) for k, v in outputDict.items()))    # Convert the dictionary of lists into a suitable string for file writing

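The last line, "\n".join(",".join((k, *[i for i in v])) for k, v in outputDict.items()), is equivalent to the following loop (same rows, different process; the loop also ends the file with a newline):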

for k, v in outputDict.items():
    output.write(f"{k},{','.join(v)}\n")
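
If the values might ever contain commas or quotes, splitting on "," by hand would break. A minimal sketch of the same idea using the standard csv module is below; the function name merge_gdp_files and its pattern parameter are hypothetical, and it assumes every file has the id and gdp columns shown in the question.

```python
import csv

def merge_gdp_files(years, out_path, pattern="gdp{}.csv"):
    """Merge per-year files that share an 'id' column into one wide CSV."""
    header = ["id"]
    rows = {}  # id -> list of gdp values, in year order
    for year in years:
        header.append(f"gdp{year}")
        with open(pattern.format(year), newline="") as source:
            # DictReader consumes the header row and handles quoted fields
            for record in csv.DictReader(source):
                rows.setdefault(record["id"], []).append(record["gdp"])
    with open(out_path, "w", newline="") as target:
        writer = csv.writer(target)
        writer.writerow(header)
        for key, values in rows.items():
            writer.writerow([key, *values])
```

For the files in the question the call would be merge_gdp_files(range(1950, 2001), "gdpTotal.csv").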

You can also use collections.defaultdict to get rid of the if statement. It is slightly faster as well.

from collections import defaultdict

outputDict = defaultdict(list)
for i in range(1950, 2001):
    outputDict["id"].append(f"gdp{i}")
    with open(f"gdp{i}.csv", "r") as file:
        file.readline()
        for line in file:
            key, value = line.rstrip("\n").split(",")
            outputDict[key].append(value)

with open("gdpTotal.csv", "w") as output:
     output.write("\n".join(",".join((k, *[i for i in v])) for k, v in outputDict.items()))
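
One caveat with appending values in file order: if an id is missing from one year's file, its later values shift one column to the left. A rough sketch of one way to keep the columns aligned is below. It works on already-parsed tables rather than files, and the function name merge_with_gaps and the fill parameter are made up for illustration.

```python
from collections import defaultdict

def merge_with_gaps(tables, fill=""):
    """tables maps column name -> {id: value}; returns rows as lists, header first."""
    names = list(tables)
    # Every id starts with one placeholder per column, so missing years stay blank
    merged = defaultdict(lambda: [fill] * len(names))
    for position, name in enumerate(names):
        for key, value in tables[name].items():
            merged[key][position] = value
    return [["id", *names]] + [[key, *values] for key, values in merged.items()]

rows = merge_with_gaps({
    "gdp1950": {"a": "100", "b": "200"},
    "gdp1951": {"a": "400", "c": "600"},
})
for row in rows:
    print(",".join(row))
```

Here b has no 1951 value and c has no 1950 value, so those cells are written as empty fields instead of shifting.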

Measured with timeit.timeit (with the argument number=100), the first version takes 0.825195171 seconds and the second 0.8229198819999999 seconds. Using pandas instead:

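from functools import reduce
import pandas as pd
# Rename "gdp" to "gdp<year>" so the merged columns do not collide
dfs = [pd.read_csv(f"gdp{i}.csv").rename(columns={"gdp": f"gdp{i}"}) for i in range(1950, 2001)]
df = reduce(lambda df1, df2: pd.merge(left=df1, right=df2, on=["id"], how="inner"), dfs)
df.to_csv("gdpTotal.csv", index=False)    # index=False keeps the row index out of the file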

That takes 32.095738075999996 seconds. It may need fewer lines, but it is much slower.
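
The answer does not show how the timings were taken, so the following is only a sketch of how timeit.timeit with number=100 could be applied. The workload merge_in_memory and the sample tables are hypothetical stand-ins that avoid touching the disk; to reproduce the figures above you would wrap each file-based version in a function and time that instead.

```python
import timeit

def merge_in_memory(tables):
    """Stand-in workload: merge {column: {id: value}} tables into lists keyed by id."""
    merged = {}
    for name, table in tables.items():
        for key, value in table.items():
            merged.setdefault(key, []).append(value)
    return merged

# Hypothetical sample data: 51 yearly tables with three ids each
tables = {f"gdp{year}": {"a": "1", "b": "2", "c": "3"} for year in range(1950, 2001)}

# number=100 calls the function 100 times and returns the total elapsed seconds
elapsed = timeit.timeit(lambda: merge_in_memory(tables), number=100)
print(f"100 runs took {elapsed:.6f} s")
```

With files as small as the ones in the question, the measured time is likely dominated by fixed per-call overhead, so the ratio between the approaches may look different on larger inputs.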