I have .csv files in multiple folders that look like this:
File 1:
Count 2002_Crop_1 2002_Crop_2 Ecoregion
20 Corn Soy 46
15 Barley Oats 46
File 2:
Count 2003_Crop_1 2003_Crop_2 Ecoregion
24 Corn Soy 46
18 Barley Oats 46
For each folder, I want to merge all of the files it contains.
My desired output looks like this:
Crop_1 Crop_2 2002_Count 2003_Count Ecoregion
Corn Soy 20 24 46
Barley Oats 15 18 46
In reality, each folder contains 10 files that need to be merged, not just 2.
I am currently using this code:
import pandas as pd, os
#pathway to all the folders
folders=r'G:\Stefano\CDL_Trajectory\combined_eco_folders'
for folder in os.listdir(folders):
    for f in os.listdir(os.path.join(folders,folder)):
        dfs=pd.read_csv(os.path.join(folders,folder,f)) #turn each file from each folder into a dataframe
        df = reduce(lambda left,right: pd.merge(left,right,on=[dfs[dfs.columns[1]], dfs[dfs.columns[2]]],how='outer'),dfs) #merge all the dataframes based on column location
But this returns:
TypeError: string indices must be integers, not Series
Answer 0 (score: 2)
Use glob.glob to traverse a directory at a fixed depth.
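As a rough illustration of the fixed-depth idea, the sketch below builds a small throwaway directory tree and lists every entry exactly two levels below its root. The folder and file names (eco_46, 2002.csv and so on) and the helper files_at_depth_two are made up for the demonstration; they are not from the question.

```python
import glob
import os
import tempfile


def files_at_depth_two(root):
    """Return the sorted paths that match root/<anything>/<anything>."""
    # One '*' per directory level: the first matches each folder,
    # the second matches each entry inside that folder.
    return sorted(glob.glob(os.path.join(root, '*', '*')))


# Build a small temporary tree (hypothetical names) to try the pattern on.
demo_root = tempfile.mkdtemp()
layout = [('eco_46', '2002.csv'), ('eco_46', '2003.csv'), ('eco_47', '2002.csv')]
for folder, name in layout:
    os.makedirs(os.path.join(demo_root, folder), exist_ok=True)
    with open(os.path.join(demo_root, folder, name), 'w') as fh:
        fh.write('Count\n')

found = files_at_depth_two(demo_root)
print([os.path.relpath(path, demo_root) for path in found])
```

Files sitting directly in the root, or nested deeper than two levels, are not matched by this pattern.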
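If you can help it, avoid calling pd.merge repeatedly. Each call to pd.merge creates a new DataFrame, so all the data in each intermediate result has to be copied into the new DataFrame. Doing this in a loop leads to quadratic copying, which is bad for performance.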
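For comparison, here is a minimal sketch of the repeated-merge pattern that the paragraph above advises against. It assumes the Count columns have already been given year-specific names, and it uses two small in-memory DataFrames built from the sample rows in the question; the helper merge_pairwise is a hypothetical name. It produces a correct result, but each reduce step builds a fresh intermediate DataFrame.

```python
from functools import reduce

import pandas as pd

# Columns shared by every frame; these act as the join keys.
keys = ['Crop_1', 'Crop_2', 'Ecoregion']

# In-memory stand-ins for two files, using the sample rows from the question.
frames = [
    pd.DataFrame({'Crop_1': ['Corn', 'Barley'], 'Crop_2': ['Soy', 'Oats'],
                  'Ecoregion': [46, 46], '2002_Count': [20, 15]}),
    pd.DataFrame({'Crop_1': ['Corn', 'Barley'], 'Crop_2': ['Soy', 'Oats'],
                  'Ecoregion': [46, 46], '2003_Count': [24, 18]}),
]


def merge_pairwise(left, right):
    # Every call returns a brand-new DataFrame holding a copy of all data so far.
    return pd.merge(left, right, on=keys, how='outer')


# reduce needs a list of DataFrames, not a single DataFrame.
merged = reduce(merge_pairwise, frames)
print(merged)
```

With 10 files per folder this performs 9 merges, and the data from the earliest files is copied again at every step.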
If you do some column-name wrangling to change, for example,
['Count', '2002_Crop_1', '2002_Crop_2', 'Ecoregion']
to
['2002_Count', 'Crop_1', 'Crop_2', 'Ecoregion']
then you can use ['Crop_1', 'Crop_2', 'Ecoregion'] as the index for each DataFrame and combine all the DataFrames with a single call to pd.concat:
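import glob
import os

import pandas as pd

folders = r'G:\Stefano\CDL_Trajectory\combined_eco_folders'
dfs = []
# Match every file exactly two levels below `folders` (folder/file).
for filename in glob.glob(os.path.join(folders, *['*']*2)):
    df = pd.read_csv(filename, sep=r'\s+')
    # Split e.g. '2002_Crop_1' into ['2002', 'Crop_1']; 'Count' stays ['Count'].
    columns = [col.split('_', 1) for col in df.columns]
    # Take the year prefix from the first column that has one.
    prefix = next(col[0] for col in columns if len(col) > 1)
    # Drop the year prefix from the column names.
    columns = [col[1] if len(col) > 1 else col[0] for col in columns]
    df.columns = columns
    # Index by everything except Count, then label Count with the year.
    df = df.set_index([col for col in df.columns if col != 'Count'])
    df = df.rename(columns={'Count': '{}_Count'.format(prefix)})
    dfs.append(df)
# Align all the frames on their shared index in a single call.
result = pd.concat(dfs, axis=1)
# sort_index replaces sortlevel, which recent pandas versions no longer provide.
result = result.sort_index(axis=1)
result = result.reset_index()
print(result)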
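To try the approach without touching the file system, here is a small sketch that runs the same renaming and pd.concat steps on the two sample files from the question, held in memory as strings. The helper name load and the use of io.StringIO are assumptions made for the demonstration, not part of the code above.

```python
import io

import pandas as pd

# The two sample files from the question, as whitespace-separated text.
file_2002 = """Count 2002_Crop_1 2002_Crop_2 Ecoregion
20 Corn Soy 46
15 Barley Oats 46
"""
file_2003 = """Count 2003_Crop_1 2003_Crop_2 Ecoregion
24 Corn Soy 46
18 Barley Oats 46
"""


def load(text):
    """Parse one file and move the year prefix from the crop columns to Count."""
    df = pd.read_csv(io.StringIO(text), sep=r'\s+')
    parts = [col.split('_', 1) for col in df.columns]
    prefix = next(p[0] for p in parts if len(p) > 1)
    df.columns = [p[1] if len(p) > 1 else p[0] for p in parts]
    df = df.set_index([col for col in df.columns if col != 'Count'])
    return df.rename(columns={'Count': '{}_Count'.format(prefix)})


# One concat call aligns both frames on (Crop_1, Crop_2, Ecoregion).
result = pd.concat([load(file_2002), load(file_2003)], axis=1).reset_index()
print(result)
```

After reset_index the index columns come first, so Ecoregion appears before the count columns rather than last; selecting the columns in the preferred order, for example result[['Crop_1', 'Crop_2', '2002_Count', '2003_Count', 'Ecoregion']], would match the layout asked for in the question.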