I have multiple *.csv files that I am collecting for data analysis.
import csv
import glob
import os
import pandas as pd
### Tells python where to look for the *.csv files we want to combine.
mycsvdir1 = r'C:\Users\RDEL1LCH\Documents\QuadcamROI\LWIR'
mycsvdir2 = r'C:\Users\RDEL1LCH\Documents\QuadcamROI\Manta01'
mycsvdir3 = r'C:\Users\RDEL1LCH\Documents\QuadcamROI\SWIR'
mycsvdir4 = r'C:\Users\RDEL1LCH\Documents\QuadcamROI\LWIR2'
mycsvdir5 = r'C:\Users\RDEL1LCH\Documents\QuadcamROI\Manta012'
mycsvdir6 = r'C:\Users\RDEL1LCH\Documents\QuadcamROI\SWIR2'
#### Creates lists of all *.csv files to be combined
thelist = glob.glob(os.path.join(mycsvdir1, '*.csv')) + \
          glob.glob(os.path.join(mycsvdir2, '*.csv')) + \
          glob.glob(os.path.join(mycsvdir3, '*.csv')) + \
          glob.glob(os.path.join(mycsvdir4, '*.csv')) + \
          glob.glob(os.path.join(mycsvdir5, '*.csv')) + \
          glob.glob(os.path.join(mycsvdir6, '*.csv'))
#### Reads each *.csv file with a standard header row for each dataframe
#### so they can be concatenated later
dataframe = []
for csvfile in thelist:
    df = pd.read_csv(csvfile, names=['a','b','c','d','e',
                                     'f','g','h','i','j',
                                     'k','l','m','n','o',
                                     'p','q','r','s'], header=0)
    dataframe.append(df)
#### Takes the individual dataframes and concatenates them into one large *.csv
combined = pd.concat(dataframe, ignore_index = True)
combined.to_csv('combined.csv', index = False)
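This works as expected, but I need to be able to keep track of where each row came from. In the individual *.csv files, every row has 1, 2, 3, or 4 in the first column, but I would like to prepend an L, M, or H to that value, depending on which subdirectory the *.csv file came from. So, in the combined file, each data row would have one of L1, L2, L3, L4, M1, M2, M3, M4, H1, H2, H3, or H4 in the first column.
What I have done in the past is split the read commands up by subdirectory and then edit each result accordingly. Is there a way to do this on the fly with my combined read command, or is splitting the read commands the best strategy?
Edit:
Here is what I came up with based on the first answer: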
import csv
import glob
import os
import numpy as np
import pandas as pd
### Tells python where to look for the *.csv files we want to combine.
mycsvdir1 = r'C:\Users\RDEL1LCH\Documents\QuadcamROI\LWIR'
mycsvdir2 = r'C:\Users\RDEL1LCH\Documents\QuadcamROI\Manta01'
mycsvdir3 = r'C:\Users\RDEL1LCH\Documents\QuadcamROI\SWIR'
mycsvdir4 = r'C:\Users\RDEL1LCH\Documents\QuadcamROI\LWIR2'
mycsvdir5 = r'C:\Users\RDEL1LCH\Documents\QuadcamROI\Manta012'
mycsvdir6 = r'C:\Users\RDEL1LCH\Documents\QuadcamROI\SWIR2'
alldirs = pd.DataFrame({
'letter': ['L', 'M', 'H','L', 'M', 'H'], # duplicates are OK
'csv': [glob.glob(os.path.join(d, '*.csv')) for d in [mycsvdir1, \
mycsvdir2, mycsvdir3, mycsvdir4, mycsvdir5, mycsvdir6]]
})
# build the list of letters and CSV files
letters = np.repeat(alldirs['letter'], alldirs['csv'].apply(len))
thelist = np.concatenate(alldirs['csv'])
### Reads each *.csv file with a standard header row for each dataframe
### so they can be concatenated later
dataframe = []
for letter, csvfile in pd.Series(thelist, letters).items():
    df = pd.read_csv(csvfile, names=['a','b','c','d','e',
                                     'f','g','h','i','j',
                                     'k','l','m','n','o',
                                     'p','q','r','s'], header=0)
    dataframe.append(df)
### Concatenates dataframes into one large *.csv
combined = pd.concat(dataframe, ignore_index = True)
combined.to_csv('combined.csv', index = False)
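But the output is unchanged. The first column of each row still shows 1, 2, 3, or 4. I think the problem is in my pd.read_csv call, but I am not sure how to fix it.
Answer 0 (score: 1)
You can use a DataFrame itself to map each letter to its CSV files: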
alldirs = pd.DataFrame({
'letter': ['L', 'M', 'L'], # duplicates are OK
'csv': [glob.glob(os.path.join(d, '*.csv')) for d in [mycsvdir1, mycsvdir2, mycsvdir3]]
})
# build the list of letters and CSV files
letters = np.repeat(alldirs['letter'], alldirs['csv'].apply(len))
thelist = np.concatenate(alldirs['csv'])
# read each CSV file and prepend the letter to the first column
for letter, csvfile in pd.Series(thelist, letters).items():
    df = pd.read_csv(...)
    # column 'a' holds integers (1-4), so convert it to str first;
    # adding a str to an int column raises a TypeError
    df['a'] = letter + df['a'].astype(str)
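As a rough end-to-end sketch of the same idea without NumPy, the directory-to-letter mapping can be kept in a plain dict and the prefix applied right after each file is read. The function name combine_with_prefix and its parameters are hypothetical, not part of pandas, and the sketch assumes the first column holds values that can be converted to strings.

```python
import glob
import os

import pandas as pd


def combine_with_prefix(dir_to_letter, names):
    """Read every *.csv in each directory and prepend that directory's
    letter to the first column (hypothetical helper, not a pandas API)."""
    frames = []
    first = names[0]
    for directory, letter in dir_to_letter.items():
        # sorted() only makes the file order reproducible
        for csvfile in sorted(glob.glob(os.path.join(directory, '*.csv'))):
            df = pd.read_csv(csvfile, names=names, header=0)
            # convert to str so that e.g. 'L' + 1 becomes 'L1'
            df[first] = letter + df[first].astype(str)
            frames.append(df)
    # pd.concat raises ValueError if no CSV files were found at all
    return pd.concat(frames, ignore_index=True)
```

Calling it with a dict that maps mycsvdir1 through mycsvdir6 to 'L', 'M', 'H', 'L', 'M', 'H' and the 19 column names from the question should return the combined frame, which can then be written with to_csv('combined.csv', index=False) as before. Because the letter is attached inside the loop, the read commands do not need to be split up by subdirectory.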