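I have an output derived from a directory structure that contains many CSV files. The headers of these files were created by hand and their columns appear in arbitrary order. I need to group together all the files that have similar headers.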
/A/B/C/D~b1.csv.0 Delim:,
"First Name" "Last Name" Company EMAIL Phone Fax "SIC CODE"
/A/B/C/D~b2.csv.0 Delim:,
"First Name" "Last Name" Phone Fax "SIC CODE" Company EMAIL
/A/B/C/D~b3.csv.0 Delim:,
"First Name" "Last Name" Company EMAIL Fax "SIC CODE" Phone
/A/B/C/D~b4.csv.0 Delim:,
"First Name" "Last Name" Company EMAIL Phone Fax "SIC CODE"
/A/B/C/D~c1.csv.0 Delim:,
"Business Type" "Main Markets" Establised "No Of Employees" Category "Company Name" "Contact Person" Designation Address "Pin Code" "Telephone no" "Fax No" Country Website Email
/A/B/C/D~c2.csv.0 Delim:,
"Business Type" "Main Markets" Establised "No Of Employees" Category "Company Name" "Contact Person" Designation Address "Pin Code" "Telephone no" "Fax No" Country Website Email
/A/B/C/D~c3.csv.0 Delim:,
"Business Type" "Main Markets" Country Website Email Establised "No Of Employees" Category "Company Name" "Contact Person" Designation Address "Pin Code" "Telephone no" "Fax No"
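The first part, /A/B/C/D, is the directory structure, followed by a ~ and the file name, and then the delimiter needed to parse the file (Delim:,). The next line is the header taken from the file, for example "First Name" "Last Name" Company EMAIL Phone Fax "SIC CODE".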
I tried to write some sample code that groups similar headers together, shown below. I know this code does not work:
>>> li = [('abc', set(['a', 'c', 'b'])), ('def', set(['e', 'd', 'f'])), ('ghi', set(['i', 'h', 'g'])), ('jkl', set(['k', 'j', 'l'])), ('mno', set(['m', 'o', 'n'])), ('pqr', set(['q', 'p', 'r'])), ('stu', set(['s', 'u', 't'])), ('vwx', set(['x', 'w', 'v'])), ('ABC', set(['a', 'c', 'b'])), ('DEF', set(['e', 'd', 'f'])), ('GHI', set(['i', 'h', 'g'])), ('JKL', set(['k', 'j', 'l'])), ('MNO', set(['m', 'o', 'n'])), ('PQR', set(['q', 'p', 'r'])), ('STU', set(['s', 'u', 't'])), ('VWX', set(['x', 'w', 'v']))]
>>> from itertools import groupby
>>> for key, group in groupby(li, lambda x: x[1]):
...     for l in group:
...         print "%s %s." % (l[1], l[0])
How can I group these sets together? Any help in working out how to group files with similar headers is appreciated.
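For reference, `itertools.groupby` only merges adjacent items whose keys are equal, so on an unsorted list like the one above every item ends up in its own group. A minimal Python 3 sketch of one alternative, collecting the names in a dictionary keyed by `frozenset`, is shown below. The shortened list and the helper name `group_by_set` are illustrative assumptions, not part of the original attempt.

```python
from collections import defaultdict

# Shortened version of the sample list from the question.
li = [
    ('abc', {'a', 'c', 'b'}),
    ('def', {'e', 'd', 'f'}),
    ('ABC', {'a', 'c', 'b'}),
    ('DEF', {'e', 'd', 'f'}),
]


def group_by_set(pairs):
    """Collect the names that share the same set of items (hypothetical helper)."""
    groups = defaultdict(list)
    for name, items in pairs:
        # A plain set is unhashable; a frozenset can be used as a dict key.
        groups[frozenset(items)].append(name)
    return groups


grouped = group_by_set(li)
for items, names in grouped.items():
    print(sorted(items), names)
```

Unlike `groupby`, this does not depend on the order of the input, because every item is looked up by its key rather than compared with its neighbour.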
Answer 0 (score: 1)
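The following approach takes each CSV header and converts it into a list of column entries. These are sorted and converted into a tuple, which is then used as the key of a defaultdict. Each matching entry is appended to the list for that key, together with its original column ordering.
The result is a dictionary that groups together the CSV files containing the same column entries. If column entries should be matched case-insensitively, the tuple entries can be converted to lowercase before being used as the key.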
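The key-building step, including the optional lowercase conversion, can be sketched on its own roughly as follows. This is a Python 3 sketch under assumptions: the function name `header_key`, its `ignore_case` parameter, and the all-lowercase `x1` header are hypothetical and exist only to illustrate the idea.

```python
import csv
from collections import defaultdict
from io import StringIO


def header_key(header, ignore_case=False):
    """Return a sorted tuple of column names for a space-separated header line."""
    # csv.reader keeps quoted names such as "First Name" together.
    columns = next(csv.reader(StringIO(header), delimiter=' ', skipinitialspace=True))
    if ignore_case:
        columns = [column.lower() for column in columns]
    return tuple(sorted(columns))


headers = {
    'b1': '"First Name" "Last Name" Company EMAIL Phone Fax "SIC CODE"',
    'b2': '"First Name" "Last Name" Phone Fax "SIC CODE" Company EMAIL',
    # Hypothetical header that differs from b1 only in letter case.
    'x1': '"first name" "last name" company email phone fax "sic code"',
}

groups = defaultdict(list)
for name, header in headers.items():
    groups[header_key(header, ignore_case=True)].append(name)

for columns, names in groups.items():
    print(columns, names)
```

With `ignore_case=False` the hypothetical `x1` header would get its own key, because the tuple comparison is case-sensitive.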
The script below builds and prints these groupings for your data:
from collections import defaultdict
import csv
from StringIO import StringIO
csv_groups = defaultdict(list)
entries = [
["/A/B/C/D~b1.csv.0", "Delim:,", '"First Name" "Last Name" Company EMAIL Phone Fax "SIC CODE"'],
["/A/B/C/D~b2.csv.0", "Delim:,", '"First Name" "Last Name" Phone Fax "SIC CODE" Company EMAIL'],
["/A/B/C/D~b3.csv.0", "Delim:,", '"First Name" "Last Name" Company EMAIL Fax "SIC CODE" Phone'],
["/A/B/C/D~b4.csv.0", "Delim:,", '"First Name" "Last Name" Company EMAIL Phone Fax "SIC CODE"'],
["/A/B/C/D~c1.csv.0", "Delim:,", '"Business Type" "Main Markets" Establised "No Of Employees" Category "Company Name" "Contact Person" Designation Address "Pin Code" "Telephone no" "Fax No" Country Website Email'],
["/A/B/C/D~c2.csv.0", "Delim:,", '"Business Type" "Main Markets" Establised "No Of Employees" Category "Company Name" "Contact Person" Designation Address "Pin Code" "Telephone no" "Fax No" Country Website Email'],
["/A/B/C/D~c3.csv.0", "Delim:,", '"Business Type" "Main Markets" Country Website Email Establised "No Of Employees" Category "Company Name" "Contact Person" Designation Address "Pin Code" "Telephone no" "Fax No"']
]
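for folder, delim, header in entries:
    # Split the header on spaces (quoted names stay intact), then sort the
    # columns so that files with the same columns in any order share a key.
    cols = tuple(sorted(list(csv.reader(StringIO(header), delimiter=' ', skipinitialspace=True))[0]))
    csv_groups[cols].append((folder, header))

# Print each distinct column set followed by the files that use it.
for csv_type, folders in csv_groups.iteritems():
    print csv_type
    for folder in folders:
        print " ", folder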
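The `entries` list above is typed in by hand. A rough Python 3 sketch of building it from the raw listing instead might look like the following. It assumes the listing strictly alternates between a `<path> Delim:<char>` line and a header line; the names `LISTING`, `parse_listing` and `group_by_columns` are hypothetical.

```python
import csv
from collections import defaultdict
from io import StringIO

# Three of the entries from the question, in the original two-line format.
LISTING = '''\
/A/B/C/D~b1.csv.0 Delim:,
"First Name" "Last Name" Company EMAIL Phone Fax "SIC CODE"
/A/B/C/D~b2.csv.0 Delim:,
"First Name" "Last Name" Phone Fax "SIC CODE" Company EMAIL
/A/B/C/D~c1.csv.0 Delim:,
"Business Type" "Main Markets" Establised "No Of Employees" Category "Company Name" "Contact Person" Designation Address "Pin Code" "Telephone no" "Fax No" Country Website Email
'''


def parse_listing(text):
    """Yield (path, delimiter, header) for each pair of lines in the listing."""
    lines = [line for line in text.splitlines() if line.strip()]
    # Assumption: lines come in pairs, an info line followed by a header line.
    for info, header in zip(lines[0::2], lines[1::2]):
        path, _, delim = info.rpartition(' Delim:')
        yield path, delim, header


def group_by_columns(entries):
    """Group file paths by their sorted tuple of column names."""
    groups = defaultdict(list)
    for path, delim, header in entries:
        columns = next(csv.reader(StringIO(header), delimiter=' ', skipinitialspace=True))
        groups[tuple(sorted(columns))].append(path)
    return groups


groups = group_by_columns(parse_listing(LISTING))
for columns, paths in groups.items():
    print(columns)
    for path in paths:
        print('   ', path)
```

The delimiter is parsed but not used for grouping here; it is kept so the files in each group can be read later with the right separator.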