Group by set: grouping files with similar headers in Python

Time: 2016-08-30 07:28:29

Tags: python python-2.7

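I have output derived from a directory structure that contains many CSV files. The headers of these files were created manually and their columns appear in random order. I need to group together all the files that have similar headers.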

/A/B/C/D~b1.csv.0   Delim:,
"First Name"    "Last Name" Company EMAIL   Phone   Fax "SIC CODE"
/A/B/C/D~b2.csv.0   Delim:,
"First Name"    "Last Name" Phone   Fax "SIC CODE"  Company EMAIL
/A/B/C/D~b3.csv.0   Delim:,
"First Name"    "Last Name" Company EMAIL   Fax "SIC CODE"  Phone
/A/B/C/D~b4.csv.0   Delim:,
"First Name"    "Last Name" Company EMAIL   Phone   Fax "SIC CODE"
/A/B/C/D~c1.csv.0   Delim:,
"Business Type" "Main Markets"  Establised  "No Of Employees"   Category    "Company Name"  "Contact Person"    Designation Address "Pin Code"  "Telephone no"  "Fax No"    Country Website Email
/A/B/C/D~c2.csv.0   Delim:,
"Business Type" "Main Markets"  Establised  "No Of Employees"   Category    "Company Name"  "Contact Person"    Designation Address "Pin Code"  "Telephone no"  "Fax No"    Country Website Email
/A/B/C/D~c3.csv.0   Delim:,
"Business Type" "Main Markets"  Country Website Email   Establised  "No Of Employees"   Category    "Company Name"  "Contact Person"    Designation Address "Pin Code"  "Telephone no"  "Fax No"

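The first part, /A/B/C/D, is the directory structure. It is followed by a ~ and the file name, and then by the delimiter needed to parse the file (Delim:,). The next line is the header extracted from the file, for example: "First Name" "Last Name" Company EMAIL Phone Fax "SIC CODE"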

I tried to write some sample code to group similar headers together, shown below. I know this code does not work:

>>> li = [('abc', set(['a', 'c', 'b'])), ('def', set(['e', 'd', 'f'])), ('ghi', set(['i', 'h', 'g'])), ('jkl', set(['k', 'j', 'l'])), ('mno', set(['m', 'o', 'n'])), ('pqr', set(['q', 'p', 'r'])), ('stu', set(['s', 'u', 't'])), ('vwx', set(['x', 'w', 'v'])), ('ABC', set(['a', 'c', 'b'])), ('DEF', set(['e', 'd', 'f'])), ('GHI', set(['i', 'h', 'g'])), ('JKL', set(['k', 'j', 'l'])), ('MNO', set(['m', 'o', 'n'])), ('PQR', set(['q', 'p', 'r'])), ('STU', set(['s', 'u', 't'])), ('VWX', set(['x', 'w', 'v']))]
>>> for key, group in groupby(li, lambda x: x[1]):
...     for l in group:
...         print "%s %s." % (l[1], l[0])

How can I group these sets together? Any help in working out how to group files with similar headers is appreciated.

1 Answer:

Answer 0: (score: 1)

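The following approach takes each CSV header and converts it into a list of column entries. The entries are sorted and converted to a tuple, which is then used as the key of a defaultdict. Each matching file is appended to that key's list, together with its original column ordering.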

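The result is a dictionary that groups together the CSV files containing the same column entries. If column names should be compared case-insensitively, the tuple entries can be converted to lowercase before being used as the key.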
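As a minimal sketch of the key normalisation described above, including the lowercase step: this is written for Python 3 (the question targets Python 2.7), the helper name `header_key` and the shortened `samples` list are hypothetical, and `shlex.split` is swapped in for `csv.reader` as an assumption about how to split a space-separated header with quoted names.

```python
import shlex
from collections import defaultdict


def header_key(header, ignore_case=True):
    """Return an order-independent key for a space-separated header line."""
    # shlex.split honours double quotes, so "First Name" stays one column,
    # and it treats a run of whitespace as a single separator.
    cols = shlex.split(header)
    if ignore_case:
        cols = [col.lower() for col in cols]
    # Sorting removes the dependence on column order; a tuple is hashable.
    return tuple(sorted(cols))


# Hypothetical, shortened sample data for illustration only.
samples = [
    ("b1.csv.0", '"First Name"    "Last Name" Company EMAIL'),
    ("b2.csv.0", 'company email "first name" "last name"'),
    ("c1.csv.0", '"Business Type" Country'),
]

groups = defaultdict(list)
for name, header in samples:
    groups[header_key(header)].append(name)

for key, names in groups.items():
    print(key, names)
```

With `ignore_case=True` the first two sample headers fall into the same group even though they differ in both column order and letter case; passing `ignore_case=False` keeps them apart.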

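Applied to your data, the following Python 2.7 code prints each group of files under its sorted tuple of column names: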

from collections import defaultdict
import csv
from StringIO import StringIO

csv_groups = defaultdict(list)

entries = [
    ["/A/B/C/D~b1.csv.0", "Delim:,", '"First Name"    "Last Name" Company EMAIL   Phone   Fax "SIC CODE"'],
    ["/A/B/C/D~b2.csv.0", "Delim:,", '"First Name"    "Last Name" Phone   Fax "SIC CODE"  Company EMAIL'],
    ["/A/B/C/D~b3.csv.0", "Delim:,", '"First Name"    "Last Name" Company EMAIL   Fax "SIC CODE"  Phone'],
    ["/A/B/C/D~b4.csv.0", "Delim:,", '"First Name"    "Last Name" Company EMAIL   Phone   Fax "SIC CODE"'],
    ["/A/B/C/D~c1.csv.0", "Delim:,", '"Business Type" "Main Markets"  Establised  "No Of Employees"   Category    "Company Name"  "Contact Person"    Designation Address "Pin Code"  "Telephone no"  "Fax No"    Country Website Email'],
    ["/A/B/C/D~c2.csv.0", "Delim:,", '"Business Type" "Main Markets"  Establised  "No Of Employees"   Category    "Company Name"  "Contact Person"    Designation Address "Pin Code"  "Telephone no"  "Fax No"    Country Website Email'],
    ["/A/B/C/D~c3.csv.0", "Delim:,", '"Business Type" "Main Markets"  Country Website Email   Establised  "No Of Employees"   Category    "Company Name"  "Contact Person"    Designation Address "Pin Code"  "Telephone no"  "Fax No"']
]

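# Build an order-independent key for each header: split it into columns,
# sort them, and freeze the result as a tuple so it can be a dict key.
for folder, delim, header in entries:
    cols = tuple(sorted(list(csv.reader(StringIO(header), delimiter=' ', skipinitialspace=True))[0]))
    csv_groups[cols].append((folder, header))

# Print each group: the sorted column tuple, then every (path, original header) pair.
for csv_type, folders in csv_groups.iteritems():
    print csv_type

    for folder in folders:
        print "  ", folder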
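
As for the `groupby` attempt in the question: `itertools.groupby` only merges consecutive items that share a key, so the list has to be sorted by that same key first. The following is a minimal sketch under stated assumptions: it is written for Python 3, it uses a shortened version of the question's `li`, and the helper `set_key` is a hypothetical name that turns each set into a sorted tuple so that it can be sorted and compared.

```python
from itertools import groupby

# Shortened version of the list from the question.
li = [
    ("abc", {"a", "b", "c"}),
    ("def", {"d", "e", "f"}),
    ("ABC", {"a", "b", "c"}),
    ("DEF", {"d", "e", "f"}),
]


def set_key(item):
    # For sets, < means "proper subset", which is not a total order,
    # so sort by a sorted tuple of the elements instead.
    return tuple(sorted(item[1]))


# Sort first so that equal keys become adjacent, then group.
grouped = {
    key: [name for name, _ in group]
    for key, group in groupby(sorted(li, key=set_key), key=set_key)
}

for key, names in grouped.items():
    print(key, names)
```

The defaultdict approach above avoids the sort entirely, which is usually simpler when the groups only need to be collected rather than streamed in order.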