Merging two CSV columns and matching them

Date: 2018-01-03 14:52:54

Tags: python csv

I have a CSV with three main columns that I need to populate.

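One is the product name, called "Material" (spelled "Martial" in the sample data); one is the group name, called "Serial"; the last is "Related", which matches Material entries by Serial.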

Currently the CSV looks like this (as an example; the real file has more fields and different data):

Martial | Serial     | Related
ExOne   | GroupOne   | 
ExTwo   | GroupOne   |
ExThree | GroupOne   |
ExFour  | GroupTwo   |
ExFive  | GroupTwo   |
ExSix   | GroupThree |

I need to match each Material with every Material that shares its Serial, limited to five entries and separated by "///".

The expected result should look like this:

Martial | Serial     | Related
ExOne   | GroupOne   | ExOne///ExTwo///ExThree
ExTwo   | GroupOne   | ExOne///ExTwo///ExThree
ExThree | GroupOne   | ExOne///ExTwo///ExThree
ExFour  | GroupTwo   | ExFour///ExFive 
ExFive  | GroupTwo   | ExFour///ExFive
ExSix   | GroupThree | ExSix   

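This is my first attempt at Python, and the code I have tried so far only covers part of what I described. I am building the code bit by bit; the first step (goal) is to match the Serial groups and list all the Material items in each, for example: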

GroupOne
ExOne
ExTwo
ExThree

GroupTwo
ExFour
ExFive

GroupThree
ExSix

From there I can handle the individual cases and combine the items as needed (for example, when a group has more than five).

import csv
import sys  

with open('EGLOINDOORCSV.csv') as csvfile:
    readCSV = csv.reader(csvfile, delimiter=',')
    Materials = []
    Serials = []
    for row in readCSV:
        Material = row[0]
        Serial = row[4]

        Materials.append(Material)
        Serials.append(Serial)

        if Serial == Serial:
            print(Serial)
            print(Material, end = "///")
            print("\n")
            break 

    print("Done")

3 answers:

Answer 0 (score: 2)

First, let's recreate a sample file:

data = '''\
Martial|Serial|Related
ExOne|GroupOne|
ExTwo|GroupOne|
ExThree|GroupOne|
ExFour|GroupTwo|
ExFive|GroupTwo|
ExSix|GroupThree|'''

with open('test.csv', 'w') as f:
    f.write(data)

Now the actual code, using Pandas (Pandas ships with the Anaconda distribution). Without Anaconda, install it with pip install pandas.

import pandas as pd

df = pd.read_csv('test.csv', sep='|')

df['Related'] = df['Serial'].map(df.groupby('Serial')['Martial']
                .apply(lambda x: '///'.join(x)))

df.to_csv('output.csv', index=False)

Result:

   Martial      Serial                  Related
0    ExOne    GroupOne  ExOne///ExTwo///ExThree
1    ExTwo    GroupOne  ExOne///ExTwo///ExThree
2  ExThree    GroupOne  ExOne///ExTwo///ExThree
3   ExFour    GroupTwo          ExFour///ExFive
4   ExFive    GroupTwo          ExFour///ExFive
5    ExSix  GroupThree                    ExSix

Answer 1 (score: 1)

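Here is an approach using the built-in itertools module, so you don't need to install any extra packages. It shows how to write this the Pythonic way, with a dictionary and comprehensions.

Step by step: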

# Read the whole file at once.
import csv
from itertools import groupby

with open('EGLOINDOORCSV.csv') as csvfile:
    l = [r for r in csv.reader(csvfile, delimiter=',')][1:]  # skip header

# groupby only groups adjacent rows, so sort by the second field (Serial) first.
key = lambda x: x[1]
l = sorted(l, key=key)

# Group into an auxiliary dictionary: Serial -> joined names.
d = {k: "///".join(x[0] for x in g) for k, g in groupby(l, key)}

# Fill the third column from the auxiliary dictionary.
for x in l:
    x[2] = d[x[1]]

Et voilà!

#this is the content of l, ready to go back to a new csv
[
 ['ExOne', 'GroupOne', 'ExOne///ExTwo///ExThree'],
 ['ExTwo', 'GroupOne', 'ExOne///ExTwo///ExThree'],
 ['ExThree', 'GroupOne', 'ExOne///ExTwo///ExThree'],
 ['ExSix', 'GroupThree', 'ExSix'],
 ['ExFour', 'GroupTwo', 'ExFour///ExFive'],
 ['ExFive', 'GroupTwo', 'ExFour///ExFive'],
]
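
The grouping above joins every item in a group, whereas the question asks for at most five. Below is a minimal sketch of one way to apply that cap with itertools.islice and then write the rows back out. The inline sample data, the MAX_RELATED constant, the serial_of helper and the in-memory output buffer are assumptions for illustration, not part of the original answer.

```python
import csv
import io
from itertools import groupby, islice

MAX_RELATED = 5  # assumed cap, taken from the question's "limited to five"

# Inline sample standing in for EGLOINDOORCSV.csv (assumption for the demo).
sample = """Martial,Serial,Related
ExOne,GroupOne,
ExTwo,GroupOne,
ExThree,GroupOne,
ExFour,GroupTwo,
ExFive,GroupTwo,
ExSix,GroupThree,
"""

reader = csv.reader(io.StringIO(sample))
header = next(reader)  # keep the header so it can be written back
rows = list(reader)


def serial_of(row):
    """Hypothetical helper: the Serial is the second field."""
    return row[1]


# groupby only merges adjacent rows, so sort by Serial first.
# list.sort is stable, so file order is kept inside each group.
rows.sort(key=serial_of)

# Keep at most MAX_RELATED names per group.
related = {
    serial: "///".join(row[0] for row in islice(group, MAX_RELATED))
    for serial, group in groupby(rows, key=serial_of)
}

for row in rows:
    row[2] = related[row[1]]

# Write to an in-memory buffer; for a real file, use
# open('output.csv', 'w', newline='') instead.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(header)
writer.writerows(rows)
print(out.getvalue())
```

Note that the rows come out ordered by Serial rather than in the original file order, just as in the listing above.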

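Disclaimer: this is a complete solution, but keep in mind that pandas is your friend for data handling. Remember to install it, and switch to the pandas solution if you need to manage large amounts of data.

Original data:

$ cat EGLOINDOORCSV.csv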
Martial,Serial,Related
ExOne,GroupOne,
ExTwo,GroupOne,
ExThree,GroupOne,
ExFour,GroupTwo,
ExFive,GroupTwo,
ExSix,GroupThree,

Answer 2 (score: 1)

My approach is to read the CSV twice. In the first pass I gather the related information; in the second pass I print the output:

import csv

# Pass 1: gather related materials
with open('EGLOINDOORCSV.csv') as csvfile:
    reader = csv.reader(csvfile)
    related = {}
    for row in reader:
        material = row[0]
        serial = row[1]
        related.setdefault(serial, set()).add(material)
# print(related)  # for debugging

# Pass 2: print
with open('EGLOINDOORCSV.csv') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        material = row[0]
        serial = row[1]
        print('%s | %s | %s' % (material, serial, '///'.join(sorted(related[serial]))))

Output:

ExOne | GroupOne | ExOne///ExThree///ExTwo
ExTwo | GroupOne | ExOne///ExThree///ExTwo
ExThree | GroupOne | ExOne///ExThree///ExTwo
ExFour | GroupTwo | ExFive///ExFour
ExFive | GroupTwo | ExFive///ExFour
ExSix | GroupThree | ExSix
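
Because the sets are sorted before joining, the names come out alphabetically (ExOne///ExThree///ExTwo) rather than in file order as in the question's expected result. Below is a sketch of a variant that keeps first-seen order by collecting into lists, and that also applies the cap of five from the question. The helper name build_related, its limit parameter and the inline rows are hypothetical, introduced only for this example.

```python
def build_related(rows, limit=5):
    """Map each serial to its materials in first-seen order, capped at `limit`."""
    related = {}
    for material, serial in rows:
        names = related.setdefault(serial, [])
        # Skip duplicates and stop adding once the cap is reached.
        if material not in names and len(names) < limit:
            names.append(material)
    return related


# Hypothetical (material, serial) pairs mirroring the sample CSV.
rows = [
    ("ExOne", "GroupOne"),
    ("ExTwo", "GroupOne"),
    ("ExThree", "GroupOne"),
    ("ExFour", "GroupTwo"),
    ("ExFive", "GroupTwo"),
    ("ExSix", "GroupThree"),
]

related = build_related(rows)
for material, serial in rows:
    print("%s | %s | %s" % (material, serial, "///".join(related[serial])))
```

A list makes the duplicate check linear in the group size, which is negligible when groups are capped at five.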

Notes

  • I assume your CSV file has no header. If it does, you will need to skip it:

    reader = csv.reader(csvfile)
    next(reader)  # Skip the header, then move on
    
  • Based on the CSV you provided, I assign row[0] to material and row[1] to serial; adjust the index numbers to match your file.

About the related dictionary

This dictionary is where I keep the relationships; it looks like this:

{
    "GroupTwo": set(["ExFour", "ExFive"]),
    "GroupOne": set(["ExOne", "ExThree", "ExTwo"]),
    "GroupThree": set(["ExSix"])
}

In my code, the statement:

    related.setdefault(serial, set()).add(material)

is shorthand for:

    if serial not in related:
        related[serial] = set()
    related[serial].add(material)
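
The same accumulation can also be written with collections.defaultdict from the standard library, which creates the empty set the first time a key is accessed. Below is a short sketch, using made-up sample pairs, that checks both spellings build the same dictionary:

```python
from collections import defaultdict

# Made-up (material, serial) pairs for illustration.
pairs = [("ExOne", "GroupOne"), ("ExTwo", "GroupOne"), ("ExFour", "GroupTwo")]

# Version from the answer: setdefault inserts an empty set if the key is missing.
with_setdefault = {}
for material, serial in pairs:
    with_setdefault.setdefault(serial, set()).add(material)

# Alternative: defaultdict(set) builds the empty set automatically.
with_defaultdict = defaultdict(set)
for material, serial in pairs:
    with_defaultdict[serial].add(material)

# Both approaches should produce the same mapping.
print(dict(with_defaultdict) == with_setdefault)
```

Which one to use is mostly a matter of taste; setdefault keeps related a plain dict, so a lookup of an unknown serial still raises KeyError.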