Question

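I have a file containing transactions. Each transaction contains person identifiers (sometimes, in fact often, missing) and transaction data. The person identifiers are fname, lname, phone, email and social security number. I want to link each transaction to a unique person. The business rule is: two transactions belong to the same person if fname and lname are identical and at least one of the other three identifiers is identical. As a result, I need two data frames (ultimately two csv files): one with the unique persons, and one that is a copy of the initial data with an additional person ID column.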
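
To make the rule concrete, here is a minimal sketch of it applied to plain dictionaries (this is an illustration, not the code from this question). The names `is_missing`, `same_person` and the sample records `t1`, `t2`, `t3` are hypothetical, and treating `None`, empty strings and NaN as "missing" is an assumption.

```python
import math

# The three identifiers of which at least one must match (from the rule above)
ID_FIELDS = ("phone", "email", "social security number")

def is_missing(value):
    """Assumption: None, empty strings and float NaN all count as missing."""
    if value is None or value == "":
        return True
    return isinstance(value, float) and math.isnan(value)

def same_person(a, b):
    """Same fname and lname, plus at least one shared, non-missing identifier."""
    for field in ("fname", "lname"):
        if is_missing(a.get(field)) or is_missing(b.get(field)):
            return False
        if a[field] != b[field]:
            return False
    return any(
        not is_missing(a.get(f)) and a.get(f) == b.get(f)
        for f in ID_FIELDS
    )

# Hypothetical sample transactions
t1 = {"fname": "Ann", "lname": "Lee", "phone": "555-0100", "email": None}
t2 = {"fname": "Ann", "lname": "Lee", "phone": None, "email": "ann@example.com"}
t3 = {"fname": "Ann", "lname": "Lee", "phone": "555-0100", "email": "ann@example.com"}

print(same_person(t1, t3))  # phone matches
print(same_person(t2, t3))  # email matches
print(same_person(t1, t2))  # no identifier present in both
```

In this sample, `t1` and `t2` do not match each other directly, but each matches `t3`. Whether such a chain should collapse into one person is something the rule as stated leaves open.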

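The code I wrote (below) solves the problem correctly, except that it gets stuck when the file becomes long (I am talking about many thousands of lines). I am almost certain my code is not optimized, and I think I could find a better way using aggregate functions such as groupby() or unique(), which I believe are much faster. But I don't know how.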

import pandas as pd
workDir=r"D:\fichiers\perso\perso\python\unicity\\"


sourceFile='rawdata.csv'
inFrame=pd.read_csv(workDir+sourceFile, sep=";",encoding='ISO-8859-1')
personFrame=pd.DataFrame(columns=('idPerson','fname','lname','email', 'phone','social security number'))
outFrame=pd.DataFrame(columns=inFrame.columns)
idPerson=0
#print(inFrame)


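def samePerson(p1, p2):
  # Same person: identical fname and lname plus at least one shared identifier.
  # Missing values (NaN) never compare equal, so they cannot produce a match.
  if p1['fname']==p2['fname'] and p1['lname']==p2['lname']:
      if p1['email']==p2['email'] or p1['phone']==p2['phone'] or p1['social security number']==p2['social security number']:
        return True
  return False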

def completePerson(old, new):
    #fill in data missing from the old version of the person using the new line
    for theColumn in ('fname','lname','email', 'phone','social security number'):
        if pd.isnull(old[theColumn]) :
            old [theColumn]=new[theColumn]
    return(old)

def processLine(theLine):
  global personFrame
  global idPerson
  global outFrame
  theFlag=0
  for indexPerson, thePerson in personFrame.iterrows(): 
      if theFlag==0:
          if samePerson(theLine,thePerson):
              theLine['idPerson']=thePerson.idPerson
              personFrame.loc[indexPerson]=completePerson(thePerson, theLine)
              theFlag=1
  if theFlag==0:
      theLine['idPerson']=idPerson
      idPerson=idPerson+1
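      # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
      personFrame=pd.concat([personFrame, theLine.to_frame().T])
  outFrame=pd.concat([outFrame, theLine.to_frame().T])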


def processdf():
    inFrame.apply(processLine, axis=1)
    # index must be the boolean False; the string 'false' is truthy and still writes the index
    with open(workDir+'persons.csv','w', encoding='ISO-8859-1') as f:
        personFrame.to_csv(f, index=False)
    with open(workDir+'transactionss.csv','w', encoding='ISO-8859-1') as f:
        outFrame.to_csv(f, index=False)

processdf()
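
The loop above compares every transaction against every known person, so the work grows roughly with transactions times persons. One possible direction, shown here only as a hedged sketch and not as a tested replacement, is to index transactions by `(fname, lname, identifier, value)` in a dictionary and merge groups with a union-find structure. This is a swapped-in technique, not groupby() itself. The names `assign_person_ids` and `is_missing` and the sample `records` are hypothetical. The sketch also assumes that links are transitive (two transactions that each match a third one are the same person), which may differ in edge cases from what the loop above produces.

```python
import math

ID_FIELDS = ("phone", "email", "social security number")

def is_missing(value):
    """Assumption: None, empty strings and float NaN all count as missing."""
    if value is None or value == "":
        return True
    return isinstance(value, float) and math.isnan(value)

def assign_person_ids(records):
    """Return one person id per record, linking records that share fname,
    lname and at least one identifier, directly or through a chain."""
    parent = list(range(len(records)))  # union-find forest, one node per record

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps trees shallow
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)  # earliest record stays the root

    first_seen = {}  # (fname, lname, field, value) -> index of first such record
    for i, rec in enumerate(records):
        fname, lname = rec.get("fname"), rec.get("lname")
        if is_missing(fname) or is_missing(lname):
            continue  # without a full name the rule can never match
        for field in ID_FIELDS:
            value = rec.get(field)
            if is_missing(value):
                continue
            key = (fname, lname, field, value)
            if key in first_seen:
                union(i, first_seen[key])
            else:
                first_seen[key] = i

    # Number the groups 0, 1, 2, ... in order of first appearance
    ids, labels = [], {}
    for i in range(len(records)):
        ids.append(labels.setdefault(find(i), len(labels)))
    return ids

# Hypothetical sample data
records = [
    {"fname": "Ann", "lname": "Lee", "phone": "555-0100"},
    {"fname": "Ann", "lname": "Lee", "email": "ann@example.com"},
    {"fname": "Bob", "lname": "Ray", "phone": "555-0100"},
    {"fname": "Ann", "lname": "Lee", "phone": "555-0100", "email": "ann@example.com"},
]
print(assign_person_ids(records))  # [0, 0, 1, 0]
```

With pandas, the same function could presumably be fed `inFrame.to_dict("records")` and its result stored in an `idPerson` column; a persons table could then be built with something like `groupby("idPerson").first()`, which takes the first non-null value per column in each group.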

Optimize code to group by unique persons

0 answers: