Question

我有CSV文件中的数据。其中一列列出了人员名称，该列中的所有行都提供了有关该人员的一些描述性属性，直到显示下一个人名。我可以通过LTYPE列判断该行何时具有名称或属性，该列中的N表示该行中的NAME值实际上是名称，该列中的A表示NAME列中的数据是属性。属性是编码的，我有600K行的数据。这是一个例子。数据被分组，每个分组的befinning由RID重置为1表示。

{'LTYPE': 'N', 'RID': '1', 'NAME': 'Jason Smith'}
{'LTYPE': 'A', 'RID': '2', 'NAME': 'DA'}
{'LTYPE': 'A', 'RID': '3', 'NAME': 'B'}
{'LTYPE': 'N', 'RID': '4', 'NAME': 'John Smith'}
{'LTYPE': 'A', 'RID': '5', 'NAME': 'BC'}
{'LTYPE': 'A', 'RID': '6', 'NAME': 'CB'}
{'LTYPE': 'A', 'RID': '7', 'NAME': 'DB'}
{'LTYPE': 'A', 'RID': '8', 'NAME': 'DA'}
{'LTYPE': 'N', 'RID': '9', 'NAME': 'Robert Smith'}
{'LTYPE': 'A', 'RID': '10', 'NAME': 'BC'}
{'LTYPE': 'A', 'RID': '11', 'NAME': 'DB'}
{'LTYPE': 'A', 'RID': '12', 'NAME': 'CB'}
{'LTYPE': 'A', 'RID': '13', 'NAME': 'RB'}
{'LTYPE': 'A', 'RID': '14', 'NAME': 'VC'}
{'LTYPE': 'N', 'RID': '15', 'NAME': 'Harvey Smith'}
{'LTYPE': 'A', 'RID': '16', 'NAME': 'SA'}
{'LTYPE': 'A', 'RID': '17', 'NAME': 'AS'}
{'LTYPE': 'N', 'RID': '18', 'NAME': 'Lukas Smith'}
{'LTYPE': 'A', 'RID': '19', 'NAME': 'BC'}
{'LTYPE': 'A', 'RID': '20', 'NAME': 'AS'}

我想创建以下内容：

{'PERSON_ATTRIBUTES': 'DA B ', 'LTYPE': 'N', 'RID': '1', 'PERSON_NAME': 'Jason Smith', 'NAME': 'Jason Smith'}
{'PERSON_ATTRIBUTES': 'DA B ', 'LTYPE': 'A', 'RID': '2', 'PERSON_NAME': 'Jason Smith', 'NAME': 'DA'}
{'PERSON_ATTRIBUTES': 'DA B ', 'LTYPE': 'A', 'RID': '3', 'PERSON_NAME': 'Jason Smith', 'NAME': 'B'}
{'PERSON_ATTRIBUTES': 'BC CB DB DA ', 'LTYPE': 'N', 'RID': '4', 'PERSON_NAME': 'John Smith', 'NAME': 'John Smith'}
{'PERSON_ATTRIBUTES': 'BC CB DB DA ', 'LTYPE': 'A', 'RID': '5', 'PERSON_NAME': 'John Smith', 'NAME': 'BC'}
{'PERSON_ATTRIBUTES': 'BC CB DB DA ', 'LTYPE': 'A', 'RID': '6', 'PERSON_NAME': 'John Smith', 'NAME': 'CB'}
{'PERSON_ATTRIBUTES': 'BC CB DB DA ', 'LTYPE': 'A', 'RID': '7', 'PERSON_NAME': 'John Smith', 'NAME': 'DB'}
{'PERSON_ATTRIBUTES': 'BC CB DB DA ', 'LTYPE': 'A', 'RID': '8', 'PERSON_NAME': 'John Smith', 'NAME': 'DA'}
{'PERSON_ATTRIBUTES': 'BC DB CB RB VC ', 'LTYPE': 'N', 'RID': '9', 'PERSON_NAME': 'Robert Smith', 'NAME': 'Robert Smith'}
{'PERSON_ATTRIBUTES': 'BC DB CB RB VC ', 'LTYPE': 'A', 'RID': '10', 'PERSON_NAME': 'Robert Smith', 'NAME': 'BC'}
{'PERSON_ATTRIBUTES': 'BC DB CB RB VC ', 'LTYPE': 'A', 'RID': '11', 'PERSON_NAME': 'Robert Smith', 'NAME': 'DB'}
{'PERSON_ATTRIBUTES': 'BC DB CB RB VC ', 'LTYPE': 'A', 'RID': '12', 'PERSON_NAME': 'Robert Smith', 'NAME': 'CB'}
{'PERSON_ATTRIBUTES': 'BC DB CB RB VC ', 'LTYPE': 'A', 'RID': '13', 'PERSON_NAME': 'Robert Smith', 'NAME': 'RB'}
{'PERSON_ATTRIBUTES': 'BC DB CB RB VC ', 'LTYPE': 'A', 'RID': '14', 'PERSON_NAME': 'Robert Smith', 'NAME': 'VC'}
{'PERSON_ATTRIBUTES': 'SA AS ', 'LTYPE': 'N', 'RID': '15', 'PERSON_NAME': 'Harvey Smith', 'NAME': 'Harvey Smith'}
{'PERSON_ATTRIBUTES': 'SA AS ', 'LTYPE': 'A', 'RID': '16', 'PERSON_NAME': 'Harvey Smith', 'NAME': 'SA'}
{'PERSON_ATTRIBUTES': 'SA AS ', 'LTYPE': 'A', 'RID': '17', 'PERSON_NAME': 'Harvey Smith', 'NAME': 'AS'}
{'PERSON_ATTRIBUTES': 'BC AS ', 'LTYPE': 'N', 'RID': '18', 'PERSON_NAME': 'Lukas Smith', 'NAME': 'Lukas Smith'}
{'PERSON_ATTRIBUTES': 'BC AS ', 'LTYPE': 'A', 'RID': '19', 'PERSON_NAME': 'Lukas Smith', 'NAME': 'BC'}
{'PERSON_ATTRIBUTES': 'BC AS ', 'LTYPE': 'A', 'RID': '20', 'PERSON_NAME': 'Lukas Smith', 'NAME': 'AS'}

我开始获取LTYPE的指数位置

nameIndex=[]
attributeIndex=[]
for line in thedata:
    if line['LTYPE']=='N':
        nameIndex.append(int(line["RID"])-1)
    if line['LTYPE']=='A':
        attributeIndex.append(int(line["RID"])-1)

因此，我将每个分类为一个列表中的名称的行的列表索引和分类为另一个列表中的属性的每个行的列表索引。然后很容易将名称附加到每个观察，如下所示

for counter, row in enumerate(thedata):
    if counter in nameIndex:
        row['PERSON_NAME']=row['NAME']
        person_NAME=row['NAME']
    if counter not in nameIndex:
        row['PERSON_NAME']=person_NAME

我正在努力确定并为每个人分配属性列表。

首先，我需要结合属于一起的属性，所以我这样做了：

 newAttribute=[]
 for counter, row in enumerate(thedata):
     if counter in attributeIndex:
         tempAttribute=tempAttribute+' '+row['NAME']

     if counter not in attributeIndex:
         if counter==0:
             tempAttribute=""
             pass
         if counter!=0:
             newAttribute.append(tempAttribute.lstrip())
             tempAttribute=""

我的方法的一个问题是，我仍然必须将最后一个组添加到newAttribute列表，因为循环在添加之前完成。因此，要获取分组属性列表，我必须运行

newAttribute.append(tempAttribute)

但即便如此，我似乎找不到一个干净的方法来添加我必须分两步完成的属性。首先，我创建一个字典，其中nameIndex位置作为键，属性作为值

tempDict={}
for each in range(len(nameIndex)):
    tempdict[nameIndex[each]]=newAttribute[each]

我在名称行

上放置属性后循环浏览列表

for counter,row in enumerate(thedata):
    if counter in tempDict:
        thedata[counter]['TA']=tempDict[counter]

然后我再次检查是否存在密钥'TA'并使用存在来设置PERSON_ATTRIBUTE键

for each in thedata:
    if each.has_key('TA'):
        each['PERSON_ATTRIBUTES']=each['TA']
        holdAttribute=each['TA']
    else:
        each['PERSON_ATTRIBUTES']=holdAttribute

必须有一个更清晰的方式来考虑这个，所以我想知道是否有人想指出我可以阅读的一些功能的方向，这将让我清理这个代码。我知道我仍然要放弃'TA'键，但我认为我已经占用了足够的空间。

Answer 1

我建议使用基于itertools.groupby的另一种无索引方法：

import itertools, operator

data = [
{'LTYPE': 'N', 'RID': '1', 'NAME': 'Jason Smith'},
{'LTYPE': 'A', 'RID': '2', 'NAME': 'DA'},
{'LTYPE': 'A', 'RID': '3', 'NAME': 'B'},
{'LTYPE': 'N', 'RID': '4', 'NAME': 'John Smith'},
{'LTYPE': 'A', 'RID': '5', 'NAME': 'BC'},
{'LTYPE': 'A', 'RID': '6', 'NAME': 'CB'},
{'LTYPE': 'A', 'RID': '7', 'NAME': 'DB'},
{'LTYPE': 'A', 'RID': '8', 'NAME': 'DA'},
]

for k, g in itertools.groupby(data, operator.itemgetter('LTYPE')):
  if k=='N':
    person_name_record = next(g)
  else:
    attribute_records = list(g)
    person_attributes = ' '.join(r['NAME'] for r in attribute_records)
    addfields = dict(PERSON_ATTRIBUTES=person_attributes,
                     PERSON_NAME=person_name_record['NAME'])
    person_name_record.update(addfields)
    for r in attribute_records: r.update(addfields)

for r in data: print r

这会为前几个人打印出您想要的结果（并且每个人都会被单独对待，因此对于几十万人来说它应该是相同的;）。

Answer 2

我会把它分成两个任务。

首先，将thedata划分为LTYPE=N行的组和其后的LTYPE=A行。

def group_name_and_attributes(thedata):
    group = []
    for row in thedata:
        if row['LTYPE'] == 'N':
            if group:
                yield group
            group = [row]
        else:
            group.append(row)
    if group:
        yield group

接下来，将每个组分开并收集每个组的总属性;然后很容易根据需要将sum属性添加到每一行。

def join_person_attributes(thedata):
    for group in group_name_and_attributes(thedata):
        attributes = ' '.join(row['NAME'] for row in group if row['LTYPE'] == 'A')
        for row in group:
            new_row = row.copy()
            new_row['PERSON_ATTRIBUTES'] = attributes
            yield new_row

new_data = list(join_person_attributes(thedata))

当然，您可以将其修改为就地行，或者每组只返回一行，或者......

Python使用基于索引的列表

2 个答案: