Question

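I have data imported into a Pandas DataFrame in which the elements of a list are automatically split into new columns. My data was originally stored in .root files, and I am using Uproot to import it into Pandas.

Below is sample data, where the columns physics[0] and physics[1] were originally elements of a list: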

import pandas as pd

data = {'physics[0]': [1,2,3], 'physics[1]': [4,5,6], 'yes': [7,8,9], 'no': [10,11,12]}
df = pd.DataFrame(data)



   physics[0]  physics[1]  yes  no 
0           1           4    7  10  
1           2           5    8  11  
2           3           6    9  12

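I am trying to come up with a technique that detects similar column names and recombines the elements into a list. This is what I have so far: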

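lst = [col for col in df.columns if 'physics' in col]

df['physics'] = df[lst].values.tolist()
# drop the original split columns so that only the combined column remains
df = df.drop(columns=lst)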

    yes  no physics
0    7  10  [1, 4]
1    8  11  [2, 5]
2    9  12  [3, 6]

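This works. However, I won't always know the column names in advance when this happens, so I would like to detect automatically whether names are similar and then apply the list comprehension above.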

Answer 1

You can generalize your approach with a regular expression:

import re
# create dictionary d of all groups of similar columns
multi_cols = filter(lambda x: re.search(r'\[[0-9]+\]$',x),df.columns)
d = {}
for c in multi_cols:
    k = re.sub(r'\[[0-9]+\]$', '' , str(c))
    if k not in d:
        d[k] = []
    d[k].append(c)

# the dictionary will look like this:
print(d)
# {'physics': ['physics[0]', 'physics[1]']}

# use dictionary d to combine all similar columns in each group
for k in d:
    df[k] = df[d[k]].values.tolist()
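As a rough sketch of how the loop above could be packaged for reuse, the helper below groups columns by base name, orders each group by its numeric index, and optionally drops the original split columns. The function name `collapse_indexed_columns` and its `drop` parameter are hypothetical names chosen for this illustration; they are not part of pandas or Uproot.

```python
import re
from collections import defaultdict

import pandas as pd

# Matches names such as 'physics[0]': a base name followed by an integer index in brackets.
INDEXED = re.compile(r'^(?P<base>.+)\[(?P<idx>[0-9]+)\]$')


def collapse_indexed_columns(df, drop=True):
    """Return a copy of df with 'name[i]' columns merged into one list column 'name'.

    Hypothetical helper for illustration only.
    """
    groups = defaultdict(list)
    for col in df.columns:
        m = INDEXED.match(str(col))
        if m:
            groups[m.group('base')].append((int(m.group('idx')), col))

    out = df.copy()
    for base, members in groups.items():
        # Sort by the numeric index so that 'x[10]' comes after 'x[2]'.
        cols = [col for _, col in sorted(members)]
        out[base] = out[cols].values.tolist()
        if drop:
            out = out.drop(columns=cols)
    return out


data = {'physics[0]': [1, 2, 3], 'physics[1]': [4, 5, 6],
        'yes': [7, 8, 9], 'no': [10, 11, 12]}
result = collapse_indexed_columns(pd.DataFrame(data))
print(result)
```

Sorting on the parsed integer rather than on the column name avoids the lexicographic ordering problem once a group has ten or more members.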

Answer 2

It may be worth trying difflib. You could build lists l1 and l2 from the column headers and then use difflib's matching to find the similar names.
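The code that originally accompanied this answer is not available here, so the following is only a minimal sketch of one way the difflib idea could look, using the standard-library function `difflib.get_close_matches`. The `cutoff=0.8` threshold is an arbitrary value chosen for this example, and a single column list is used instead of the l1 and l2 lists mentioned above.

```python
import difflib

columns = ['physics[0]', 'physics[1]', 'yes', 'no']

# For each column, collect the other columns whose names are close to it.
# cutoff=0.8 is an assumed threshold; it would need tuning for real data.
similar = {}
for col in columns:
    others = [c for c in columns if c != col]
    matches = difflib.get_close_matches(col, others, n=len(others), cutoff=0.8)
    if matches:
        similar[col] = matches

print(similar)
```

Fuzzy matching of this kind may also group names that are merely spelled alike, so a pattern-based check is probably safer when the split columns always follow the name[index] form.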

Answer 3

Can we assume that any group of duplicated columns always contains [0]?

If so, something like this:

import pandas as pd

data = {'physics[0]': [1,2,3], 'physics[1]': [4,5,6], 'yes': [7,8,9], 'no': [10,11,12]}
df = pd.DataFrame(data)

# collect the base name of every column that ends with '[0]'
duplicates = set()
for c in df.columns:
    if c.endswith('[0]'):
        duplicates.add(c[:-len('[0]')])

for d in duplicates:
    # match only columns of the form '<base>[...]', not every name that contains the base
    lst = [col for col in df.columns if col.startswith(d + '[') and col.endswith(']')]
    df[d] = df[lst].values.tolist()
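The approach above relies on the assumption that every group includes a [0] column. A small sketch of how that assumption could be checked before combining is shown below; `index_gaps` is a hypothetical helper name, and the 'jets' columns are made-up data for illustration.

```python
import re


def index_gaps(columns):
    """Map each base name to the indices missing from 0..max (hypothetical helper)."""
    pattern = re.compile(r'^(.+)\[([0-9]+)\]$')
    seen = {}
    for col in columns:
        m = pattern.match(col)
        if m:
            seen.setdefault(m.group(1), set()).add(int(m.group(2)))

    gaps = {}
    for base, idx in seen.items():
        missing = sorted(set(range(max(idx) + 1)) - idx)
        if missing:
            gaps[base] = missing
    return gaps


# The sample columns from the question have no gaps.
print(index_gaps(['physics[0]', 'physics[1]', 'yes', 'no']))
# A made-up group that lacks [0] and [2].
print(index_gaps(['jets[1]', 'jets[3]', 'yes']))
```

An empty result suggests that the endswith('[0]') test will find every group; any reported base name would be skipped or incomplete.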

Detect similar Pandas column names and process them
