Question

I have a file containing the following data:

x y
z w
a b
a x
w y

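I want to create a file together with the following replacement dictionary, in which each string gets a unique replacement number determined by the order in which the string first appears in the file, reading left to right and top to bottom (note that this dictionary has to be created; it is not provided):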

{'x':1, 'y':2, 'z':3, 'w':4 , 'a':5, 'b':6}

and the output file would be:

1 2
3 4
5 6
5 1
4 2

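Is there an efficient way to create the processed file and the dictionary using Pandas?

I thought of the following strategy for creating the dictionary: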

import collections

_counter = 0

def counter():
    # return the next unused replacement number
    global _counter
    _counter += 1
    return _counter

replacements_dict = collections.defaultdict(counter)
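
As a minimal sketch of how this counter idea could be applied: looking each token up in the defaultdict assigns the next number the first time the token is seen. The `rows` list is a stand-in for the file contents, and `itertools.count` is swapped in for the global counter; both are assumptions for illustration.

```python
import collections
import itertools

# Stand-in for the file contents (assumption: whitespace-separated tokens).
rows = ["x y", "z w", "a b", "a x", "w y"]

# Each missing key receives the next integer, starting at 1.
next_id = itertools.count(1)
replacements_dict = collections.defaultdict(lambda: next(next_id))

# Looking a token up is enough to register it on its first occurrence.
encoded = [[replacements_dict[token] for token in line.split()] for line in rows]

print(dict(replacements_dict))
print(encoded)
```

This avoids pandas entirely, which may be enough for small files; whether it is fast enough for large ones would need to be measured.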

Answer 1

You can use factorize on the MultiIndex Series created by stack, then use unstack, and finally write the result to a file with to_csv:

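import pandas as pd

# `file` is the path to the input file; the raw string avoids an invalid-escape warning
df = pd.read_csv(file, sep=r"\s+", header=None)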

print (df)
   0  1
0  x  y
1  z  w
2  a  b
3  a  x
4  w  y

s = df.stack()
fact = pd.factorize(s)

# indexing is necessary: fact[1] holds the unique values, fact[0] one code per element
d = dict(zip(fact[1].values[fact[0]], fact[0] + 1))
print (d)
{'x': 1, 'y': 2, 'z': 3, 'w': 4, 'a': 5, 'b': 6}
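
To see why the indexing step is needed, it may help to look at what `pd.factorize` returns: an array of integer codes with one entry per input value, and the unique values in order of first appearance. A small sketch with made-up input:

```python
import pandas as pd

values = pd.Series(["x", "x", "y", "x", "z"])  # made-up input with early repeats
codes, uniques = pd.factorize(values)

print(codes.tolist())  # [0, 0, 1, 0, 2]
print(list(uniques))   # ['x', 'y', 'z']

# uniques[codes] expands the uniques back to one label per input value,
# so labels and codes line up element by element.
d = dict(zip(uniques[codes], codes + 1))
print(d)
```

The two arrays have different lengths whenever a value repeats, which is why the uniques are indexed by the codes before zipping.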

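For the new files:

# values separated by , (header=False so that no header row is written)
pd.Series(d).to_csv('dict.csv', header=False)
# read the Series back from the file and convert it to a dict
# (the squeeze=True argument of read_csv was removed in pandas 2.0)
d = pd.read_csv('dict.csv', index_col=0, header=None).squeeze('columns').to_dict()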
print (d)
{'x': 1, 'y': 2, 'z': 3, 'w': 4, 'a': 5, 'b': 6}

df = pd.Series(fact[0] + 1, index=s.index).unstack()
print (df)

   0  1
0  1  2
1  3  4
2  5  6
3  5  1
4  4  2

df.to_csv('out', index=False, header=False)
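
If the dictionary already exists (for example after reading it back from dict.csv), the same replacement could also be applied by mapping it over the stacked values. This is a sketch rather than part of the answer above; the DataFrame and dictionary are typed in by hand here so that the block runs on its own.

```python
import pandas as pd

df = pd.DataFrame([["x", "y"], ["z", "w"], ["a", "b"], ["a", "x"], ["w", "y"]])
d = {"x": 1, "y": 2, "z": 3, "w": 4, "a": 5, "b": 6}

# Stack to one long Series, map each string to its number, restore the shape.
replaced = df.stack().map(d).unstack()
print(replaced)
```

Any string missing from the dictionary would become NaN under `map`, so this variant assumes the dictionary covers every value in the file.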

Answer 2

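I assume you want a dictionary d in which the value assigned to each key corresponds to the order in which the key first appears, reading row by row: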

d={'col1':['x', 'y', 'a', 'a', 'w'], 'col2':['z','w','b','x','y']}
df=pd.DataFrame(d)

print(df)

Output:

  col1 col2
0    x    z
1    y    w
2    a    b
3    a    x
4    w    y

================================

Using itertools:

import itertools

# flatten the DataFrame row by row into a single list of strings
raw_list = list(itertools.chain.from_iterable(df.values.tolist()))

d = {}
counter = 1
for k in raw_list:
    # number each string the first time it is seen
    if k not in d:
        d[k] = counter
        counter += 1

Then:

print(d)

Output:

{'a': 5, 'b': 6, 'w': 4, 'x': 1, 'y': 3, 'z': 2}

Hope this helps!

=========================================

Using factorize:

s = df.stack()
codes, uniques = pd.factorize(s)
# uniques are in order of first appearance, so number them 1..n
d = {}
for number, key in enumerate(uniques, start=1):
    d[key] = number
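
Since the uniques returned by factorize are already in order of first appearance, numbering them is enough to build the dictionary. The sketch below uses made-up data in which a value repeats immediately, a case where zipping the uniques against the per-element codes would pair the wrong numbers.

```python
import pandas as pd

df = pd.DataFrame({"col1": ["x", "y"], "col2": ["x", "z"]})  # made-up data
s = df.stack()  # row by row: x, x, y, z
codes, uniques = pd.factorize(s)

# Number the uniques 1..n in order of first appearance.
d = {key: number for number, key in enumerate(uniques, start=1)}

# Zipping uniques with the per-element codes misaligns: codes is [0, 0, 1, 2].
misaligned = dict(zip(uniques, codes + 1))

print(d)
print(misaligned)
```

With the data from this answer the two approaches happen to agree, because the first six stacked values are all distinct.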

How to replace column entries in a DataFrame using pandas and create a dictionary of old and new values
