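I have a dataset with 48,000 rows of input text and answers. There are 89 unique answer values. How can I derive answer labels from the text answers, so that the first unique value becomes answer1, the second becomes answer2, and so on up to answer89?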
> x          y                     y_val
> hello      please push button 1  answer1
> what's up  please push button 1  answer1
> be cool    please push button 1  answer1
> smth       please push button 1  answer1
> write num  please push button 1  answer1
> hello      please push button 1  answer1
> what's up  please push button 1  answer1
> be cool    sure                  answer2
> smth       sure                  answer2
> write num  sure                  answer2
> hello      sure                  answer2
> what's up  perfect               answer3
> be cool    perfect               answer3
> smth       call me               answer89
> write num  call me               answer89
================================================ =======================
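I want "please push button 1" to become answer1 and "sure" to become answer2. I have 89 unique values, so I need all of them recoded so that y_values becomes a column containing only answer1 through answer89.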
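Answer 0 (score: 0)
I'm a little confused: do you just want to append a recoded column to the DataFrame that labels your "y" column values as answer1 through answer89?
If so, this code will do that for you: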
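    # 'data' is your pandas DataFrame; column index 1 is the "y" column
    labels = {}   # maps each distinct "y" text to its 'answerN' label
    y_val = []
    for i in range(len(data)):
        value = str(data.iloc[i, 1])
        if value not in labels:
            # first time this text appears: give it the next number
            labels[value] = 'answer' + str(len(labels) + 1)
        y_val.append(labels[value])
    data['y_values'] = y_val
    print(data)

The labels are numbered in the order in which each distinct "y" value first appears, so the rows do not need to be sorted or grouped by "y", and the numbering is not capped at a fixed count, which matters with 89 unique values. This assumes you are fine with that numbering order. Just replace "data" with the name of your pandas DataFrame and make sure the iloc column index points at your "y" column. I'm sure there is a more efficient or more Pythonic way to do this, but this is what I came up with.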
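As one possible more idiomatic variant, the same recoding can be sketched with `pandas.factorize`, which assigns an integer code to each distinct value in order of first appearance. This is only a sketch under assumptions: the toy DataFrame below is modelled on the sample rows in the question, and the column names "x", "y" and "y_val" are taken from its header rather than from the real dataset.

```python
import pandas as pd

# Toy frame modelled on the question's sample (assumed column names);
# the real dataset has 48,000 rows and 89 distinct "y" values.
data = pd.DataFrame({
    "x": ["hello", "what's up", "be cool", "smth", "hello"],
    "y": ["please push button 1", "please push button 1",
          "sure", "perfect", "sure"],
})

# factorize returns integer codes (0, 1, 2, ...) numbered by first
# appearance, plus the distinct values in that same order.
codes, uniques = pd.factorize(data["y"])

# Code 0 becomes 'answer1', code 1 becomes 'answer2', and so on.
data["y_val"] = ["answer" + str(code + 1) for code in codes]

print(data)
```

Missing values in "y" would get the code -1 and therefore the label 'answer0', so they may need to be handled separately if your data contains any.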
Hope this helps!