Question

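I'm working with music data and need to encode genre categories in Python 3 with pandas for a regression algorithm. I want to encode each genre as 0 or 1. The data is in a pandas DataFrame and contains duplicate values. I want to merge all rows into a single list of unique values and then use get_dummies to encode each record.

First attempt: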

for i in x:
    a = genres + list(i)
    genres.append(a)

Second attempt:

# x is a list of genre lists (like the input below)
[j for i in x for j in i]

list(itertools.chain(x))

Input:

Row 1 = ['hip hop', 'rock','pop rock','country']

Row 2 = ['pop', 'rock', 'pop rock' ,'alternative rock']

Expected output:

new list = ['hip hop', 'rock','country','pop','pop rock','alternative rock']

Final output:

      | hip hop | rock | country | pop | pop rock | alternative rock |
row 1 |   1     | 1    |  1      | 0   | 1        |  0               |
row 2 |   0     | 1    |  0      | 1   | 1        |  1               |

Answer 1

If the order of the elements does not matter, you can treat each list as a set, take the union, and convert the result back to a list:

def merge(r1, r2):
    return list(set().union(r1, r2))


row_1 = ['hip hop', 'rock','pop rock','country']
row_2 = ['pop', 'rock', 'pop rock' ,'alternative rock']

print(merge(row_1, row_2))

Output (element order may vary, because sets are unordered)

['pop rock', 'alternative rock', 'country', 'hip hop', 'rock', 'pop']
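
The question mentions merging all rows, not just two. A minimal sketch of the same set-based idea for any number of rows follows; the name merge_all is hypothetical, and sorting the result is an added assumption to make the output reproducible:

```python
def merge_all(*rows):
    """Return the sorted union of any number of genre lists."""
    # set().union accepts any number of iterables; sorted() gives a
    # reproducible order, since set iteration order is arbitrary.
    return sorted(set().union(*rows))


row_1 = ['hip hop', 'rock', 'pop rock', 'country']
row_2 = ['pop', 'rock', 'pop rock', 'alternative rock']

print(merge_all(row_1, row_2))
# ['alternative rock', 'country', 'hip hop', 'pop', 'pop rock', 'rock']
```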

However, if the order (of first appearance) matters, you can do the following:

from itertools import chain

def merge_with_order(r1, r2):

    seen = set()
    result = []
    for e in chain(r1, r2):
        if e not in seen:
            seen.add(e)
            result.append(e)

    return result


row_1 = ['hip hop', 'rock','pop rock','country']
row_2 = ['pop', 'rock', 'pop rock' ,'alternative rock']

print(merge_with_order(row_1, row_2))

Output

['hip hop', 'rock', 'pop rock', 'country', 'pop', 'alternative rock']
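
If the rows arrive as one list of lists (like x in the question), the same loop can run over itertools.chain.from_iterable. This is a sketch under that assumption; merge_rows_with_order is a hypothetical name. Note that chain(x) with a single argument only yields the inner lists themselves, which may be why the second attempt in the question did not flatten anything.

```python
from itertools import chain


def merge_rows_with_order(rows):
    """Merge a list of lists, keeping each item's first appearance."""
    seen = set()
    result = []
    # chain.from_iterable flattens one level of nesting, so e is a
    # genre string rather than an inner list.
    for e in chain.from_iterable(rows):
        if e not in seen:
            seen.add(e)
            result.append(e)
    return result


rows = [
    ['hip hop', 'rock', 'pop rock', 'country'],
    ['pop', 'rock', 'pop rock', 'alternative rock'],
]

print(merge_rows_with_order(rows))
# ['hip hop', 'rock', 'pop rock', 'country', 'pop', 'alternative rock']
```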

If you prefer a one-liner, consider using collections.OrderedDict:

from itertools import chain
from collections import OrderedDict


def merge_with_order(r1, r2):
    return list(OrderedDict.fromkeys(chain(r1, r2)))
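
The merged list can then serve as the column set for the 0/1 table the question asks for. Below is a rough pure-Python sketch of that step, not the get_dummies call itself; the function name encode is hypothetical, and it relies on plain dict preserving insertion order (Python 3.7+) in place of OrderedDict:

```python
from itertools import chain


def encode(rows):
    """Return (columns, indicator rows) for a list of genre lists."""
    # Unique genres in order of first appearance (dict keeps insertion
    # order on Python 3.7+).
    columns = list(dict.fromkeys(chain.from_iterable(rows)))
    encoded = []
    for row in rows:
        present = set(row)  # set membership tests are O(1)
        encoded.append([1 if genre in present else 0 for genre in columns])
    return columns, encoded


rows = [
    ['hip hop', 'rock', 'pop rock', 'country'],
    ['pop', 'rock', 'pop rock', 'alternative rock'],
]

columns, encoded = encode(rows)
print(columns)
# ['hip hop', 'rock', 'pop rock', 'country', 'pop', 'alternative rock']
for indicators in encoded:
    print(indicators)
# [1, 1, 1, 1, 0, 0]
# [0, 1, 1, 0, 1, 1]
```

The columns and indicator rows could then be handed to a pandas DataFrame if the rest of the pipeline expects one.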
