Change a DataFrame column to set type

Asked: 2019-08-13 11:06:27

Tags: python pandas set

I'm having a problem with pandas.

I have the following DataFrame:

name     random_words
Anne     [hello, hi, bye]
John     [red, blue, green, yellow, grey, black]
Marie    [orange, lemon, pear, apple]
Mark     [cat, dog]

I loaded the DataFrame with pd.read_csv(). The problem is that I need the values in the random_words column to be of type set.

I tried using astype(), but it doesn't work.

2 answers:

Answer 0 (score: 1)

Convert the values to lists with strip and split, then convert each list to a set:

print (df)

    name                             random_words
0   Anne                         [hello, hi, bye]
1   John  [red, blue, green, yellow, grey, black]
2  Marie             [orange, lemon, pear, apple]
3   Mark                               [cat, dog]

print (type(df.loc[0,'random_words']))
<class 'str'>

df['random_words'] = df['random_words'].str.strip('[]').str.split(', ').apply(set)
print (df)
    name                             random_words
0   Anne                         {bye, hi, hello}
1   John  {yellow, grey, blue, red, green, black}
2  Marie             {pear, lemon, apple, orange}
3   Mark                               {dog, cat}

Or do the same in a custom lambda function:

df['random_words'] = df['random_words'].apply(lambda x: set(x.strip('[]').split(', ')))
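
The two variants above can be tried end to end on an in-memory CSV. This is only a minimal sketch: the CSV text is a made-up stand-in for the asker's file, and it assumes the list column is quoted in the file so that its inner commas are not read as field separators.

```python
import io

import pandas as pd

# Hypothetical CSV content mirroring the question's data; the list column is
# quoted so that its inner commas are not treated as field separators.
csv_text = '''name,random_words
Anne,"[hello, hi, bye]"
John,"[red, blue, green, yellow, grey, black]"
Marie,"[orange, lemon, pear, apple]"
Mark,"[cat, dog]"
'''

df = pd.read_csv(io.StringIO(csv_text))

# After read_csv every cell in the column is a plain string such as "[cat, dog]".
assert isinstance(df.loc[0, 'random_words'], str)

# Vectorized variant: strip the brackets, split on ", ", then build a set per row.
vectorized = df['random_words'].str.strip('[]').str.split(', ').apply(set)

# Lambda variant: the same steps with plain Python string methods.
with_lambda = df['random_words'].apply(lambda x: set(x.strip('[]').split(', ')))

# Both variants should produce the same sets row by row.
print(vectorized.tolist() == with_lambda.tolist())
```

Neither variant modifies df here; assign the result back to df['random_words'] to replace the column. The resulting column has dtype object, since pandas has no dedicated set dtype.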

If the words inside the string are wrapped in quotes '' (not the case in the sample data, but possible in real data):

import ast
df['random_words'] = df['random_words'].apply(lambda x: set(ast.literal_eval(x)))
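
To see why the quoted case needs ast.literal_eval, here is a small sketch comparing it with strip/split. The two-row DataFrame is invented for illustration and assumes each cell is the repr of a Python list of strings.

```python
import ast

import pandas as pd

# Hypothetical data in which each cell is the repr of a Python list of strings.
df = pd.DataFrame({
    'name': ['Anne', 'Mark'],
    'random_words': ["['hello', 'hi', 'bye']", "['cat', 'dog']"],
})

# strip/split keeps the quote characters as part of every element.
naive = df['random_words'].str.strip('[]').str.split(', ').apply(set)

# ast.literal_eval parses the string into a real list, so the quotes disappear.
parsed = df['random_words'].apply(lambda x: set(ast.literal_eval(x)))

print(naive[1])
print(parsed[1])
```

Note that ast.literal_eval only accepts valid Python literals, so on the unquoted sample data from the question (for example "[cat, dog]") it raises an error instead of returning a list.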

If the values are already lists:

print (type(df.loc[0,'random_words']))
<class 'list'>

df['random_words'] = df['random_words'].apply(set)

Edit:

If you get the error below, the cause is evidently missing values:

print (df)

    name                             random_words
0   Anne                                      NaN
1   John  [red, blue, green, yellow, grey, black]
2  Marie             [orange, lemon, pear, apple]
3   Mark                               [cat, dog]

df['random_words'] = df['random_words'].str.strip('[]').str.split(', ').apply(set)
print (df)
  

TypeError: 'float' object is not iterable

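You can then cast the column to strings first, but the missing value then becomes a set containing the string repr of NaN (which may be perfectly fine, depending on what you need):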

df['random_words'] = df['random_words'].astype(str).str.strip('[]').str.split(', ').apply(set)
print (df)
    name                             random_words
0   Anne                                    {nan}
1   John  {yellow, grey, blue, red, green, black}
2  Marie             {pear, lemon, apple, orange}
3   Mark                               {dog, cat}
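
If a set containing the string 'nan' is not wanted, one possible alternative is to convert only the string cells and map everything else to an empty set. This is a sketch rather than part of the original answer: the helper name to_set and the choice of an empty set for missing values are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in the list column.
df = pd.DataFrame({
    'name': ['Anne', 'Mark'],
    'random_words': [np.nan, '[cat, dog]'],
})


def to_set(value):
    # Hypothetical helper: string cells are parsed as before, while
    # non-string cells (such as NaN) become an empty set.
    if isinstance(value, str):
        return set(value.strip('[]').split(', '))
    return set()


df['random_words'] = df['random_words'].apply(to_set)
print(df['random_words'].tolist())
```

Returning the value unchanged instead of set() would keep NaN in the column, which may be preferable if missing rows need to stay distinguishable from rows that have an empty list.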

Answer 1 (score: 0)

import pandas as pd

df = pd.DataFrame({"name": ["Anne", "John", "Marie", "Mark"],
                   "random_words": [["hello", "hi", "bye"],
                                    ["red", "blue", "green", "yellow", "grey", "black"],
                                    ["orange", "lemon", "pear", "apple"],
                                    ["cat", "dog"]]})

df['random_words'] = df['random_words'].apply(set)
df

    name                             random_words
0   Anne                         {hi, bye, hello}
1   John  {blue, yellow, green, black, red, grey}
2  Marie             {orange, pear, apple, lemon}
3   Mark                               {cat, dog}