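Pandas Series can't seem to split lists?

Time: 2018-02-12 02:33:15

Tags: python pandas split series

I have a column that contains several genres per row, and I'm trying to split the list of genres so that I can get each genre separately. No matter what I try, I get NaN for the entire column in the DataFrame.

This is what the data looks like: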

0                                      [Drama,, Romance]
1                 [Animation,, Comedy,, Kids, &, Family]
2                         [Drama,, Mystery, &, Suspense]
3                                                [Drama]
4                                                    NaN
5                 [Art, House, &, International,, Drama]
6       [Art, House, &, International,, Drama,, Romance]
7                                          [Documentary]
8      [Action, &, Adventure,, Animation,, Art, House...
9               [Action, &, Adventure,, Drama,, Western]
10                                     [Comedy,, Horror]

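I want: ["Drama", "Romance"] ["Animation", "Comedy", "Kids & Family"] ......

I'm doing this because I want to see how many unique genres there are. Currently I can only see the unique lists, but I want each unique genre. I'm not even sure I'm approaching this the right way, so any help is much appreciated.

This is my latest attempt (x is the data shown above plus more rows):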

 x = pd.Series(x)
 x = x.str.split()
 [i.str.split() for i in x]

Thank you very much for your help!

1 Answer:

Answer 0: (score: 0)

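Your data appears to be inconsistent, with some extraneous commas. Assuming each cell is actually a string, you need to eval the string representation of the list to turn it into a real list.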
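Whether the cells hold strings or already-built lists matters, because the `.str` methods only act on string elements. The following is a minimal sketch for checking that assumption; the sample Series `as_strings` and `as_lists` and the helper `type_names` are made up for illustration and are not taken from the question's real data.

```python
import numpy as np
import pandas as pd

# Hypothetical samples that would both print like the question's output.
as_strings = pd.Series(["[Drama,, Romance]", "[Drama]", np.nan], dtype=object)
as_lists = pd.Series([["Drama,", "Romance"], ["Drama"], np.nan], dtype=object)


def type_names(s):
    """Return the Python type name of every element in the Series."""
    return s.map(lambda v: type(v).__name__)


print(type_names(as_strings).tolist())
print(type_names(as_lists).tolist())

# .str.split() cannot split list elements, so every row comes back as NaN.
all_nan = as_lists.str.split().isna().all()
print(all_nan)
```

If the second case matches your data, the all-NaN result described in the question would be expected, and the string-based steps here would need the lists converted to strings first.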

A few steps:

# First, import ast to use for literal_eval()
import ast
import pandas as pd

# Then, remove the extraneous commas (literal match, no regex needed)
new_df = df[0].str.replace(', ', ' ', regex=False)

# Then, add quotes into your listed items to prep for eval.
# regex=True is required on pandas 2.0+, where str.replace matches literally by default.
new_df = new_df.str.replace(r'(?P<item>\b[\w &]+)', r'"\1"', regex=True)

# Then, eval the string representation, skipping NaN rows
lst = [ast.literal_eval(i) for i in new_df if pd.notnull(i)]

# Or, you can just put all of this together:
lst = [ast.literal_eval(i) for i in df[0].str.replace(', ', ' ', regex=False).str.replace(r'(?P<item>\b[\w &]+)', r'"\1"', regex=True) if pd.notnull(i)]

Output:

[['Drama', 'Romance'],
 ['Animation', 'Comedy', 'Kids & Family'],
 ['Drama', 'Mystery & Suspense'],
 ['Drama'],
 ['Art House & International', 'Drama'],
 ['Art House & International', 'Drama', 'Romance'],
 ['Documentary'],
 ['Action & Adventure', 'Animation', 'Art House'],
 ['Action & Adventure', 'Drama', 'Western'],
 ['Comedy', 'Horror']]
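As an alternative to quoting and `literal_eval`, the same lists can probably be produced with plain string methods. This is a sketch of a different technique, not the approach above, and it assumes every cell is a string shaped exactly like the sample rows; `raw` and `cleaned` are illustrative names.

```python
import numpy as np
import pandas as pd

# Hypothetical sample shaped like the rows in the question.
raw = pd.Series(
    ["[Drama,, Romance]", "[Animation,, Comedy,, Kids, &, Family]", np.nan, "[Drama]"],
    dtype=object,
)

cleaned = (
    raw.str.strip("[]")                      # drop the surrounding brackets
       .str.replace(", ", " ", regex=False)  # ",, " becomes ", " and ", &, " becomes " & "
       .str.split(", ")                      # split on the remaining separators
)

print(cleaned.tolist())
```

NaN rows pass through unchanged, so the original index is preserved without a separate filtering step.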

If you want to keep the index and represent the result as a dictionary:

 d = {i: ast.literal_eval(j) for i, j in new_df.items() if pd.notnull(j)}

Output:

{0: ['Drama', 'Romance'],
 1: ['Animation', 'Comedy', 'Kids & Family'],
 2: ['Drama', 'Mystery & Suspense'],
 3: ['Drama'],
 5: ['Art House & International', 'Drama'],
 6: ['Art House & International', 'Drama', 'Romance'],
 7: ['Documentary'],
 8: ['Action & Adventure', 'Animation', 'Art House'],
 9: ['Action & Adventure', 'Drama', 'Western'],
 10: ['Comedy', 'Horror']}
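Since the stated goal was to see how many unique genres exist, one way to get there from the parsed lists is to flatten them and count. A minimal sketch, using only the first few entries of the dictionary shown above as sample input; `genre_counts` and `unique_genres` are names chosen here for illustration.

```python
from collections import Counter
from itertools import chain

# First four entries of the parsed dictionary, copied as sample input.
d = {
    0: ["Drama", "Romance"],
    1: ["Animation", "Comedy", "Kids & Family"],
    2: ["Drama", "Mystery & Suspense"],
    3: ["Drama"],
}

# Flatten all per-row lists into one stream of genres and tally them.
genre_counts = Counter(chain.from_iterable(d.values()))
unique_genres = sorted(genre_counts)

print(unique_genres)
print(len(unique_genres))
print(genre_counts["Drama"])
```

The same two lines work on the plain list `lst` by passing it directly to `chain.from_iterable`.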

If you want this back in a DataFrame, I'm not sure what layout you are after, but once you have the dict or list, converting it back is trivial.
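For instance, one possible layout keeps one row per original index and then expands to one row per genre. This is only a sketch under assumptions: the column name `genres` is invented, the sample dictionary is a small subset of the one above, and `Series.explode` requires pandas 0.25 or later.

```python
import pandas as pd

# Hypothetical subset of the parsed dictionary; row 4 was NaN and is absent.
d = {
    3: ["Drama"],
    5: ["Art House & International", "Drama"],
}

# Rebuild a column keyed by the original row labels; reindex restores row 4 as NaN.
genres = pd.Series(d, name="genres").reindex(range(3, 6))
df = genres.to_frame()

# One row per genre, keeping the original row label as the index.
long_form = df["genres"].explode().dropna()

print(df)
print(long_form)
```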