Question

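I have a column that contains multiple genres, and I am trying to split the list of genres so that I get each genre separately. No matter what I try, I end up with NaN for the entire column in the DataFrame.

This is what the data looks like: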

0                                      [Drama,, Romance]
1                 [Animation,, Comedy,, Kids, &, Family]
2                         [Drama,, Mystery, &, Suspense]
3                                                [Drama]
4                                                    NaN
5                 [Art, House, &, International,, Drama]
6       [Art, House, &, International,, Drama,, Romance]
7                                          [Documentary]
8      [Action, &, Adventure,, Animation,, Art, House...
9               [Action, &, Adventure,, Drama,, Western]
10                                     [Comedy,, Horror]

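What I want: ["Drama", "Romance"] ["Animation", "Comedy", "Kids & Family"] ...

I am doing this because I want to see how many unique genres there are. At the moment I can only see the unique lists, but I want each unique genre. I am not even sure I am going about this the right way, so any help is much appreciated.

This is my most recent attempt (x is the data shown above plus more rows):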

 x = pd.Series(x)
 x = x.str.split()
 [i.str.split() for i in x]

Thank you very much for your help!

Answer 1

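Your data appears to be inconsistent, with some extraneous commas. Assuming each value is actually a string, you need to eval the string representation of the list to turn it into a real list.

A few steps: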
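Because everything below depends on the values being strings, it may be worth checking that first. The following is only a sketch on made-up sample data (`as_strings` and `as_lists` are hypothetical names, not from the question). It shows that `.str.split()` returns NaN for elements that are already lists, which is one plausible reason for the all-NaN column described in the question.

```python
import pandas as pd

# Hypothetical samples: the same data stored as strings and as lists of tokens.
as_strings = pd.Series(["[Drama,, Romance]", None, "[Comedy,, Horror]"])
as_lists = pd.Series([["Drama,", "Romance"], None, ["Comedy,", "Horror"]])

# Count the Python type of each non-null element to see what the column holds.
string_types = as_strings.dropna().map(type).value_counts()
list_types = as_lists.dropna().map(type).value_counts()
print(string_types)
print(list_types)

# .str.split() only operates on string elements; list elements become NaN.
split_strings = as_strings.str.split()
split_lists = as_lists.str.split()
print(split_strings)
print(split_lists)
```

If the type check reports `list` rather than `str`, the string-based steps in this answer do not apply as written.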

# First, import ast to use for literal_eval()
import ast
import pandas as pd

# Then, remove the extraneous commas (a literal match, not a regex)
new_df = df[0].str.replace(', ', ' ', regex=False)

# Then, add quotes around the listed items to prep for eval.
# regex=True is needed on pandas 2.0+, where str.replace matches literally by default.
new_df = new_df.str.replace(r'(?P<item>\b[\w &]+)', r'"\1"', regex=True)

# Then, eval the string representation, skipping NaN rows
lst = [ast.literal_eval(i) for i in new_df if pd.notnull(i)]

# Or, you can just put all of this together:
lst = [ast.literal_eval(i) for i in df[0].str.replace(', ', ' ', regex=False).str.replace(r'(?P<item>\b[\w &]+)', r'"\1"', regex=True) if pd.notnull(i)]

Output:

[['Drama', 'Romance'],
 ['Animation', 'Comedy', 'Kids & Family'],
 ['Drama', 'Mystery & Suspense'],
 ['Drama'],
 ['Art House & International', 'Drama'],
 ['Art House & International', 'Drama', 'Romance'],
 ['Documentary'],
 ['Action & Adventure', 'Animation', 'Art House'],
 ['Action & Adventure', 'Drama', 'Western'],
 ['Comedy', 'Horror']]

If you want to keep the index and represent the result as a dictionary:

 d = {i: ast.literal_eval(j) for i, j in new_df.items() if pd.notnull(j)}

Output:

{0: ['Drama', 'Romance'],
 1: ['Animation', 'Comedy', 'Kids & Family'],
 2: ['Drama', 'Mystery & Suspense'],
 3: ['Drama'],
 5: ['Art House & International', 'Drama'],
 6: ['Art House & International', 'Drama', 'Romance'],
 7: ['Documentary'],
 8: ['Action & Adventure', 'Animation', 'Art House'],
 9: ['Action & Adventure', 'Drama', 'Western'],
 10: ['Comedy', 'Horror']}

If you want this in a DataFrame, I am not sure exactly what shape you are after, but once you have the dict or list, putting it back is trivial.
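As one possible way to do that, the sketch below rebuilds a Series from a shortened copy of the dict above and then uses `Series.explode()` to get one row per genre, which makes counting unique genres straightforward. The names `genres`, `exploded`, and `counts` are illustrative, and the choice of `explode()` is an assumption about what the question needs rather than part of the original answer.

```python
import pandas as pd

# Shortened copy of the parsed dict, keyed by the original row index.
d = {
    0: ["Drama", "Romance"],
    1: ["Animation", "Comedy", "Kids & Family"],
    3: ["Drama"],
    10: ["Comedy", "Horror"],
}

# A dict of lists becomes a Series whose values are the lists.
# reindex restores the rows that were skipped (they come back as NaN).
genres = pd.Series(d, name="genres").reindex(range(11))

# explode() emits one row per list element, repeating the index label.
exploded = genres.explode()

# NaN rows are ignored by unique-value counting.
counts = exploded.value_counts()
n_unique = exploded.nunique()
print(counts)
print(n_unique)
```

Assigning `genres` back to a DataFrame column (for example `df["genres"] = genres`) aligns on the index, so rows that were NaN in the source stay NaN.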
