Question

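I'm new to pandas and am practicing with some data. I have a training dataset in the format below; it is a dataset of movie reviews. How can I build a DataFrame from this kind of data to use for SVM classification? I have already practiced classification on data of size [12000 * 12], where every row has the same number of attributes. Here, however, the rows have different numbers of attributes, because the phrases vary in length. How should I adapt my approach?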

PhraseId    SentenceId  Phrase  Sentiment
1   1   Wanker Goths are on the loose ! 2
2   1   Wanker Goths    2
3   1   Wanker  2
4   1   Goths   2
5   1   are on the loose !  2
6   1   are on the loose    2
7   1   on the loose    2
8   1   the loose   2
9   2   made Eddie Murphy a movie star and the man has n't aged a day . 3
10  2   made Eddie Murphy a movie star and the man  3
11  2   Eddie Murphy a movie star and the man   2
12  2   a movie star and the man    2
13  2   a movie star and    2
14  2   has n't aged a day .    2
15  2   has n't aged a day  3
16  2   aged a day  2

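This is (part of) the actual training dataset.

My goal is to build a DataFrame from this dataset using a numeric mapping of the data, so that I can use that DataFrame to classify Sentiment.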

Answer 1

Using pure Python:

t = """PhraseId    SentenceId  Phrase  Sentiment
1   1   Wanker Goths are on the loose ! 2
2   1   Wanker Goths    2
3   1   Wanker  2
4   1   Goths   2
5   1   are on the loose !  2"""

Split the string on newlines:

t = t.split('\n')

Then split each line on whitespace to get a list of token lists:

s = [i.split() for i in t]
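
To see why the next step has to re-join part of each row, here is a minimal sketch on three rows of the sample. The names `rows`, `lengths`, and `phrases` are illustrative and not part of the original answer.

```python
# Three rows of the sample, with fields separated by runs of spaces.
t = """PhraseId    SentenceId  Phrase  Sentiment
1   1   Wanker Goths are on the loose ! 2
2   1   Wanker Goths    2
3   1   Wanker  2"""

rows = [line.split() for line in t.split('\n')]

# Token counts differ per row because the phrase length varies.
lengths = [len(r) for r in rows]

# The first two tokens and the last token are always the fixed fields;
# everything in between belongs to the phrase.
phrases = [' '.join(r[2:-1]) for r in rows[1:]]

print(lengths)
print(phrases)
```

Slicing with `[2:-1]` works regardless of how many words the phrase contains, which is what makes the uneven rows manageable.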

Then join the phrase tokens back together and build the DataFrame:

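import pandas as pd

# In each row the first two tokens and the last token are fixed fields;
# everything in between is re-joined into the phrase.
df = pd.DataFrame([(i[0], i[1], ' '.join(i[2:-1]), i[-1]) for i in s], columns=s[0])
# Drop the first row, which is the header line (.ix has been removed from pandas).
df = df.iloc[1:]
print(df)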
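
The question also asks for a numeric mapping with the same number of attributes in every row. One common way to get that, which the answer above does not cover, is a binary bag-of-words encoding. The following is only a sketch under that assumption, using a few rows from the sample; `X` and `y` are illustrative names.

```python
import pandas as pd

# A few rows from the sample, already parsed into columns.
df = pd.DataFrame({
    'Phrase': ['Wanker Goths', 'Wanker', 'Goths', 'the loose'],
    'Sentiment': [2, 2, 2, 2],
})

# One indicator column per distinct word: 1 if the phrase contains it, else 0.
# Every row now has the same number of numeric attributes.
X = df['Phrase'].str.get_dummies(sep=' ')
y = df['Sentiment'].astype(int)

print(X)
```

`X` and `y` can then be passed to an SVM implementation in the same way as a fixed-width numeric table. On the full dataset the vocabulary becomes large, so a sparse representation would likely be preferable to a dense DataFrame.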
