Suppose I have the following dataset:
pos sentence_idx word
NNS 1.0 Thousands
IN 1.0 of
NNS 1.0 demonstrators
VBP 1.0 have
VBN 1.0 marched
... ... ... ...
PRP 47959.0 they
VBD 47959.0 responded
TO 47959.0 to
DT 47959.0 the
NN 47959.0 attack
I want to build sentences from it (for which I have to use sentence_idx). I can do this with the following code:
sent = []
for i in df['sentence_idx'].unique():
    sent.append([(w, t) for w, t in zip(df[df['sentence_idx'] == i]['word'].values.tolist(),
                                        df[df['sentence_idx'] == i]['pos'].values.tolist())])
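But first of all it is inefficient (it uses a for loop instead of numpy / pandas functions and re-filters the whole DataFrame for every sentence), and it also looks ugly. How can I do this more efficiently?
Edit: The result should be a list of sentences, where each element of a sentence is a (word, pos) tuple: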
[[('Thousands', 'NNS'),
('of', 'IN'),
('demonstrators', 'NNS'),
('have', 'VBP'),
('marched', 'VBN'),
('through', 'IN'),
('London', 'NNP'),
('to', 'TO'),
('protest', 'VB'),
('the', 'DT'),
('war', 'NN'),
('in', 'IN'),
('Iraq', 'NNP'),
('and', 'CC'),
('demand', 'VB'),
('withdrawal', 'NN'),
('British', 'JJ'),
('troops', 'NNS'),
('from', 'IN'),
('that', 'DT'),
('country', 'NN'),
('.', '.')],
[('Families', 'NNS'),
('of', 'IN'),
('soldiers', 'NNS'),
('killed', 'VBN'),
('in', 'IN'),
('the', 'DT'),
('conflict', 'NN'),
('joined', 'VBD'),
('protesters', 'NNS'),
('who', 'WP'),
('carried', 'VBD'),
('banners', 'NNS'),
('with', 'IN'),
('such', 'JJ'),
('slogans', 'NNS'),
('as', 'IN'),
('"', '``'),
('Bush', 'NNP'),
('Number', 'NN'),
('One', 'CD'),
('Terrorist', 'NN'),
('and', 'CC'),
('Stop', 'VB'),
('Bombings', 'NNS'),
('.', '.')],...
Answer 0 (score: 2)
This should work:
def compute(group):
    # Pair each word with its POS tag within one sentence group.
    return [*zip(group['word'], group['pos'])]
df.groupby('sentence_idx').apply(compute).values.tolist()
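For anyone who wants to try this out, here is a minimal self-contained sketch of the same groupby idea on a toy DataFrame. The sample rows and the names `to_pairs` and `sentences` are made up for illustration. Two choices here are assumptions rather than part of the answer: the columns are selected before `apply`, and `sort=False` is passed so that sentences come out in order of first appearance (as with `unique()` in the question) instead of sorted by key.

```python
import pandas as pd

# Toy data in the same shape as the question's DataFrame (illustrative values only).
df = pd.DataFrame({
    "pos": ["NNS", "IN", "NNS", "PRP", "VBD"],
    "sentence_idx": [1.0, 1.0, 1.0, 2.0, 2.0],
    "word": ["Thousands", "of", "demonstrators", "they", "responded"],
})

def to_pairs(group):
    # Build the (word, pos) tuples for a single sentence.
    return list(zip(group["word"], group["pos"]))

# One list of tuples per sentence; sort=False keeps first-appearance order.
sentences = (
    df.groupby("sentence_idx", sort=False)[["word", "pos"]]
      .apply(to_pairs)
      .tolist()
)
print(sentences)
```

Each sentence is only sliced once by `groupby`, rather than filtering the full DataFrame twice per sentence as in the original loop.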
Answer 1 (score: 1)
I'm not sure about efficiency, but here are a few ways to do it:
df.groupby('sentence_idx')[['word', 'pos']].apply(lambda x: list(zip(*zip(*x.values.tolist())))).tolist()
df.groupby('sentence_idx').apply(lambda x: x[['word', 'pos']].apply(tuple, axis=1).tolist())
df.groupby('sentence_idx').apply(lambda x: [tuple(y) for y in x[['word', 'pos']].values]).tolist()
If you don't necessarily need tuples (i.e. a list is fine), it is much simpler:
df.groupby('sentence_idx').apply(lambda x: x[['word', 'pos']].values.tolist()).tolist()
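Since efficiency is left open above, one further variant that might be worth timing is to build all (word, pos) tuples in a single pass and then only collect them per sentence. This is a sketch under assumptions, not a benchmarked result: the toy data and the names `pairs` and `sentences` are illustrative, and whether it is actually faster depends on the data and the pandas version.

```python
import pandas as pd

# Toy data in the same shape as the question's DataFrame (illustrative values only).
df = pd.DataFrame({
    "pos": ["NNS", "IN", "NNS", "PRP", "VBD"],
    "sentence_idx": [1.0, 1.0, 1.0, 2.0, 2.0],
    "word": ["Thousands", "of", "demonstrators", "they", "responded"],
})

# Build every (word, pos) tuple once, aligned with the original row index.
pairs = pd.Series(list(zip(df["word"], df["pos"])), index=df.index)

# Group the tuples by sentence and collect each group into a list.
sentences = pairs.groupby(df["sentence_idx"], sort=False).agg(list).tolist()
print(sentences)
```

The per-group work is then just `list(...)` on already-built tuples, so no DataFrame slicing or column lookup happens inside each group.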