Question

Column A contains 3 sentences (ABC, DEF, GHI). Each sentence starts with <s> and ends with </s>.

For example, this is a single sentence:

 Column A                           Column B

(('<s>', '<s>'),  'abc')            0.043025210084033615
(('<s>', 'abc'),  'abc')            0.65234375
(('abc', 'abc'),  'abc')            0.04259501965923984
(('abc', 'abc'),  'abc')            0.18604651162790697
(('abc', 'abc'),  '</s>')           0.41317365269461076
(('abc', '</s>'), '</s>')           0.011148272017837236

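When a sentence ends with the end marker, i.e. a row whose Column A ends in '</s>'), '</s>'), I want to multiply the Column B values of all rows belonging to that sentence. Example: (0.04302521 * 0.65234375 * 0.04259502 * 0.186046512 * 0.413173653 * 0.011148272 = 1.02452e-06)

I want to get one output for each sentence in the data frame: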

 Column A                           Column B



 (('<s>', '<s>'),  'abc')            0.043025210084033615
 (('<s>', 'abc'),  'abc')            0.65234375
 (('abc', 'abc'),  'abc')            0.04259501965923984
 (('abc', 'abc'),  'abc')            0.18604651162790697
 (('abc', 'abc'),  '</s>')           0.41317365269461076
 (('abc', '</s>'), '</s>')           0.011148272017837236
 (('<s>', '<s>'),  'def')            0.09090909090909091
 (('def', 'def'),  'def')            0.008287292817679558
 (('def', 'def'),  'def')            0.13506493506493505
 (('def', 'def'),  '</s>')           0.007653061224489796
 (('def', '</s>'), '</s>')           0.08333333333333333
 (('<s>', '<s>'),  'ghi')            0.5
 (('ghi', 'ghi'),  'ghi')            0.125
 (('ghi', 'ghi'),  'ghi')            0.033766233766233764
 (('ghi', 'ghi'),  '</s>')           0.0694980694980695
 (('ghi','</s>'),  '</s>')           0.16666666666666666

The output should be:
(0.04302521 * 0.65234375 * 0.04259502 * 0.186046512 * 0.413173653 * 0.011148272 = 1.02452e-06)
(0.090909091 * 0.008287293 * 0.135064935 * 0.007653061 * 0.083333333 = 6.48958e-08)
(0.5 * 0.125 * 0.033766234 * 0.069498069 * 0.166666667 = 2.44447e-05)

The output should be in the following format: 1.02452e-06 6.48958e-08 2.44447e-05

Answer 1

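One approach is to create a column named sentence that can later be used with groupby. Assuming your data frame is called df, I first create this column filled with 0: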

df['sentence'] = 0

Now put a 1 in this column wherever Column A contains ('<s>', '<s>'), then use cumsum to give each sentence a different number:

df.loc[df['Column A'].str.contains("('<s>', '<s>')", regex=False), 'sentence'] = 1
df['sentence'] = df['sentence'].cumsum()
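
To see what the labeling step produces, here is a minimal self-contained sketch on a shortened, made-up version of the data. It assumes Column A holds strings; if it held actual tuples, str.contains would not match and the column would first need converting, for example with astype(str). Passing regex=False makes the parentheses in the pattern match literally. Because a boolean mask sums as 0 and 1, taking cumsum of the mask directly gives the same labels as the two-step version.

```python
import pandas as pd

# Toy data: two short "sentences", Column A stored as strings (an assumption).
df = pd.DataFrame({
    "Column A": [
        "(('<s>', '<s>'), 'abc')",
        "(('<s>', 'abc'), '</s>')",
        "(('abc', '</s>'), '</s>')",
        "(('<s>', '<s>'), 'def')",
        "(('def', '</s>'), '</s>')",
    ],
    "Column B": [0.5, 0.25, 0.1, 0.2, 0.5],
})

# True on the first row of each sentence, where the context is ('<s>', '<s>').
starts = df["Column A"].str.contains("('<s>', '<s>')", regex=False)

# The running count of start rows gives every row its sentence number.
df["sentence"] = starts.cumsum()

print(df["sentence"].tolist())  # [1, 1, 1, 2, 2]
```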

All that remains is to group by this column and use prod:

df.groupby('sentence')['Column B'].prod()
Out[527]: 
sentence
1.0    1.024519e-06
2.0    6.489579e-08
3.0    2.444467e-05
Name: Column B, dtype: float64
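
The steps can be checked end to end with the data from the question. The sketch below is one way to do so, again assuming Column A is stored as plain strings; it assigns through df.loc with a column label instead of chained indexing. With an integer sentence column the index may print as 1, 2, 3 rather than 1.0, 2.0, 3.0.

```python
import pandas as pd

# Rows from the question; Column A is assumed to be stored as plain strings.
rows = [
    ("(('<s>', '<s>'), 'abc')", 0.043025210084033615),
    ("(('<s>', 'abc'), 'abc')", 0.65234375),
    ("(('abc', 'abc'), 'abc')", 0.04259501965923984),
    ("(('abc', 'abc'), 'abc')", 0.18604651162790697),
    ("(('abc', 'abc'), '</s>')", 0.41317365269461076),
    ("(('abc', '</s>'), '</s>')", 0.011148272017837236),
    ("(('<s>', '<s>'), 'def')", 0.09090909090909091),
    ("(('def', 'def'), 'def')", 0.008287292817679558),
    ("(('def', 'def'), 'def')", 0.13506493506493505),
    ("(('def', 'def'), '</s>')", 0.007653061224489796),
    ("(('def', '</s>'), '</s>')", 0.08333333333333333),
    ("(('<s>', '<s>'), 'ghi')", 0.5),
    ("(('ghi', 'ghi'), 'ghi')", 0.125),
    ("(('ghi', 'ghi'), 'ghi')", 0.033766233766233764),
    ("(('ghi', 'ghi'), '</s>')", 0.0694980694980695),
    ("(('ghi', '</s>'), '</s>')", 0.16666666666666666),
]
df = pd.DataFrame(rows, columns=["Column A", "Column B"])

# Number the sentences: flag each start row, then take the running total.
df["sentence"] = 0
is_start = df["Column A"].str.contains("('<s>', '<s>')", regex=False)
df.loc[is_start, "sentence"] = 1
df["sentence"] = df["sentence"].cumsum()

# Multiply the Column B values within each sentence.
products = df.groupby("sentence")["Column B"].prod()
print(products)
```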

Depending on how you need the result, you can get it as a list with df.groupby('sentence')['Column B'].prod().tolist().
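
If the results are wanted on a single space-separated line, as the question asks, the list can be formatted afterwards. This is only a sketch: the products list below is a stand-in built from the rounded values printed in the output above, and five decimals in scientific notation is an assumed choice of format.

```python
# Stand-in for df.groupby('sentence')['Column B'].prod().tolist(),
# using the rounded values shown in the printed output.
products = [1.024519e-06, 6.489579e-08, 2.444467e-05]

# Five decimals in scientific notation, separated by spaces.
line = " ".join(f"{p:.5e}" for p in products)
print(line)  # 1.02452e-06 6.48958e-08 2.44447e-05
```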

I have a pandas data frame with two columns (Column A and Column B).
