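The workflow is as follows:
I have completed everything up to step 3, that is, up to the column named "end".
I can't figure out how to flag the values between start and end, as shown in ExpectedFlag. Is there a way to do this with pandas operations?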
text LWS LineNum start end ExpectedFlag
0 somethin 3 2 0 0 0
1 somethin 3 2 0 0 0
2 somethin 2 2 0 0 0
3 value 70 2 1 0 1
4 value 3 2 0 0 1
5 value: 3 2 0 1 1
6 val1 200 3 1 0 1
7 val1: 3 3 0 1 1
8 val2 3 3 0 0 0
9 val2 100 3 1 0 1
10 val2: 3 3 0 1 1
11 djsal 3 3 0 0 0
12 jdsal 3 3 0 0 0
13 ajsd 3 3 0 0 0
Answer 0 (score: 1)
To fill in the values between start and end, you can proceed as follows, based on this answer:
Data:
import pandas as pd
df = pd.DataFrame([[0,0],[0,0],[0,0],[1,0],[0,0],[0,1],[0,0],[0,0],[1,0],[0,1],[0,0]], columns=['start','end'])
start end
0 0 0
1 0 0
2 0 0
3 1 0
4 0 0
5 0 1
6 0 0
7 0 0
8 1 0
9 0 1
10 0 0
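Get the positions of the start and end markers (Series.nonzero() was removed in pandas 1.0, so convert to a NumPy array first):
s = df['start'].to_numpy().nonzero()[0]
e = df['end'].to_numpy().nonzero()[0]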
>>> s, e
(array([3, 8], dtype=int64), array([5, 9], dtype=int64))
Reshape the original index into a column vector:
>>> index = df.index.values.reshape(-1,1)
>>> index
array([[ 0],
[ 1],
[ 2],
[ 3],
[ 4],
[ 5],
[ 6],
[ 7],
[ 8],
[ 9],
[10]], dtype=int64)
Then we can take advantage of numpy's broadcasting:
>>> index < [1] >>> index < [1,2,3,4,5]
array([[ True], array([[ True, True, True, True, True],
[False], [False, True, True, True, True],
[False], [False, False, True, True, True],
[False], [False, False, False, True, True],
[False], [False, False, False, False, True],
[False], [False, False, False, False, False],
[False], [False, False, False, False, False],
[False], [False, False, False, False, False],
[False], [False, False, False, False, False],
[False], [False, False, False, False, False],
[False]]) [False, False, False, False, False]])
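The shape rule behind this comparison can be checked on a smaller case. The sketch below is only illustrative: the `bounds` array is a made-up threshold vector, not part of the question's data. A column vector of shape (n, 1) compared against a 1-D array of shape (k,) yields an (n, k) result, with one column per threshold.

```python
import numpy as np

index = np.arange(4).reshape(-1, 1)   # column vector, shape (4, 1)
bounds = np.array([1, 3])             # hypothetical thresholds, shape (2,)

# Each row position is compared with every threshold,
# so the result is broadcast to shape (4, 2).
result = index < bounds

print(index.shape, bounds.shape, result.shape)
print(result)
```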
Build one condition per start-end pair:
>>> ((s <= index) & (index <= e))
array([[False, False],
[False, False],
[False, False],
[ True, False],
[ True, False],
[ True, False],
[False, False],
[False, False],
[False, True],
[False, True],
[False, False]])
Then sum along axis=1 to collapse the per-pair conditions into a single flag per row:
df['Expected Flag'] = ((s <= index) & (index <= e)).sum(axis=1)
start end Expected Flag
0 0 0 0
1 0 0 0
2 0 0 0
3 1 0 1
4 0 0 1
5 0 1 1
6 0 0 0
7 0 0 0
8 1 0 1
9 0 1 1
10 0 0 0
As a one-liner:
((df['start'].to_numpy().nonzero()[0] <= df.index.values.reshape(-1,1)) & (df.index.values.reshape(-1,1) <= df['end'].to_numpy().nonzero()[0])).sum(axis=1)
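For reference, the steps of this answer can be gathered into one self-contained sketch. The helper name `flag_between` and its parameters are hypothetical, introduced here only for illustration. It differs from the answer in two places. It uses row positions instead of `df.index.values`, so it does not depend on the default integer index. It also uses `any` instead of `sum`, so a row covered by more than one span would still be flagged 1. It assumes every start has a matching end in the same order.

```python
import numpy as np
import pandas as pd

def flag_between(df, start_col="start", end_col="end"):
    """Return 1 for rows inside any start..end span (inclusive), else 0."""
    # Positions of the markers; spans are paired in order of appearance.
    s = np.flatnonzero(df[start_col].to_numpy())
    e = np.flatnonzero(df[end_col].to_numpy())
    if len(s) != len(e):
        raise ValueError("each start needs a matching end")
    # Row positions as a column vector so they broadcast against s and e.
    pos = np.arange(len(df)).reshape(-1, 1)
    inside = (s <= pos) & (pos <= e)      # shape (n_rows, n_spans)
    return inside.any(axis=1).astype(int)

df = pd.DataFrame({
    "start": [0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0],
    "end":   [0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0],
})
df["Expected Flag"] = flag_between(df)
print(df)
```

The intermediate boolean array has one column per span, so memory grows with rows times spans. That is fine for small frames like this one but worth keeping in mind for large ones.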
Answer 1 (score: 1)
You can write and apply a function to do this:
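def proc():
    # Closure state: True while we are inside a start..end span.
    started = False
    def inner(b):
        nonlocal started
        if started:
            # An end marker closes the span, but the end row is still flagged.
            if b == 1:
                started = False
            return 1
        else:
            # A start marker opens a span and is flagged itself.
            if b == 1:
                started = True
                return 1
        return 0
    return inner
# start + end is 1 on every marker row, so one column drives the state machine.
df['ExpectedFlag'] = (df['start'] + df['end']).apply(proc())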
df
which returns:
text LWS LineNum start end ExpectedFlag
0 somethin 3 2 0 0 0
1 somethin 3 2 0 0 0
2 somethin 2 2 0 0 0
3 value 70 2 1 0 1
4 value 3 2 0 0 1
5 value: 3 2 0 1 1
6 val1 200 3 1 0 1
7 val1: 3 3 0 1 1
8 val2 3 3 0 0 0
9 val2 100 3 1 0 1
10 val2: 3 3 0 1 1
11 djsal 3 3 0 0 0
12 jdsal 3 3 0 0 0
13 ajsd 3 3 0 0 0
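The same flags can also be computed without a stateful function. This is a different technique from either answer above, offered as a minimal sketch: it compares running counts of the markers. It assumes, as in the question's data, that spans do not nest and that each start is followed by its own end. A row is inside a span when more starts have been seen up to and including that row than ends have been seen strictly before it.

```python
import pandas as pd

# The start/end columns from the question's table.
df = pd.DataFrame({
    "start": [0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0],
    "end":   [0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0],
})

# Number of spans opened so far, including the current row.
opened = df["start"].cumsum()
# Number of spans closed before the current row; the shift keeps
# the end row itself inside its span.
closed = df["end"].cumsum().shift(fill_value=0)

df["ExpectedFlag"] = (opened > closed).astype(int)
print(df)
```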