Pandas: flag values between markers in another column

Time: 2018-09-18 14:32:12

Tags: python pandas data-science

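The workflow is as follows:

  1. Group by LineNum
  2. Mark rows where the LWS column is greater than 50 as "start"
  3. Mark rows where the "text" column contains ":" (a colon) as "end"
  4. In "ExpectedFlag", mark every row from a start through its end as 1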

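I have completed everything up to step 3, that is, up to the column named "end".

I can't figure out how to flag the values between start and end, as shown in ExpectedFlag. Is there a way to do this with pandas operations?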

        text  LWS LineNum   start   end     ExpectedFlag
0   somethin    3       2       0     0                0
1   somethin    3       2       0     0                0
2   somethin    2       2       0     0                0
3   value      70       2       1     0                1
4   value       3       2       0     0                1
5   value:      3       2       0     1                1
6   val1      200       3       1     0                1
7   val1:       3       3       0     1                1
8   val2        3       3       0     0                0
9   val2      100       3       1     0                1
10  val2:       3       3       0     1                1
11  djsal       3       3       0     0                0
12  jdsal       3       3       0     0                0
13  ajsd        3       3       0     0                0
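
For reference, the sample frame above and steps 2 and 3 could be reproduced roughly as follows. This is a minimal sketch, not the asker's actual code: the values are copied from the table, and it assumes "greater than 50" is a strict comparison and that the colon should be matched as a literal character.

```python
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "text": ["somethin", "somethin", "somethin", "value", "value", "value:",
             "val1", "val1:", "val2", "val2", "val2:", "djsal", "jdsal", "ajsd"],
    "LWS": [3, 3, 2, 70, 3, 3, 200, 3, 3, 100, 3, 3, 3, 3],
    "LineNum": [2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3],
})

# Step 2: rows whose LWS exceeds 50 open a region (strict ">" is an assumption).
df["start"] = (df["LWS"] > 50).astype(int)

# Step 3: rows whose text contains a literal colon close a region.
df["end"] = df["text"].str.contains(":", regex=False).astype(int)

print(df)
```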

2 answers:

Answer 0: (score: 1)

To fill in the values between start and end, you can proceed as follows, based on this answer:

Data:

import pandas as pd

# 11 rows, matching the output shown below
df = pd.DataFrame([[0,0],[0,0],[0,0],[1,0],[0,0],[0,1],[0,0],[0,0],[1,0],[0,1],[0,0]], columns=['start','end'])

   start end
0   0   0
1   0   0
2   0   0
3   1   0
4   0   0
5   0   1
6   0   0
7   0   0
8   1   0
9   0   1
10  0   0

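Get the positions of start and end:

# Series.nonzero() was removed in pandas 1.0, so go through the NumPy array
s = df['start'].to_numpy().nonzero()[0]
e = df['end'].to_numpy().nonzero()[0]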
>>> s, e
(array([3, 8], dtype=int64), array([5, 9], dtype=int64))

Reshape the original index into a column vector:

>>> index = df.index.values.reshape(-1,1)
>>> index
array([[ 0],
       [ 1],
       [ 2],
       [ 3],
       [ 4],
       [ 5],
       [ 6],
       [ 7],
       [ 8],
       [ 9],
       [10]], dtype=int64)

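We can then take advantage of NumPy's broadcasting: comparing the (n, 1) column against a length-k array yields an (n, k) boolean array, one column per value compared against.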

>>> index < [1]       >>> index < [1,2,3,4,5]
array([[ True],       array([[ True,  True,  True,  True,  True],
       [False],             [False,  True,  True,  True,  True],
       [False],             [False, False,  True,  True,  True],
       [False],             [False, False, False,  True,  True],
       [False],             [False, False, False, False,  True],
       [False],             [False, False, False, False, False],
       [False],             [False, False, False, False, False],
       [False],             [False, False, False, False, False],
       [False],             [False, False, False, False, False],
       [False],             [False, False, False, False, False],
       [False]])            [False, False, False, False, False]])

Build one condition column for each start-end pair:

>>> ((s <= index) & (index <= e))

array([[False, False],
       [False, False],
       [False, False],
       [ True, False],
       [ True, False],
       [ True, False],
       [False, False],
       [False, False],
       [False,  True],
       [False,  True],
       [False, False]])

Then sum across each row, so that a row gets 1 if it falls inside any start-end pair:

df['Expected Flag'] = ((s <= index) & (index <= e)).sum(axis=1)

    start  end  Expected Flag
0       0    0              0
1       0    0              0
2       0    0              0
3       1    0              1
4       0    0              1
5       0    1              1
6       0    0              0
7       0    0              0
8       1    0              1
9       0    1              1
10      0    0              0

As a one-liner: ((df['start'].to_numpy().nonzero()[0] <= df.index.values.reshape(-1,1)) & (df.index.values.reshape(-1,1) <= df['end'].to_numpy().nonzero()[0])).sum(axis=1)

Answer 1: (score: 1)

You can write a function and apply it to do this:

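def proc():
    started = False  # True while we are inside a start-end region

    def inner(b):
        nonlocal started
        if started:
            if b == 1:
                started = False  # this row is the end marker, so close the region
            return 1  # every row inside the region, including the end row
        else:
            if b == 1:
                started = True  # this row is the start marker, so open the region
                return 1
            return 0  # outside any region

    return inner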

df['ExpectedFlag'] = (df['start'] + df['end']).apply(proc())
df

This returns:

        text  LWS  LineNum  start  end  ExpectedFlag
0   somethin    3        2      0    0             0
1   somethin    3        2      0    0             0
2   somethin    2        2      0    0             0
3      value   70        2      1    0             1
4      value    3        2      0    0             1
5     value:    3        2      0    1             1
6       val1  200        3      1    0             1
7      val1:    3        3      0    1             1
8       val2    3        3      0    0             0
9       val2  100        3      1    0             1
10     val2:    3        3      0    1             1
11     djsal    3        3      0    0             0
12     jdsal    3        3      0    0             0
13      ajsd    3        3      0    0             0
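
The answer above runs over the whole frame and does not cover step 1 of the workflow (grouping by LineNum). One possible way to add it, sketched here under the assumption that a region should never continue across a LineNum boundary, is to create a fresh closure for each group. `make_flagger` is a hypothetical name for a copy of the same state machine as `proc`; the data is copied from the question.

```python
import pandas as pd

def make_flagger():
    """Same state machine as proc() above, under a hypothetical name."""
    started = False
    def inner(b):
        nonlocal started
        if started:
            if b == 1:
                started = False  # end marker closes the region
            return 1
        if b == 1:
            started = True       # start marker opens the region
            return 1
        return 0
    return inner

# Columns copied from the table in the question.
df = pd.DataFrame({
    "LineNum": [2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3],
    "start":   [0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0],
    "end":     [0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0],
})

marker = df["start"] + df["end"]  # 1 on every start or end row

# A new closure per group means the "started" state resets at each LineNum.
df["ExpectedFlag"] = marker.groupby(df["LineNum"]).transform(
    lambda group: group.apply(make_flagger())
)
print(df)
```

On this sample the grouped result is identical to the ungrouped one, because every region is already closed before LineNum changes; the two would differ only if a start had no end within its own group.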