Splitting strings with a regular expression along a column or row of a matrix

Asked: 2017-03-02 17:51:12

Tags: python regex numpy matrix group-by

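What is the best way to perform re.split on a column of a matrix? I want to do this so that I can preserve the different annotations or labels associated with the characters of the strings stored in the matrix's columns.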

Here is what I have so far, to illustrate the point:

from re import split
from numpy import vstack, zeros

text = 'This is a line of text.  It will need to be split into sentences.'
# binary annotations of characters (1 denotes a character of interest which belongs to a word of interest)
bin_ann = zeros(len(text))
# This is just to label some of the words in the text
text_to_label = ['text', 'sentences']
for l in text_to_label:
    start = text.find(l)
    end = start + len(l)
    bin_ann[start:end] = 1

# we can zip these together and make a matrix such that each character is now labeled.
# (list() is needed on Python 3, where zip returns an iterator that vstack cannot stack)
z = list(zip(text, bin_ann))
nz = vstack(z)
# This is our labeled text matrix
print(nz)
print(nz[:,0])

# re.split works on a str, not on an array, so join the character column first
s = split(r'(\s+|\.)', ''.join(nz[:, 0]))
print(s)

This produces the following output:

[['T' '0.0']
 ['h' '0.0']
 ['i' '0.0']
 ['s' '0.0']
 [' ' '0.0']
 ['i' '0.0']
 ['s' '0.0']
 [' ' '0.0']
 ['a' '0.0']
 [' ' '0.0']
 ['l' '0.0']
 ['i' '0.0']
 ['n' '0.0']
 ['e' '0.0']
 [' ' '0.0']
 ['o' '0.0']
 ['f' '0.0']
 [' ' '0.0']
 ['t' '1.0']
 ['e' '1.0']
 ['x' '1.0']
 ['t' '1.0']
 ['.' '0.0']
 [' ' '0.0']
 [' ' '0.0']
 ['I' '0.0']
 ['t' '0.0']
 [' ' '0.0']
 ['w' '0.0']
 ['i' '0.0']
 ['l' '0.0']
 ['l' '0.0']
 [' ' '0.0']
 ['n' '0.0']
 ['e' '0.0']
 ['e' '0.0']
 ['d' '0.0']
 [' ' '0.0']
 ['t' '0.0']
 ['o' '0.0']
 [' ' '0.0']
 ['b' '0.0']
 ['e' '0.0']
 [' ' '0.0']
 ['s' '0.0']
 ['p' '0.0']
 ['l' '0.0']
 ['i' '0.0']
 ['t' '0.0']
 [' ' '0.0']
 ['i' '0.0']
 ['n' '0.0']
 ['t' '0.0']
 ['o' '0.0']
 [' ' '0.0']
 ['s' '1.0']
 ['e' '1.0']
 ['n' '1.0']
 ['t' '1.0']
 ['e' '1.0']
 ['n' '1.0']
 ['c' '1.0']
 ['e' '1.0']
 ['s' '1.0']
 ['.' '0.0']]
['T' 'h' 'i' 's' ' ' 'i' 's' ' ' 'a' ' ' 'l' 'i' 'n' 'e' ' ' 'o' 'f' ' '
 't' 'e' 'x' 't' '.' ' ' ' ' 'I' 't' ' ' 'w' 'i' 'l' 'l' ' ' 'n' 'e' 'e'
 'd' ' ' 't' 'o' ' ' 'b' 'e' ' ' 's' 'p' 'l' 'i' 't' ' ' 'i' 'n' 't' 'o'
 ' ' 's' 'e' 'n' 't' 'e' 'n' 'c' 'e' 's' '.']
['This', ' ', 'is', ' ', 'a', ' ', 'line', ' ', 'of', ' ', 'text', '.', '', '  ', 'It', ' ', 'will', ' ', 'need', ' ',
 'to', ' ', 'be', ' ', 'split', ' ', 'into', ' ', 'sentences', '.', '']
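
A list like this no longer says where each piece came from. As a rough sketch, and assuming the split pattern is wrapped in a capture group so that the delimiters are kept (the pattern `(\s+|\.)` and the name `spans` are illustrative assumptions here), the pieces concatenate back to the original text, so a running offset recovers each piece's character range:

```python
import re

text = 'This is a line of text.  It will need to be split into sentences.'

# The capture group keeps the delimiters, so the pieces join back into text.
pieces = re.split(r'(\s+|\.)', text)

spans = []
pos = 0
for piece in pieces:
    if piece:  # skip the empty strings re.split emits around adjacent delimiters
        spans.append((piece, pos, pos + len(piece)))
    pos += len(piece)

print(spans[:3])
```

Each `(piece, start, end)` triple could then be used to slice a per-character annotation column over the same range.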

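I would like to keep the character annotations aligned with the text while grouping the characters into tokens. So the desired output might look something like this: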

[ ['This' 0.0] [' ' 0.0] ['a' 0.0] ... ['text' 1.0] ...['sentences' 1.0] ['.' 0.0] ]
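
One possible way to get output of roughly that shape is sketched below; it is not a definitive answer. It walks over the delimiter matches with `re.finditer` and slices the label sequence with the same character offsets. The helper name `split_with_labels`, the pattern `\s+|\.`, and the rule that a piece is labeled 1.0 if any of its characters is labeled 1 are all assumptions made for illustration. The pattern is also assumed never to match an empty string.

```python
import re

text = 'This is a line of text.  It will need to be split into sentences.'

# Per-character binary labels, built as in the question but with a plain list.
labels = [0.0] * len(text)
for word in ['text', 'sentences']:
    start = text.find(word)
    labels[start:start + len(word)] = [1.0] * len(word)


def split_with_labels(text, labels, pattern=r'\s+|\.'):
    """Hypothetical helper: split text into tokens and delimiters,
    pairing each piece with the maximum label of its characters."""
    pieces = []
    pos = 0
    for m in re.finditer(pattern, text):
        if m.start() > pos:  # token between the previous delimiter and this one
            pieces.append((text[pos:m.start()], float(max(labels[pos:m.start()]))))
        # the delimiter itself (assumed non-empty)
        pieces.append((m.group(), float(max(labels[m.start():m.end()]))))
        pos = m.end()
    if pos < len(text):  # trailing token after the last delimiter
        pieces.append((text[pos:], float(max(labels[pos:]))))
    return pieces


tokens = split_with_labels(text, labels)
print(tokens)
```

Because the helper only slices and takes `max`, `labels` can be a list or a 1-D NumPy array. With the matrix from the question, `''.join(nz[:, 0])` should supply the text and `nz[:, 1].astype(float)` the labels.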

0 answers:

No answers yet