Creating bigrams from a string using regular expressions

Time: 2016-08-17 17:08:37

Tags: python regex string list n-gram

I have a string:

"[u'LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to+expectancy~-nsubj+is~parataxis+NUMBER~nsubj+NUMBER_SLOT', u'LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to+expectancy~-nsubj+is~parataxis+NUMBER~nsubj+NUMBER_SLOT']"

It comes from an Excel file. It looks like an array, but because it was extracted from a file, it is just a string.

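What I need to do is:

a) Remove the [ ]

b) Split the string on , to actually create a new list

c) Take only the first string, i.e. u'LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to+expectancy~-nsubj+is~parataxis+NUMBER~nsubj+NUMBER_SLOT'

d) Create the bigrams of the resulting string and output them as a single space-separated string (rather than as a list of bigrams):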

LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to *extend*to~prepc_according_to+expectancy~-nsubj expectancy~-nsubj+is~parataxis is~parataxis+NUMBER~nsubj NUMBER~nsubj+NUMBER_SLOT

This is the current snippet I have been playing with:

text = "[u'LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to+expectancy~-nsubj+is~parataxis+NUMBER~nsubj+NUMBER_SLOT', u'LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to+expectancy~-nsubj+is~parataxis+NUMBER~nsubj+NUMBER_SLOT']"
text = re.sub('^\[(.*)\]',"\1",text)
text = [text.split(",")[0]]
bigrams = [b for l in text for b in zip(l.split("+")[:-1], l.split("+")[1:])]
bigrams = [("+").join(bigram).encode('utf-8') for bigram in bigrams]
bigrams = (' ').join(map(str, bigrams))
bigrams = ('').join(bigrams)

My regex does not seem to return anything.

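2 Answers:

Answer 0: (score: 1)

Your string looks like a Python list of unicode strings, right?

You can evaluate it to get a list of unicode strings. A good way to do this is with the ast.literal_eval function from the ast module.

Simply write: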

text = "[u'LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to+expectancy~-nsubj+is~parataxis+NUMBER~nsubj+NUMBER_SLOT'," \
       " u'LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to+expectancy~-nsubj+is~parataxis+NUMBER~nsubj+NUMBER_SLOT']"

import ast

lines = ast.literal_eval(text)

The result is a list of unicode strings:

for line in lines:
    print(line)

You will get:

LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to+expectancy~-nsubj+is~parataxis+NUMBER~nsubj+NUMBER_SLOT
LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to+expectancy~-nsubj+is~parataxis+NUMBER~nsubj+NUMBER_SLOT

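Computing the bigrams

# Pair each "+"-separated token with its successor, for every string in lines
# (use lines[:1] to process only the first string, as the question asks).
bigrams = [b for l in lines for b in zip(l.split("+")[:-1], l.split("+")[1:])]
bigrams = ["+".join(bigram) for bigram in bigrams]
bigrams = ' '.join(bigrams)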
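
Putting the steps of this answer together, here is a minimal sketch that parses the text, keeps only the first string, and joins its bigrams with spaces. The helper name first_string_bigrams is hypothetical, and the sketch assumes the input is always a valid Python list literal containing at least one string.

```python
import ast


def first_string_bigrams(text):
    """Return the space-separated bigrams of the first string in a list literal.

    Hypothetical helper; assumes `text` is a valid Python list literal of strings.
    """
    lines = ast.literal_eval(text)   # parse the literal safely, without eval()
    tokens = lines[0].split("+")     # keep only the first string, split on "+"
    # Pair each token with its successor; zip stops at the shorter sequence.
    pairs = zip(tokens, tokens[1:])
    return " ".join("+".join(pair) for pair in pairs)


text = ("[u'LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to"
        "+expectancy~-nsubj+is~parataxis+NUMBER~nsubj+NUMBER_SLOT',"
        " u'LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to"
        "+expectancy~-nsubj+is~parataxis+NUMBER~nsubj+NUMBER_SLOT']")
print(first_string_bigrams(text))
```

Zipping the token list against itself shifted by one avoids index arithmetic, so a string with a single token simply yields an empty result instead of an error.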

Answer 1: (score: 0)

I have solved this. The regex needs two passes: first replace the brackets, then take the first string, then remove the quotes:

import re

text = "[u'LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to+expectancy~-nsubj+is~parataxis+NUMBER~nsubj+NUMBER_SLOT', u'LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to+expectancy~-nsubj+is~parataxis+NUMBER~nsubj+NUMBER_SLOT']"
# First pass: strip the leading "[u" and the closing "]".
text = re.sub(r'\[u|\]', "", text)
# Keep only the first string, then remove its quotes (second pass).
text = text.split(",")[0]
text = re.sub(r'\'', "", text)
text = text.split("+")
# A list of n tokens has n-1 bigrams, so iterate up to len(text)-1.
bigrams = [text[i:i+2] for i in range(len(text)-1)]
bigrams = ["+".join(bigram) for bigram in bigrams]
bigrams = ' '.join(bigrams)
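
For reference, a likely reason the single-pass regex in the question seemed to return nothing is that the replacement "\1" is not a raw string: Python reads it as the control character with code 1 rather than as a backreference to group 1. The sketch below shows the difference on a shortened stand-in input, not the original data; the names broken and fixed are illustrative only.

```python
import re

# Shortened stand-in for the original input (an assumption for brevity).
text = "[u'A+B+C', u'A+B+C']"

# Without the r prefix, "\1" is the single character chr(1), not a backreference,
# so the whole bracketed match is replaced by that one control character.
broken = re.sub(r'^\[(.*)\]', "\1", text)

# With a raw string, \1 refers to capture group 1: the text between the brackets.
fixed = re.sub(r'^\[(.*)\]', r"\1", text)

print(repr(broken))
print(repr(fixed))
```

With the raw-string replacement the brackets are removed in one pass, although the u prefix and the quotes still have to be stripped separately, as the code in this answer does.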