Question

I have a string:

"[u'LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to+expectancy~-nsubj+is~parataxis+NUMBER~nsubj+NUMBER_SLOT', u'LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to+expectancy~-nsubj+is~parataxis+NUMBER~nsubj+NUMBER_SLOT']"

It was taken from an Excel file. It looks like a list, but because it was read from a file it is just a string.

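What I need to do is:

a) Remove the [ ]

b) Split the string on , so that I actually get a new list

c) Take only the first string, i.e. u'LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to+expectancy~-nsubj+is~parataxis+NUMBER~nsubj+NUMBER_SLOT'

d) Create the bigrams of the resulting string and output them as one actual string separated by spaces (not as a list of bigrams):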

LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to *extend*to~prepc_according_to+expectancy~-nsubj expectancy~-nsubj+is~parataxis is~parataxis+NUMBER~nsubj NUMBER~nsubj+NUMBER_SLOT

This is the snippet I have been playing with so far:

text = "[u'LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to+expectancy~-nsubj+is~parataxis+NUMBER~nsubj+NUMBER_SLOT', u'LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to+expectancy~-nsubj+is~parataxis+NUMBER~nsubj+NUMBER_SLOT']"
text = re.sub('^\[(.*)\]',"\1",text)
text = [text.split(",")[0]]
bigrams = [b for l in text for b in zip(l.split("+")[:-1], l.split("+")[1:])]
bigrams = [("+").join(bigram).encode('utf-8') for bigram in bigrams]
bigrams = (' ').join(map(str, bigrams))
bigrams = ('').join(bigrams)

My regex doesn't seem to return anything.

Answer 1

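Your string looks like a Python list of unicode strings, right?

You can evaluate it to get a list of unicode strings. A good way to do that is the ast.literal_eval function from the ast module.

Simply write: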

text = "[u'LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to+expectancy~-nsubj+is~parataxis+NUMBER~nsubj+NUMBER_SLOT'," \
       " u'LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to+expectancy~-nsubj+is~parataxis+NUMBER~nsubj+NUMBER_SLOT']"

import ast

lines = ast.literal_eval(text)

The result is a list of unicode strings:

for line in lines:
    print(line)

You get:

LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to+expectancy~-nsubj+is~parataxis+NUMBER~nsubj+NUMBER_SLOT
LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to+expectancy~-nsubj+is~parataxis+NUMBER~nsubj+NUMBER_SLOT
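
Since the string comes from an Excel cell, it may not always be a well-formed list literal. As a minimal sketch (the helper name `parse_cell` is hypothetical and not part of the original answer), `ast.literal_eval` raises `ValueError` or `SyntaxError` on input it cannot parse, and both can be caught:

```python
import ast

def parse_cell(cell):
    """Parse a cell that should hold a Python list literal of strings.

    Hypothetical helper: returns an empty list when the cell cannot be parsed.
    """
    try:
        value = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return []
    # Accept only an actual list; anything else is treated as unparseable.
    return value if isinstance(value, list) else []

print(parse_cell("[u'a+b', u'c+d']"))
print(parse_cell("not a list"))
```

Returning an empty list is only one possible choice; re-raising or logging the bad cell may suit your data better.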

To compute the bigrams:

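# Pair each "+"-separated token with its successor, for every line in the list.
bigrams = [b for l in lines for b in zip(l.split("+")[:-1], l.split("+")[1:])]
# Re-join each pair with "+" (encoded to a UTF-8 byte string, as in Python 2).
bigrams = ["+".join(bigram).encode('utf-8') for bigram in bigrams]
# Separate the bigrams with single spaces.
bigrams = ' '.join(bigrams)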
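
Note that the question asks for the bigrams of the first string only, whereas the snippet above iterates over every entry in `lines`. A minimal sketch of that variant, written for Python 3 (where no `encode` step is needed) and using a hypothetical helper name `first_bigrams` and a shortened sample string:

```python
import ast

def first_bigrams(text, sep="+"):
    """Return space-separated bigrams of the first string in a list literal."""
    lines = ast.literal_eval(text)
    tokens = lines[0].split(sep)
    # zip pairs each token with its successor: (t0, t1), (t1, t2), ...
    pairs = zip(tokens, tokens[1:])
    return " ".join(sep.join(pair) for pair in pairs)

text = "[u'A~x+B~y+C~z', u'ignored+entry']"
print(first_bigrams(text))  # A~x+B~y B~y+C~z
```

Because `zip` stops at the shorter sequence, a string with a single token simply yields an empty result.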

Answer 2

I have solved the problem. The regex needs two passes: first strip the brackets, then take the first string, then remove the quotes:

import re

text = "[u'LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to+expectancy~-nsubj+is~parataxis+NUMBER~nsubj+NUMBER_SLOT', u'LOCATION_SLOT~-prep_in+*extend*to~prepc_according_to+expectancy~-nsubj+is~parataxis+NUMBER~nsubj+NUMBER_SLOT']"
# First pass: strip every "[u" and "]".
text = re.sub(r'\[u|\]', "", text)
# Keep only the first list entry.
text = text.split(",")[0]
# Second pass: remove the quotes.
text = re.sub(r'\'', "", text)
text = text.split("+")
# n tokens give n - 1 bigrams; stopping at len(text) - 2 would drop the last one.
bigrams = [text[i:i+2] for i in xrange(len(text) - 1)]
bigrams = [("+").join(bigram).encode('utf-8') for bigram in bigrams]
bigrams = (' ').join(map(str, bigrams))
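
As for why the `re.sub('^\[(.*)\]',"\1",text)` call in the question appeared to return nothing: in a non-raw Python string literal, `"\1"` is an octal escape for the control character `\x01`, not a backreference to group 1, so the whole match is replaced by that single character. A short sketch (Python 3, with a shortened sample string as an assumption) showing the difference a raw string makes:

```python
import re

text = "[u'A+B', u'C+D']"

# "\1" without the r prefix is the single control character chr(1).
broken = re.sub(r'^\[(.*)\]', "\1", text)
print(repr(broken))   # '\x01'

# r"\1" is a backslash followed by "1", which re.sub treats as group 1.
fixed = re.sub(r'^\[(.*)\]', r"\1", text)
print(fixed)          # u'A+B', u'C+D'
```

With the raw-string replacement, the original single-pass approach removes the brackets as intended; the quotes and the `u` prefix still have to be stripped afterwards.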

Creating bigrams from a string using regex
