Python: .split() a string on every splittable token except spaces, but ignore certain specific strings

Time: 2016-07-06 15:59:04

Tags: python regex string split

I want to split a sentence into tokens, dropping the whitespace but leaving 2 specific strings intact.

For example:

GNI per capita ; PPP -LRB- US dollar -RRB- in LOCATION_SLOT was last measured at NUMBER_SLOT in 2011 , according to the World Bank .

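should be split into ['GNI', 'per', 'capita', ';', 'PPP', '-', 'LRB', '-', 'US', 'dollar', '-', 'RRB', '-', 'in', 'LOCATION_SLOT', 'was', 'last', 'measured', 'at', 'NUMBER_SLOT', 'in', '2011', ',', 'according', 'to', 'the', 'World', 'Bank', '.']

I don't want LOCATION_SLOT or NUMBER_SLOT to be split, e.g. the former into [LOCATION, _, SLOT]. But I do want punctuation such as the full stop to be kept as a token.

My current function keeps only alphabetic words and removes numbers and things like ; , : from the sentence. I don't want it to remove these: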

import re
from nltk.corpus import stopwords

def sentence_to_words(sentence, remove_stopwords=False):
    # [...] is a character class: this keeps only letters, '|', ' ' and '_'
    letters_only = re.sub("[^a-zA-Z| LOCATION_SLOT | NUMBER_SLOT]", " ", sentence)
    words = letters_only.lower().split()
    if remove_stopwords:
        stops = set(stopwords.words("english"))
        words = [w for w in words if w not in stops]
    return words

This produces these tokens:

gni per capita ppp lrb us dollar rrb location_slot last measured number_slot according world bank
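
A likely reason for the missing digits and punctuation, sketched here as a note rather than as part of the original post: everything between the square brackets is read as a set of single characters, so the slot names are not treated as whole words. The `sample` string below is made up for illustration.

```python
import re

# Same pattern as in the question. Inside [...] the slot names are just
# extra letters plus '_', and '|' is a literal character, not alternation.
pattern = "[^a-zA-Z| LOCATION_SLOT | NUMBER_SLOT]"

# Illustrative input, not taken from the question.
sample = "at NUMBER_SLOT in 2011 , per capita ; A|B"

# Digits, ',' and ';' fall outside the class and are replaced by spaces.
tokens = re.sub(pattern, " ", sample).split()
print(tokens)
# ['at', 'NUMBER_SLOT', 'in', 'per', 'capita', 'A|B']
```

NUMBER_SLOT survives only because its letters and the underscore happen to be in the set.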

2 Answers:

Answer 0 (score: 1)

You can simply use split:

>>> x = "GNI per capita ; PPP -LRB- US dollar -RRB- in LOCATION_SLOT was last measured at NUMBER_SLOT in 2011 , according to the World Bank ."
>>>
>>> x.split()
['GNI', 'per', 'capita', ';', 'PPP', '-LRB-', 'US', 'dollar', '-RRB-', 'in', 'LOCATION_SLOT', 'was', 'last', 'measured', 'at', 'NUMBER_SLOT', 'in', '2011', ',', 'according', 'to', 'the', 'World', 'Bank', '.']

To remove the dashes around -LRB-, do this:

>>> z = [y.strip('-') for y in x.split()]
>>> z
['GNI', 'per', 'capita', ';', 'PPP', 'LRB', 'US', 'dollar', 'RRB', 'in', 'LOCATION_SLOT', 'was', 'last', 'measured', 'at', 'NUMBER_SLOT', 'in', '2011', ',', 'according', 'to', 'the', 'World', 'Bank', '.']

If you want to keep the dashes as separate tokens:

>>> y = []
>>> for item in x.split():
...   if item.startswith('-') and item.endswith('-'):
...     y.append('-')
...     y.append(item.strip('-'))
...     y.append('-')
...   else:
...     y.append(item)
...
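
The loop above could be wrapped in a small helper. This is only a sketch: the function name `split_wrapped_dashes` is hypothetical, and it assumes that dash-wrapped tokens such as -LRB- carry exactly one dash on each side.

```python
def split_wrapped_dashes(sentence):
    """Split on whitespace, then break tokens like '-LRB-' into '-', 'LRB', '-'."""
    tokens = []
    for item in sentence.split():
        core = item.strip('-')
        # Require some content between the dashes so that a lone '-'
        # is kept as a single token instead of producing an empty string.
        if core and item.startswith('-') and item.endswith('-'):
            tokens.extend(['-', core, '-'])
        else:
            tokens.append(item)
    return tokens

print(split_wrapped_dashes("PPP -LRB- US dollar -RRB- in LOCATION_SLOT"))
# ['PPP', '-', 'LRB', '-', 'US', 'dollar', '-', 'RRB', '-', 'in', 'LOCATION_SLOT']
```

Because this relies on whitespace splitting, punctuation that is not already separated by spaces (for example "Bank.") stays attached to its word.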

Answer 1 (score: 1)

You can use re.findall and strip the whitespace from the start and end of each match (here line holds the sentence):

>>> import re
>>> [x.strip() for x in re.findall(r'\s*(\w+|\W+)', line)]
['GNI', 'per', 'capita', ';', 'PPP', '-', 'LRB', '-', 'US', 'dollar', '-', 'RRB', '-', 'in', 'LOCATION_SLOT', 'was', 'last', 'measured', 'at', 'NUMBER_SLOT', 'in', '2011', ',', 'according', 'to', 'the', 'World', 'Bank', '.']

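Regex explanation

> \w matches a word character, [A-Za-z0-9_] for ASCII text (in Python 3 it also matches Unicode letters and digits unless re.ASCII is set).
> \W is the negation of \w, i.e. it matches anything except a word character.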
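
Building on the \w / \W distinction above, here is a minimal sketch of a tokenizer with a signature close to the one in the question. The names `TOKEN_RE` and `sentence_to_tokens` and the `stops` parameter are assumptions made for illustration; the stop words are passed in as a plain set so the example does not depend on NLTK.

```python
import re

# \w+ matches a run of word characters; '_' counts as one, so
# LOCATION_SLOT and NUMBER_SLOT stay whole. [^\w\s] matches a single
# character that is neither a word character nor whitespace.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def sentence_to_tokens(sentence, stops=None):
    """Return word and punctuation tokens, optionally dropping stop words."""
    tokens = TOKEN_RE.findall(sentence)
    if stops:
        # Compare case-insensitively, keep the original casing in the result.
        tokens = [t for t in tokens if t.lower() not in stops]
    return tokens

sentence = ("GNI per capita ; PPP -LRB- US dollar -RRB- in LOCATION_SLOT "
            "was last measured at NUMBER_SLOT in 2011 , according to the World Bank .")
print(sentence_to_tokens(sentence))
```

Unlike \W+, which groups adjacent punctuation and whitespace into one match, [^\w\s] yields each punctuation character as its own token, so no strip() call is needed afterwards.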