I want to split a sentence into tokens, but leave two specific strings intact and ignore whitespace.
For example:
GNI per capita ; PPP -LRB- US dollar -RRB- in LOCATION_SLOT was last measured at NUMBER_SLOT in 2011 , according to the World Bank .
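should be split into ['GNI', 'per', 'capita', ';', 'PPP', '-', 'LRB', '-', 'US', 'dollar', '-', 'RRB', '-', 'in', 'LOCATION_SLOT', 'was', 'last', 'measured', 'at', 'NUMBER_SLOT', 'in', '2011', ',', 'according', 'to', 'the', 'World', 'Bank', '.'].
I do not want LOCATION_SLOT or NUMBER_SLOT to be split, for example the former into ['LOCATION', '_', 'SLOT']. I do, however, want punctuation to be kept as tokens.
My current function keeps only letter-based words and removes numbers and characters such as ; , and :, which I do not want it to remove: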
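import re
from nltk.corpus import stopwords

def sentence_to_words(sentence, remove_stopwords=False):
    # The brackets form a character class, so this replaces every character
    # that is not a letter, "|", "_" or a space; digits and punctuation are lost.
    letters_only = re.sub("[^a-zA-Z| LOCATION_SLOT | NUMBER_SLOT]", " ", sentence)
    words = letters_only.lower().split()
    if remove_stopwords:
        stops = set(stopwords.words("english"))
        words = [w for w in words if w not in stops]
    return words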
This produces these tokens:
gni per capita ppp lrb us dollar rrb location_slot last measured number_slot according world bank
Answer 0 (score: 1)
You can simply use split:
>>> x = "GNI per capita ; PPP -LRB- US dollar -RRB- in LOCATION_SLOT was last measured at NUMBER_SLOT in 2011 , according to the World Bank ."
>>>
>>> x.split()
['GNI', 'per', 'capita', ';', 'PPP', '-LRB-', 'US', 'dollar', '-RRB-', 'in', 'LOCATION_SLOT', 'was', 'last', 'measured', 'at', 'NUMBER_SLOT', 'in', '2011', ',', 'according', 'to', 'the', 'World', 'Bank', '.']
To remove the dashes around -LRB-, do this:
>>> z = [y.strip('-') for y in x.split()]
>>> z
['GNI', 'per', 'capita', ';', 'PPP', 'LRB', 'US', 'dollar', 'RRB', 'in', 'LOCATION_SLOT', 'was', 'last', 'measured', 'at', 'NUMBER_SLOT', 'in', '2011', ',', 'according', 'to', 'the', 'World', 'Bank', '.']
>>>
If you want to keep the dashes as separate tokens:
>>> y = []
>>> for item in x.split():
...     if item.startswith('-') and item.endswith('-'):
...         y.append('-')
...         y.append(item.strip('-'))
...         y.append('-')
...     else:
...         y.append(item)
...
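The loop above can also be wrapped in a small function. This is only a sketch: the name split_tokens and the keep_dashes flag are made up for illustration, and it assumes, as in the example sentence, that tokens are already separated by whitespace.

```python
def split_tokens(sentence, keep_dashes=True):
    """Whitespace-split a sentence, then break tokens like -LRB- apart."""
    tokens = []
    for item in sentence.split():
        inner = item.strip('-')
        # Only tokens wrapped in dashes with something inside (e.g. -LRB-)
        # are taken apart; a lone "-" or "--" is kept unchanged.
        if inner and item.startswith('-') and item.endswith('-'):
            if keep_dashes:
                tokens.extend(['-', inner, '-'])
            else:
                tokens.append(inner)
        else:
            tokens.append(item)
    return tokens


sentence = "PPP -LRB- US dollar -RRB- in LOCATION_SLOT , 2011 ."
print(split_tokens(sentence))
print(split_tokens(sentence, keep_dashes=False))
```

Because the split happens on whitespace only, LOCATION_SLOT and NUMBER_SLOT are never broken up, and digits and punctuation that stand alone survive as their own tokens.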
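Answer 1 (score: 1)
You can use re.findall and strip the whitespace from the start and end of each match (here line holds the input sentence):
>>> import re
>>> [x.strip() for x in re.findall(r'\s*(\w+|\W+)', line)]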
#['GNI', 'per', 'capita', ';', 'PPP', '-', 'LRB', '-', 'US', 'dollar', '-', 'RRB', '-', 'in', 'LOCATION_SLOT', 'was', 'last', 'measured', 'at', 'NUMBER_SLOT', 'in', '2011', ',', 'according', 'to', 'the', 'World', 'Bank', '.']
Regex explanation
> \w matches a word character, [A-Za-z0-9_] for ASCII text.
> \W is the negation of \w, i.e. it matches anything except a word character.
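One thing to watch with \W+ is that it also matches whitespace, so punctuation marks that follow each other with only spaces in between (for example "; -") come back as a single token. A possible variation, offered as a sketch rather than as part of the answer above, matches either a run of word characters or exactly one non-word, non-space character. The names TOKEN_RE and tokenize are hypothetical.

```python
import re

# \w+      : a run of word characters; "_" counts as a word character,
#            so LOCATION_SLOT and NUMBER_SLOT stay in one piece
# [^\w\s]  : exactly one character that is neither a word character
#            nor whitespace, i.e. a single punctuation mark
TOKEN_RE = re.compile(r'\w+|[^\w\s]')


def tokenize(sentence):
    """Return words, numbers and individual punctuation marks as tokens."""
    return TOKEN_RE.findall(sentence)


print(tokenize("PPP -LRB- US dollar -RRB- in LOCATION_SLOT was last "
               "measured at NUMBER_SLOT in 2011 , according to the World Bank ."))
```

Since whitespace is never part of a match, no strip() call is needed afterwards.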