Question

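I am trying to tokenize sentences in Python using nltk, except that I also want the \n and \t characters to be kept as tokens.

Example:

In: "This is a\n test"

Out: ['This', 'is', 'a', '\n', 'test']

Is there a directly supported way to do this?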

Answer 1

You can use a regex:

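import re

text = "This is a\n test with\talso"
# Match either a run of characters that are not tab/newline, or a run of tabs/newlines.
pattern = re.compile(r'[^\t\n]+|[\t\n]+')

# Split on single spaces, apply findall to each piece, and flatten the results.
output = [val for values in map(pattern.findall, text.split(' ')) for val in values]
print(output)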

Output

['This', 'is', 'a', '\n', 'test', 'with', '\t', 'also']

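The idea is to split on single spaces first, then apply findall to each element of the resulting list. The pattern [^\t\n]+|[\t\n]+ matches either a run of one or more characters that are not a tab or newline, or a run of one or more tabs or newlines. Consecutive tabs and newlines therefore end up together in one token. If you want every tab and newline to be its own token, change the pattern to: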

import re

text = "This is a\n test\n\nwith\t\talso"
# Without the trailing +, each tab or newline is matched on its own.
pattern = re.compile(r'[^\t\n]+|[\t\n]')
output = [val for values in map(pattern.findall, text.split(' ')) for val in values]
print(output)

Output

['This', 'is', 'a', '\n', 'test', '\n', '\n', 'with', '\t', '\t', 'also']
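
As a possible variation on the two snippets above, the split step can be folded into the pattern by also excluding the space character from the word class, so a single findall call does all the work. This is only a sketch: the function name tokenize_keep_tabs_newlines and the group_runs parameter are made up for illustration and are not part of nltk or of the answer above.

```python
import re

# Hypothetical helper; the name and the group_runs flag are illustrative only.
def tokenize_keep_tabs_newlines(text, group_runs=False):
    """Return space-separated tokens, keeping tabs and newlines as tokens."""
    # Tabs/newlines are matched one at a time, or as whole runs if group_runs is True.
    whitespace_token = r'[\t\n]+' if group_runs else r'[\t\n]'
    # A word is a run of characters that are not space, tab, or newline.
    # Spaces match neither alternative, so findall simply skips over them.
    pattern = re.compile(r'[^ \t\n]+|' + whitespace_token)
    return pattern.findall(text)

text = "This is a\n test\n\nwith\t\talso"
print(tokenize_keep_tabs_newlines(text))
# ['This', 'is', 'a', '\n', 'test', '\n', '\n', 'with', '\t', '\t', 'also']
print(tokenize_keep_tabs_newlines(text, group_runs=True))
# ['This', 'is', 'a', '\n', 'test', '\n\n', 'with', '\t\t', 'also']
```

If an nltk-based version is preferred, nltk.tokenize.RegexpTokenizer takes a regular expression pattern, so the same pattern could presumably be passed to it; that route is not tested here.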
