Question

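I want to take a piece of text, limit it to a certain number of characters, and keep as many whole words as possible. What tools or libraries can I use to do this?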

For example, given this block of text:

Have you managed to get your hands on Nikon's elusive D4 full-frame DSLR? 
It should be smooth sailing from here, with the occasional firmware update being 
your only critical acquisition going forward. D4 firmware 1.02 brings a handful of 
minor fixes, but if you're in need of any of the enhancements listed below, it's 
surely a must have:

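If I assign this to a string and then do string = string[0:100], I get the first 100 characters, but the word "sailing" is cut off to "sailin". I would like the text to be cut just before or just after the space that precedes "sailing".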

Answer 1

Use a regular expression:

>>> import re
>>> re.match(r'(.{,100})\W', text).group(1)
"Have you managed to get your hands on Nikon's elusive D4 full-frame DSLR? It should be smooth"

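This approach breaks on any non-word character between words (punctuation as well as spaces), not just on spaces. The captured group holds 100 characters or fewer and ends right before a non-word character.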

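To handle short strings, the following regular expression is better. With the first pattern, a string shorter than 100 characters that ends in a word character loses its last word, and a string with no non-word character at all produces no match, so .group(1) raises AttributeError. Allowing the end of the string ($) as a boundary avoids both problems: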

re.match(r'(.{,100})(\W|$)', text).group(1)
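As a rough sketch, the pattern above could be wrapped in a small helper. The function name truncate_words and the limit parameter are made up for illustration, and the re.DOTALL flag is an addition (not part of the answer) so that "." also spans the line breaks in multi-line text. If the first `limit` characters contain no word boundary at all, the sketch returns an empty string instead of raising.

```python
import re

def truncate_words(text, limit=100):
    """Return at most `limit` leading characters of `text`, ending on a word boundary."""
    # Capture up to `limit` characters, followed by a non-word character
    # or the end of the string. re.DOTALL lets "." match newlines too.
    pattern = r'(.{,%d})(\W|$)' % limit
    match = re.match(pattern, text, re.DOTALL)
    # No match means the first `limit` characters are one unbroken word.
    return match.group(1) if match else ''

sample = ("Have you managed to get your hands on Nikon's elusive D4 "
          "full-frame DSLR? It should be smooth sailing from here")
print(truncate_words(sample))        # cut before "sailing"
print(truncate_words("short text"))  # short strings come back whole
```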

Answer 2

If you really do want to break the string on whitespace, use this:

my_string = my_string[:100].rsplit(None, 1)[0]

But keep in mind that you may actually want to break on more than just whitespace.
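One thing to watch with the one-liner: it always drops the last whitespace-separated chunk of the slice, even when the slice already ends on a complete word (for instance, when the string is shorter than 100 characters), and it raises IndexError on an empty or whitespace-only string. A possible guarded variant is sketched below; the name truncate_on_whitespace is hypothetical, and the sketch assumes limit is at least 1.

```python
def truncate_on_whitespace(text, limit=100):
    """Cut `text` to at most `limit` characters without splitting a word."""
    # Nothing to cut: keep the whole string, including its last word.
    if len(text) <= limit:
        return text
    # The cut splits a word only if the characters on both sides of the
    # cut position are non-whitespace.
    cut_inside_word = not text[limit - 1].isspace() and not text[limit].isspace()
    if not cut_inside_word:
        return text[:limit].rstrip()
    # Drop the partial word at the end of the slice.
    parts = text[:limit].rsplit(None, 1)
    # A single part means the slice is one unbroken (truncated) word.
    return parts[0] if len(parts) == 2 else ''

print(truncate_on_whitespace("aaa bbb", 5))  # cut falls inside "bbb"
print(truncate_on_whitespace("aaa bbb", 7))  # fits, returned unchanged
```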

Answer 3

This cuts the text at the last space within the first 100 characters, if there is one.

lastSpace = string[:100].rfind(' ')
string = string[:lastSpace] if (lastSpace != -1) else string[:100]
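Since the question also asks about tools and libraries, it may be worth noting that the standard library's textwrap.shorten does something close to the manual slicing above. This is a different technique from the answers here, shown only as a sketch: it first collapses every run of whitespace (including newlines) into a single space, and by default it appends a " [...]" placeholder, which is suppressed below by passing an empty string. The variable names are made up for illustration.

```python
import textwrap

sample = ("Have you managed to get your hands on Nikon's elusive D4 "
          "full-frame DSLR? It should be smooth sailing from here")

# Keep as many whole words as fit in 100 characters; placeholder=''
# turns off the default " [...]" suffix.
shortened = textwrap.shorten(sample, width=100, placeholder='')
print(shortened)

# Runs of whitespace are normalized to single spaces.
print(textwrap.shorten("a  b\nc", width=100, placeholder=''))
```

Whether the whitespace normalization is acceptable depends on the use case; the slicing approaches above leave the original spacing untouched.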

Extract all complete words within a certain number of characters

3 answers: