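I'm using Python and I'm looking for a way to join these characters into meaningful words so the text becomes readable. Sample input:
H o w  d o  s m a l l h o l d e r  f a r m e r s  f i t  i n t o  t h e  b i g  p i c t u r e  o f  w o r l d  f o o d  p r o d u c t i o n
Output
How do smallholder farmers fit into the big picture of world food production
One approach is to remove one whitespace at a time: where the line has two spaces, it keeps one.
Can anyone suggest other approaches?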
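Edit
See this line of text:
Inn ovative b usines s m odels and financi ng m e chanisms for pv de ploym ent in em ergi ng regio ns
This is my actual problem, so I can't simply remove the spaces. One idea, perhaps, is to match each group of characters against a dictionary to find the right words.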
Answer 0 (score: 7)
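import re

# Words are separated by two spaces, letters by one.
a = 'H o w  d o  sm a l l h o l d e r  f a r m e r s  f i t  i n t o  t h e  b i g  p i c t u r e  o f  w o r l d  f o o d p r o d u c t i o n'
# Drop the space that follows each character; the second space of a pair survives.
s = re.sub(r'(.) ', r'\1', a)
print(s)
# How do smallholder farmers fit into the big picture of world foodproduction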
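A possible variation on this regex idea, sketched here and not taken from the answer: lookarounds remove only the spaces that sit between two non-space characters, and a second pass collapses whatever runs of whitespace remain. The function name `join_spaced_letters` is hypothetical, and the sketch assumes letters are separated by exactly one space and words by two or more.

```python
import re

def join_spaced_letters(text):
    """Remove single spaces between letters, then collapse longer runs to one space."""
    # A space with a non-space character on both sides separates two letters
    # of the same word, so drop it.
    joined = re.sub(r'(?<=\S) (?=\S)', '', text)
    # The whitespace runs that remain are word boundaries; keep one space each.
    return re.sub(r'\s{2,}', ' ', joined).strip()

print(join_spaced_letters('H o w  d o  s m a l l h o l d e r  f a r m e r s'))
# How do smallholder farmers
```

As with the original regex, two words separated by only one space are merged, because nothing distinguishes that space from the ones between letters.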
Answer 1 (score: 1)
You could take the characters two at a time and strip the space, or append a space wherever there is supposed to be one....
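This answer gives no code. A minimal sketch of a closely related approach, which swaps the "two characters at a time" walk for a split on the double space, could look like the following. The name `collapse_spaced_letters` and the `word_sep` parameter are hypothetical, and the sketch assumes that words are separated by two spaces and letters by one.

```python
def collapse_spaced_letters(text, word_sep="  "):
    """Join single-spaced letters into words; `word_sep` marks word boundaries."""
    words = []
    for chunk in text.split(word_sep):
        # Remove the single spaces between the letters of one word.
        word = chunk.replace(" ", "")
        if word:  # skip empty chunks produced by longer runs of spaces
            words.append(word)
    return " ".join(words)

sample = "H o w  d o  s m a l l h o l d e r  f a r m e r s"
print(collapse_spaced_letters(sample))
# How do smallholder farmers
```

Splitting on the separator avoids the bookkeeping that fixed two-character chunks would need, since every double space shifts the position of the letters within the following chunks.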
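Answer 2 (score: 0)
Edit_2: The problem has changed and is a bit trickier. I'm leaving my answer to the previous problem in place, but it does not address the current one.
Current problem
Inn ovative b usines s m odels and financi ng m e chanisms for pv de ploym ent in em ergi ng regio ns
I suggest you use a dictionary of real words. There is an SO thread on this.
Then you can take your sentence (here Inn ovative b usines s m odels and financi ng m e chanisms for pv de ploym ent in em ergi ng regio ns) and split it on spaces (it seems this is the only separator the fragments have in common).
Here is a pseudo-code solution: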
iterating through the string list:
    keeping the currWord index
    while realWord not found:
        checking currWord in dictionary.
        if realWord is not found:
            join the nextWord to the currWord
        else:
            join currWord to the final sentence
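By doing this and keeping track of the currWord index you are at, you can log where you run into problems and add new rules to your word splitting. You may know you have a problem if a certain threshold is reached (e.g. a 30-character-long word?).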
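The pseudo-code can be turned into a small runnable sketch. Everything named here (`merge_fragments`, `toy_vocabulary`, the `max_len` threshold) is hypothetical and only illustrates the idea; a real run would use a proper word list in place of the toy set.

```python
def merge_fragments(fragments, vocabulary, max_len=30):
    """Greedily join consecutive fragments until they spell a known word.

    Returns the rebuilt sentence and the indices of the fragments for which
    no word was found, so that they can be logged and inspected.
    """
    words, problems = [], []
    i = 0
    while i < len(fragments):
        candidate = ''
        matched = False
        for j in range(i, len(fragments)):
            candidate += fragments[j]
            if len(candidate) > max_len:
                break  # threshold reached: give up on this starting fragment
            if candidate.lower() in vocabulary:
                words.append(candidate)
                i = j + 1
                matched = True
                break
        if not matched:
            # No real word found: keep the fragment as is and record its index.
            problems.append(i)
            words.append(fragments[i])
            i += 1
    return ' '.join(words), problems

toy_vocabulary = {'business', 'models', 'and', 'financing'}  # hypothetical word list
fragments = 'b usines s m odels and financi ng'.split()
sentence, problems = merge_fragments(fragments, toy_vocabulary)
print(sentence)  # business models and financing
print(problems)  # []
```

One limitation of accepting the first match: a fragment that is itself a word (such as "Inn" in the sample line) is accepted before the longer word is ever tried, so a real implementation would probably need to prefer longer matches or add rules for such cases.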
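Previous problem
Edit: you're right @Adelin, I should have commented.
In case it helps, here is a simpler program in which you can see what is happening, and/or in case you would rather not use regex for a simple, uniform case: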
def raw_char_to_sentence(seq):
    """Split "seq" on single spaces. As words are separated by two spaces,
    each word boundary shows up as an empty string in the resulting list;
    the characters in between are joined back into a full sentence.
    """
    char_list = seq.split(' ')
    sentence = ''
    word = ''
    for c in char_list:
        # Adding single character to current word.
        word += c
        if c == '':
            # If word is over, add it to sentence, and reset the current word.
            sentence += (word + ' ')
            word = ''
    # The last word is not followed by an empty string, so add it here.
    sentence += word
    # Strip the trailing space left if the input ends on a word boundary.
    return sentence.rstrip()

temp = "H o w  d o  s m a l l h o l d e r  f a r m e r s f i t  i n t o  t h e  b i g  p i c t u r e  o f  w o r l d  f o o d p r o d u c t i o n"
print(raw_char_to_sentence(temp))
# outputs: How do smallholder farmersfit into the big picture of world foodproduction
Answer 3 (score: 0)
First, get a list of words (a.k.a. a vocabulary), e.g. nltk.corpus.words:
>>> from nltk.corpus import words
>>> vocab = words.words()
Or
>>> from collections import Counter
>>> from nltk.corpus import brown
>>> vocab_freq = Counter(brown.words())
Convert the input into a string with no spaces:
>>> text = "H o w d o sm a l l h o l d e r f a r m e r s f i t i n t o t h e b i g p i c t u r e o f w o r l d f o o d p r o d u c t i o n"
>>> ''.join(text.lower().split())
'howdosmallholderfarmersfitintothebigpictureofworldfoodproduction'
Assumptions:
Code:
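from collections import Counter
from nltk.corpus import brown  # needs the Brown corpus: nltk.download('brown')

text = "H o w d o s m a l l h o l d e r f a r m e r s f i t i n t o t h e b i g p i c t u r e o f w o r l d f o o d p r o d u c t i o n"
# Alternative input from the edited question:
# text = "Inn ovative b usines s m odels and financi ng m e chanisms for pv de ploym ent in em ergi ng regio ns"

# Drop all whitespace; split() treats single and double spaces alike.
s = ''.join(text.lower().split())
vocab_freq = Counter(brown.words())
max_word_len = 10

words = []
# An i-th pointer moving forward.
i = 0
while i < len(s):
    # Try the longest candidate first (j runs from max_word_len down to 1).
    for j in range(max_word_len, 0, -1):
        # Check if word in vocab and frequency is > 0.
        if s[i:i+j] in vocab_freq and vocab_freq[s[i:i+j]] > 0:
            words.append(s[i:i+j])
            i = i+j
            break
    else:
        # No candidate matched: keep one character so the loop cannot stall.
        words.append(s[i])
        i += 1

print(' '.join(words))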
[OUT]:
how do small holder farmers fit into the big picture of world food production
Assumption 2 relies heavily on the corpus/vocabulary you have, so you can combine more corpora to get better results:
from collections import Counter
from nltk.corpus import brown, gutenberg, inaugural, treebank  # each corpus needs nltk.download(...)

# Summing the counters merges the vocabularies of all four corpora.
vocab_freq = Counter(brown.words()) + Counter(gutenberg.words()) + Counter(inaugural.words()) + Counter(treebank.words())

text = "Inn ovative b usines s m odels and financi ng m e chanisms for pv de ploym ent in em ergi ng regio ns"
s = ''.join(text.lower().split())
max_word_len = 10

words = []
# An i-th pointer moving forward.
i = 0
while i < len(s):
    # Try the longest candidate first (j runs from max_word_len down to 1).
    for j in range(max_word_len, 0, -1):
        # Check if word in vocab and frequency is > 0.
        if s[i:i+j] in vocab_freq and vocab_freq[s[i:i+j]] > 0:
            words.append(s[i:i+j])
            i = i+j
            break
    else:
        # No candidate matched: keep one character so the loop cannot stall.
        words.append(s[i])
        i += 1

print(' '.join(words))
[OUT]:
innovative business models and financing mechanisms for p v deployment in emerging regions
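To experiment with this approach without downloading any corpus, the loop can be wrapped in a function that takes the vocabulary as an argument. This is only a sketch: `segment` and `toy_vocab` are hypothetical names, and the tiny word set stands in for a real vocabulary. It also shows why "smallholder" comes out as "small holder" when the maximum word length is 10.

```python
def segment(s, vocab, max_word_len=10):
    """Greedy longest-match segmentation of a string that has no spaces."""
    words = []
    i = 0
    while i < len(s):
        # Try the longest candidate first, never reading past the end of s.
        for j in range(min(max_word_len, len(s) - i), 0, -1):
            if s[i:i + j] in vocab:
                words.append(s[i:i + j])
                i += j
                break
        else:
            # Nothing matched: emit one character so the loop always advances.
            words.append(s[i])
            i += 1
    return ' '.join(words)

toy_vocab = {'how', 'do', 'small', 'holder', 'smallholder', 'farmers', 'fit'}
s = 'howdosmallholderfarmersfit'
print(segment(s, toy_vocab, max_word_len=10))  # how do small holder farmers fit
print(segment(s, toy_vocab, max_word_len=11))  # how do smallholder farmers fit
```

Because the match is greedy, one wrong long match early on can throw off everything after it; a larger `max_word_len` helps with long words but does not remove that risk.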