Question

I'm using Python and I'm looking for a way to arrange these characters into complete, meaningful words so the text is easier to read. Example input:

H o w  d o  s m a l l  h o l d e r  f a r m e r s  f i t  i n t o  t h e  b i g  p i c t u r e  o f  w o r l d  f o o d  p r o d u c t i o n

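Output

How do smallholder farmers fit into the big picture of world food production

One approach is to remove one whitespace at a time: where the line has two spaces, it keeps one.

Can anyone suggest more approaches?

Edit

See this line of text: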

Inn ovative  b usines s  m odels  and  financi ng  m e chanisms  for  pv  de ploym ent  in  em ergi ng  regio ns

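This is my problem, so I can't simply remove the spaces. One idea is to match each group of characters against a dictionary and find the right words. Maybe.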

Answer 1

import re 

a = 'H o w   d o   sm a l l h o l d e r   f a r m e r s  f i t   i n t o   t h e   b i g   p i c t u r e   o f   w o r l d   f o o d p r o d u c t i o n'

s = re.sub(r'(.) ',r'\1',a)

print(s)

How do smallholder farmers fit into the big picture of world foodproduction
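
A variation on the same idea, as a rough sketch: split on runs of two or more spaces to get the words, then drop the single spaces inside each word. Like the regex above, it assumes word boundaries are always at least two spaces wide; `join_spaced_letters` is just an illustrative name.

```python
import re

def join_spaced_letters(text):
    # Runs of 2+ spaces are treated as word boundaries (an assumption about the input).
    spaced_words = re.split(r' {2,}', text.strip())
    # Inside each word, the remaining single spaces only separate letters.
    return ' '.join(w.replace(' ', '') for w in spaced_words)

print(join_spaced_letters('H o w   d o   sm a l l h o l d e r   f a r m e r s  f i t'))
# How do smallholder farmers fit
```

Neither version can recover a boundary that is only one space wide, which is why "foodproduction" stays glued together.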

Answer 2

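You can take the characters two at a time, then strip the space from each pair, or append a space for the pairs that are meant to be a space....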
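
A minimal sketch of that idea, under a strong assumption: every letter is followed by exactly one space and words are separated by exactly three spaces (as in the string used in Answer 1), so the pairs stay aligned. `pairs_to_sentence` is a made-up name for illustration.

```python
def pairs_to_sentence(text):
    out = []
    # Walk the string two characters at a time.
    for k in range(0, len(text), 2):
        chunk = text[k:k + 2].strip()
        # A chunk that is all spaces marks a word boundary.
        out.append(chunk if chunk else ' ')
    return ''.join(out)

print(pairs_to_sentence('H o w   d o   s m a l l'))
# How do small
```

With two-space gaps, as in the original question, the pairs fall out of step and the word boundaries are lost, so this only suits perfectly uniform input.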


Answer 3

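Edit_2: The question has changed and is a bit trickier. I'm leaving this answer up for the previous question, but it doesn't address the real problem.

Current question

Inn ovative b usines s m odels and financi ng m e chanisms for pv de ploym ent in em ergi ng regio ns

I suggest you use a dictionary of real words. Here is an SO thread about it.

Then you can take your sentence (here Inn ovative b usines s m odels and financi ng m e chanisms for pv de ploym ent in em ergi ng regio ns) and split it on spaces (it looks like this character is the only thing the fragments have in common).

Here is a pseudocode solution: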

iterating through the string list:
    keeping the currWord index
    while realWord not found:
        checking currWord in dictionary.
        if realWord is not found:
            join the nextWord to the currWord
        else:
            join currWord to the final sentence

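By doing this and keeping track of the currWord index you are at, you can log where you run into problems and add new rules for your word splitting. You can tell you have a problem if you reach a certain threshold (for example, a 30-character-long word?).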
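
The pseudocode could be fleshed out roughly as follows. This is only a sketch: `merge_fragments`, its `max_len` threshold, and the tiny hand-written `vocab` are placeholders for illustration, and a real word list would take the place of the toy set.

```python
def merge_fragments(fragments, vocab, max_len=30):
    """Join consecutive fragments until they form a word found in vocab.

    Returns the rebuilt sentence and a list of (index, fragment) pairs
    for positions where no word was found within max_len characters.
    """
    words, problems = [], []
    i = 0
    while i < len(fragments):
        curr_word = ''
        j = i
        found = False
        # Keep joining the next fragment until a real word shows up
        # or the candidate grows past the threshold.
        while j < len(fragments) and len(curr_word) + len(fragments[j]) <= max_len:
            curr_word += fragments[j]
            j += 1
            if curr_word.lower() in vocab:
                found = True
                break
        if found:
            words.append(curr_word)
            i = j
        else:
            # Log the problem and keep the fragment unchanged.
            problems.append((i, fragments[i]))
            words.append(fragments[i])
            i += 1
    return ' '.join(words), problems


# Toy vocabulary for illustration only; a real word list would replace it.
vocab = {"innovative", "business", "models", "and", "financing", "mechanisms",
         "for", "pv", "deployment", "in", "emerging", "regions"}
text = "Inn ovative b usines s m odels and financi ng m e chanisms for pv de ploym ent in em ergi ng regio ns"
sentence, problems = merge_fragments(text.split(), vocab)
print(sentence)
# Innovative business models and financing mechanisms for pv deployment in emerging regions
print(problems)
# []
```

Note that the first match wins, so with fragments like "in" and "to" the word "in" is accepted before "into" is ever tried. That is the kind of case where the logged indexes help you decide which extra rules to add.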

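Previous question

Edit: You're right @Adelin, I should have commented.

If it helps, here is a simpler program where you can follow what is happening, and/or for when you would rather not use a regex for a simple, uniform case: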

def raw_char_to_sentence(seq):
    """ Splits the "seq" parameter using 'space'. As words are separated with two spaces,
        "raw_char_to_sentence" transforms this list of characters into a full string
        sentence.
    """
    char_list = seq.split(' ')

    sentence = ''
    word = ''
    for c in char_list:
        # Adding single character to current word.
        word += c
        if c == '':
            # If word is over, add it to sentence, and reset the current word.
            sentence += (word + ' ')
            word = ''

    # The last word is not followed by a double space, so add it here.
    sentence += word

    # Strip any trailing space (present when the input ends with a separator).
    return sentence.rstrip()

temp = "H o w  d o  s m a l l h o l d e r  f a r m e r s f i t  i n t o  t h e  b i g  p i c t u r e  o f  w o r l d  f o o d p r o d u c t i o n"
print(raw_char_to_sentence(temp))
# outputs : How do smallholder farmersfit into the big picture of world foodproduction

Answer 4

First, get a list of words (i.e. a vocabulary), e.g. nltk.corpus.words:

>>> from nltk.corpus import words
>>> vocab = words.words()

or

>>> from collections import Counter
>>> from nltk.corpus import brown
>>> vocab_freq = Counter(brown.words())

Convert the input into a string with no spaces:

>>> text = "H o w d o sm a l l h o l d e r f a r m e r s f i t i n t o t h e b i g p i c t u r e o f w o r l d f o o d p r o d u c t i o n"
>>> ''.join(text.lower().split())
'howdosmallholderfarmersfitintothebigpictureofworldfoodproduction'

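Assumptions:

1. The longer a word is, the more likely it is to be a real word
2. Words that are not in the vocabulary are not words

Code: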

from collections import Counter

from nltk.corpus import brown

text = "H o w d o s m a l l h o l d e r f a r m e r s f i t i n t o t h e b i g p i c t u r e o f w o r l d f o o d p r o d u c t i o n"
s = ''.join(text.lower().split())

vocab_freq = Counter(brown.words())

max_word_len = 10

words = []
# A i-th pointer moving forward.
i = 0
while i < len(s):
    # Try the longest candidate first, down to a single character.
    for j in reversed(range(1, max_word_len+1)):
        # Check if word in vocab and frequency is > 0.
        if s[i:i+j] in vocab_freq and vocab_freq[s[i:i+j]] > 0:
            words.append(s[i:i+j])
            i = i+j
            break
    else:
        # No candidate matched: keep the character as-is so the loop always advances.
        words.append(s[i])
        i += 1

print(' '.join(words))

[OUT]:

how do small holder farmers fit into the big picture of world food production

Assumption 2 depends heavily on the corpus/vocabulary you have, so you can combine more corpora to get better results:

from collections import Counter

from nltk.corpus import brown, gutenberg, inaugural, treebank

vocab_freq = Counter(brown.words()) + Counter(gutenberg.words()) + Counter(inaugural.words()) + Counter(treebank.words())

text = "Inn ovative b usines s m odels and financi ng m e chanisms for pv de ploym ent in em ergi ng regio ns"
s = ''.join(text.lower().split())

max_word_len = 10

words = []
# A i-th pointer moving forward.
i = 0
while i < len(s):
    # Try the longest candidate first, down to a single character.
    for j in reversed(range(1, max_word_len+1)):
        # Check if word in vocab and frequency is > 0.
        if s[i:i+j] in vocab_freq and vocab_freq[s[i:i+j]] > 0:
            words.append(s[i:i+j])
            i = i+j
            break
    else:
        # No candidate matched: keep the character as-is so the loop always advances.
        words.append(s[i])
        i += 1

print(' '.join(words))

[OUT]:

innovative business models and financing mechanisms for p v deployment in emerging regions
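
Greedy longest-match can commit to a long word too early and then get stuck on what is left. One possible alternative (a different technique from the code above, shown only as a sketch) is dynamic programming over word frequencies, picking the split with the highest total log-probability. The `segment` function and the toy `freq` table below are made up for illustration; in practice `freq` could be a Counter built from a corpus.

```python
import math

def segment(s, freq, max_word_len=10):
    """Return the most probable split of s into words from freq, or None."""
    total = sum(freq.values())
    # best[end] = (score, start) of the best segmentation of s[:end].
    best = [(0.0, 0)] + [(-math.inf, 0)] * len(s)
    for end in range(1, len(s) + 1):
        for start in range(max(0, end - max_word_len), end):
            word = s[start:end]
            if word in freq and best[start][0] > -math.inf:
                score = best[start][0] + math.log(freq[word] / total)
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[len(s)][0] == -math.inf:
        return None  # no way to cover the whole string with known words
    # Walk the stored start positions backwards to recover the words.
    words = []
    end = len(s)
    while end > 0:
        start = best[end][1]
        words.append(s[start:end])
        end = start
    return words[::-1]


# Toy counts for illustration only.
freq = {"the": 100, "theme": 5, "men": 20, "me": 30}
print(segment("themen", freq))
# ['the', 'men']
```

Here longest-match would take "theme" first and be left with an unknown "n", while the dynamic-programming version finds a split that covers the whole string.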

Remove spaces from words and generate the exact words
