Remove paragraphs and save everything on one line

Time: 2016-06-01 04:55:12

Tags: python regex python-2.7

Hello, I'm not sure how to explain this, but I have the following problem. Currently I have some text that looks like this:

 picture gallery 
    see also 
    adaptation
    ecology
    extreme environment clothing
    extremophile
    lexen life in extreme environments
    natural environment
    references 
    "extreme environment" microbial life np nd web 16 may 2013
    feminism and gis refers to the use of geographic information system gis for feminist research and also how women influence gis at technological stages feminist gis research is aware of power differences in social and economic realms
     history 

My question is how to transform it so that I get a result like this, for example:

picture gallery see also adaptation ecology extreme environment clothing extremophile lexen life in extreme environments natural environment references "extreme environment" microbial life np nd web 16 may 2013 feminism and gis refers to the use of geographic information system gis for feminist research and also how women influence gis at technological stages feminist gis research is aware of power differences in social and economic realms

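I'm not sure what this is called, but the only solutions I have found so far remove all whitespace, which is not what I need.

Please help me.

Thanks.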

3 Answers:

Answer 0: (score: 4)

If it is a file:

with open('path/to/your/file.txt') as f:
    text = f.read()
new_text = text.replace('\n', ' ')
print(new_text)  # this will have no newlines
with open('output.txt', 'w') as out:
    out.write(new_text)  # this writes it to a file
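
Note that replacing only '\n' removes the line breaks but keeps the indentation of each line, so the result still contains runs of spaces. A minimal sketch of that effect, using a made-up three-line sample (the name `sample` is illustrative and not from the question):

```python
# Hypothetical sample shaped like the question's text:
# trailing spaces, newlines, and indented continuation lines.
sample = "picture gallery \n    see also \n    adaptation\n"

# Replace only the newline characters, as in the snippet above.
only_newlines = sample.replace('\n', ' ')

print(repr(only_newlines))

# The newlines are gone, but the indentation survives as runs of spaces.
print('\n' in only_newlines)   # False
print('  ' in only_newlines)   # True
```

If single spaces between words are required, the whitespace runs need to be collapsed as well, not just the newlines.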

You can also use a regular expression, as PJSCopeland said:

import re
s = "Example String \n more example string"
replaced = re.sub(r'\s+', ' ', s)  # raw string keeps \s intact for the regex engine
print(replaced)

Dilettant's solution is concise, correct, and (by my measurements) faster than using a regular expression, so I recommend it as the best solution:

filtered = ' '.join(text.strip().split())
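
To make the one-liner easier to follow, here is a step-by-step sketch on a shortened, made-up sample (the `text` value below is an assumption standing in for the real input):

```python
# Shortened stand-in for the question's text.
text = "  picture gallery \n    see also \n     history \n"

# With no argument, split() breaks on runs of any whitespace
# (spaces, tabs, newlines) and never yields empty strings.
words = text.split()

# Joining with a single space puts exactly one space between words.
filtered = ' '.join(words)

print(words)
print(filtered)

# strip() is harmless but not strictly needed, because split() without
# an argument already ignores leading and trailing whitespace.
assert ' '.join(text.strip().split()) == filtered
```

The explicit strip() in the one-liner can be kept for readability; it does not change the result.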

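Answer 1: (score: 2)

Note that the result given in the question is rather creative at the right edge: it drops "history" from the input data ;-) Update: the latest comments indicate that the data comes from a file, so I have updated the answer.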

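Treating that as an unintended glitch, I suggest not using regex replacement. Simply perform a strip-split-join transformation in one go (assuming the text is in a file named in.txt in the current folder):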

#! /usr/bin/env python

with open('in.txt', 'rt') as f:
    filtered = ' '.join(f.read().strip().split())

Or, if the text is already in a variable (with an expected value and a comparison as a minimal test):

#! /usr/bin/env python

text = '''picture gallery 
    see also 
    adaptation
    ecology
    extreme environment clothing
    extremophile
    lexen life in extreme environments
    natural environment
    references 
    "extreme environment" microbial life np nd web 16 may 2013
    feminism and gis refers to the use of geographic information system gis for feminist research and also how women influence gis at technological stages feminist gis research is aware of power differences in social and economic realms
     history 
'''

expected = (
    'picture gallery see also adaptation ecology extreme environment'
    ' clothing extremophile lexen life in extreme environments'
    ' natural environment references "extreme environment" microbial'
    ' life np nd web 16 may 2013 feminism and gis refers to the use'
    ' of geographic information system gis for feminist research and'
    ' also how women influence gis at technological stages feminist'
    ' gis research is aware of power differences in social and'
    ' economic realms history')

filtered = ' '.join(text.strip().split())

assert filtered == expected

If you need a newline at the end of the "one-line" result, you can write this instead:

filtered = '%s\n' % (' '.join(text.strip().split()),)

or:

filtered = ' '.join(text.strip().split()) + '\n'

In that case, of course, the assertion or the expected variable should be changed accordingly ;-)
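
Combining the file variant with the trailing newline, the whole round trip could be wrapped in a small function. This is only a sketch: the function name `collapse_file` and the temporary file names are made up for illustration, and it assumes the file is small enough to read into memory at once.

```python
import os
import tempfile


def collapse_file(src_path, dst_path):
    """Read src_path, collapse whitespace runs, write one line to dst_path."""
    with open(src_path, 'rt') as src:
        filtered = ' '.join(src.read().strip().split())
    with open(dst_path, 'wt') as dst:
        dst.write(filtered + '\n')  # single trailing newline
    return filtered


# Demonstration with temporary files (hypothetical names and content).
workdir = tempfile.mkdtemp()
in_path = os.path.join(workdir, 'in.txt')
out_path = os.path.join(workdir, 'out.txt')
with open(in_path, 'wt') as f:
    f.write(' picture gallery \n    see also \n     history \n')

result = collapse_file(in_path, out_path)
print(result)
```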

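This should also be a logically clear solution. Regular expressions are often tempting, but they add runtime cost (and another embedded language) in cases where a simple split-join pipeline like this one produces the same result.

Just measure with the setup above, and then with a setup adapted for the regex: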

import timeit  # setup / setup_re: timeit setup strings defining `text` as above (setup_re also imports re)
print('strip-split-join:   %s' % ['%0.4f' % z for z in timeit.Timer("filtered = ' '.join(text.strip().split())", setup=setup).repeat(7, 1000)])
print(r're.sub("\s+", " "): %s' % ['%0.4f' % z for z in timeit.Timer(r"filtered = replaced = re.sub(r'\s+', ' ', text)", setup=setup_re).repeat(7, 1000)])

This gives (on my machine):

strip-split-join:   ['0.0043', '0.0045', '0.0047', '0.0046', '0.0043', '0.0040', '0.0045']
re.sub("\s+", " "): ['0.0265', '0.0254', '0.0246', '0.0248', '0.0238', '0.0255', '0.0266']

So the regex solution is slower by a factor of roughly 5.
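
The `setup` and `setup_re` strings are not shown above. A self-contained version of the measurement might look like the following sketch; the sample text is a shortened, assumed stand-in for the real input, so the absolute numbers will differ from those quoted.

```python
import timeit

# Assumed setup strings: a shortened stand-in for the sample text,
# repeated so that each timed statement has some work to do.
setup = "text = ' picture gallery \\n    see also \\n     history \\n' * 20"
setup_re = "import re\n" + setup

stmt_join = "filtered = ' '.join(text.strip().split())"
stmt_re = r"filtered = re.sub(r'\s+', ' ', text)"

# Seven repetitions of 1000 executions each, as in the measurement above.
join_times = timeit.Timer(stmt_join, setup=setup).repeat(7, 1000)
re_times = timeit.Timer(stmt_re, setup=setup_re).repeat(7, 1000)

print('strip-split-join: %s' % ['%0.4f' % t for t in join_times])
print('re.sub:           %s' % ['%0.4f' % t for t in re_times])
```

When comparing, the minimum of each list is usually the most stable figure, since the larger values mostly reflect other activity on the machine.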

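Answer 2: (score: 1)

Replace /\s+/g (every instance of at least one whitespace character) with " ". (Unfortunately, I'm not familiar with Python, so I don't know what the method call would be.)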
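
In Python, that replacement would most likely be written with `re.sub` from the standard library. A minimal sketch, using a shortened, made-up sample string `s`:

```python
import re

# Shortened stand-in for the question's text.
s = " picture gallery \n    see also \n     history \n"

# re.sub replaces every non-overlapping match by default, so no
# equivalent of the /g flag is needed.
replaced = re.sub(r'\s+', ' ', s)

# Leading and trailing whitespace runs each become a single space,
# so strip() is added here to trim them.
one_line = replaced.strip()

print(repr(replaced))
print(repr(one_line))
```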