Splitting joined words at capital letters

Time: 2016-03-05 22:35:02

Tags: python regex

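Using Python, I have to write a script that basically "cleans up" a text file of data. So far I have removed all unwanted characters or replaced them with acceptable ones (for example, a dash - can be replaced with a space). Now I have reached the point where I need to separate words that are joined together. Here is a snippet of the first 15 lines of the text file: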

AccessibleComputing  Computer accessibility
AfghanistanHistory  History of Afghanistan
AfghanistanGeography  Geography of Afghanistan
AfghanistanPeople  Demographics of Afghanistan
AfghanistanCommunications  Communications in Afghanistan
AfghanistanMilitary  Afghan Armed Forces
AfghanistanTransportations  Transport in Afghanistan
AfghanistanTransnationalIssues  Foreign relations of Afghanistan
AssistiveTechnology  Assistive technology
AmoeboidTaxa  Amoeba
AsWeMayThink  As We May Think
AlbaniaHistory  History of Albania
AlbaniaPeople  Demographics of Albania
AlbaniaEconomy  Economy of Albania
AlbaniaGovernment  Politics of Albania

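What I want to do is separate the joined words at the points where a capital letter appears. For example, I want the first line to look like this: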

Accessible Computing  Computer accessibility

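The script must take a file as input and write the result to an output file. This is what I have so far, and it doesn't work at all! (I'm not sure whether I'm even on the right track.)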

import re

input_file = open("C:\\Users\\Lucas\\Documents\\Python\\pagelinkSample_10K_cleaned2.txt",'r')
output_file = open("C:\\Users\\Lucas\\Documents\\Python\\pagelinkSample_10K_cleaned3.txt",'w')

for line in input_file:
    if line.contains('A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z'):
        newline = line.

output_file.write(newline)

input_file.close()
output_file.close()

3 Answers:

Answer 0 (score: 1)

This isn't the best approach, but it is simple.

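# ascii_uppercase exists in both Python 2 and 3; string.uppercase is Python 2 only
from string import ascii_uppercase

s = 'AccessibleComputing Computer accessibility'

# Put a space before every capital letter except the first character of a word
>>> ' '.join(''.join(' ' + c if n and c in ascii_uppercase else c
                     for n, c in enumerate(word))
             for word in s.split())
'Accessible Computing Computer accessibility'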

By the way, this is how you should read and write files:

f_in = "C:\\Users\\Lucas\\Documents\\Python\\pagelinkSample_10K_cleaned2.txt"
f_out = "C:\\Users\\Lucas\\Documents\\Python\\pagelinkSample_10K_cleaned3.txt"

def func(line):
    processed_line = ... # your line processing function
    return processed_line

with open(f_in, 'r') as fin:
    with open(f_out, 'w') as fout:
        for line in fin:  # iterate lazily instead of loading everything with readlines()
            fout.write(func(line))
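
As a minimal sketch of what the line-processing function could look like, the one-liner from this answer can be wrapped in a function. The name `split_at_capitals` is hypothetical, and the sketch assumes Python 3 and plain ASCII capitals. Because it relies on `str.split()`, it collapses runs of whitespace into single spaces and strips the trailing newline, so the newline is added back explicitly.

```python
from string import ascii_uppercase


def split_at_capitals(line):
    """Insert a space before each capital letter that is not the first character of a word."""
    words = line.split()  # drops the trailing newline and collapses repeated spaces
    spaced = (
        ''.join(' ' + c if i and c in ascii_uppercase else c
                for i, c in enumerate(word))
        for word in words
    )
    return ' '.join(spaced) + '\n'  # restore the newline that split() removed


# One line taken from the question's sample data
print(split_at_capitals('AsWeMayThink  As We May Think\n'), end='')
```

Passing this function in place of `func` in the file loop would then process the file line by line.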

Answer 1 (score: 1)

I suggest splitting the words with the following regular expression:

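import re

input_file = 'input.txt'
output_file = 'output.txt'

# A capitalized word, or otherwise any run of non-whitespace characters.
# Compiled once, outside the loop.
p = re.compile(r'[A-Z][a-z]+|\S+')

with open(input_file, 'r') as f_in:
    with open(output_file, 'w') as f_out:
        for line in f_in:
            matches = p.findall(line)
            # Text mode already translates '\n' to the platform line ending,
            # so os.linesep is not needed here.
            f_out.write(' '.join(matches) + '\n')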

Assuming input.txt contains the text you pasted in your post, output.txt will then contain:

Accessible Computing Computer accessibility
Afghanistan History History of Afghanistan
Afghanistan Geography Geography of Afghanistan
Afghanistan People Demographics of Afghanistan
Afghanistan Communications Communications in Afghanistan
Afghanistan Military Afghan Armed Forces
Afghanistan Transportations Transport in Afghanistan
Afghanistan Transnational Issues Foreign relations of Afghanistan
Assistive Technology Assistive technology
Amoeboid Taxa Amoeba
As We May Think As We May Think
Albania History History of Albania
Albania People Demographics of Albania
Albania Economy Economy of Albania
Albania Government Politics of Albania
...
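
To see where this pattern does and does not split, here is a small check. Only the first sample comes from the question's data; the other two strings are made-up inputs used purely for illustration.

```python
import re

# The pattern from this answer: a capitalized word, else any non-whitespace run
pattern = re.compile(r'[A-Z][a-z]+|\S+')

samples = [
    'AsWeMayThink  As We May Think',  # line from the question's data
    'HTMLParser docs',                # hypothetical: a run of capitals
    'iPhoneApps list',                # hypothetical: word starting in lowercase
]

for text in samples:
    print(pattern.findall(text))
```

The first sample is split into eight separate words. The two hypothetical tokens `HTMLParser` and `iPhoneApps` come back unsplit, because the first alternative cannot match at their first character and `\S+` then consumes the whole token. Joining the matches with a single space also collapses the double space that separates the two columns in the input.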

Answer 2 (score: 0)

You can do it like this:

re.sub(r'(?P<end>[a-z])(?P<start>[A-Z])', r'\g<end> \g<start>', line)

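This inserts a space wherever a lowercase letter is immediately followed by an uppercase letter (assuming you only have English characters).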
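
Put together with file handling, this substitution could look roughly like the sketch below. The function name `split_file` and the temporary file names are assumptions made for the demonstration, which assumes Python 3; the real paths from the question would be passed in instead.

```python
import os
import re
import tempfile

# Same pattern as in this answer: a lowercase letter directly followed by an uppercase one
BOUNDARY = re.compile(r'(?P<end>[a-z])(?P<start>[A-Z])')


def split_file(src_path, dst_path):
    """Copy src_path to dst_path, inserting a space at each lowercase/uppercase boundary."""
    with open(src_path, 'r') as fin, open(dst_path, 'w') as fout:
        for line in fin:
            fout.write(BOUNDARY.sub(r'\g<end> \g<start>', line))


# Demonstration on a temporary file rather than the real data set
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, 'in.txt')
    dst = os.path.join(tmp, 'out.txt')
    with open(src, 'w') as f:
        f.write('AccessibleComputing  Computer accessibility\n')
    split_file(src, dst)
    with open(dst) as f:
        result = f.read()

print(repr(result))
```

Since the substitution only touches the boundary between the two letters, the rest of the line, including the double space between the columns and the trailing newline, is written out unchanged.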