Using regular expressions to split a txt file into multiple new files

Date: 2017-02-07 05:10:47

Tags: python regex

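I'm calling on the collective wisdom of Stack Overflow because I really want to work out how to do this, and I'm a self-taught novice.

I have a txt file of letters to the editor that I need to split into separate files, one per letter.

The letters in the file are all formatted like the following: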

For once, before offering such generous but the unasked for advice, put yourselves in...

Who has Israel to talk to? The cowardly Jordanian monarch? Egypt, a country rocked...

Why is it that The Times does not urge totalitarian Arab slates and terrorist...

PAUL STONEHILL Los Angeles

There you go again. Your editorial again makes groundless criticisms of the Israeli...

On Dec. 7 you called proportional representation “bizarre," despite its use in the...

Proportional representation distorts Israeli politics? Huh? If Israel changes the...

MATTHEW SHUGART Laguna Beach

Was Mayor Tom Bradley’s veto of the expansion of the Westside Pavilion a political...

Although the mayor did not support Proposition U (the slow-growth initiative) his...

If West Los Angeles is any indication of the no-growth policy, where do we go from here?

MARJORIE L. SCHWARTZ Los Angeles

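I think the best approach is to use a regular expression to identify lines that start with words in all capitals, since that is the only real indication of where one letter ends and the next begins.

I have tried many different approaches, but none of them worked well. All the other answers I have seen rely on a repeating line or word (for example, the answers posted at "how to split single txt file into multiple txt files by Python" and at "Python read through file until match, read until next pattern"). None of them seems to work once I adapt it to use my regular expression for all-caps words.

The closest I have come is the code below. It creates the correct number of files, but everything goes wrong after the second file is created. The third file is empty, and the rest of the text is out of order and/or incomplete. Paragraphs that should be in file 4 end up in file 5 or file 7, or are missing entirely.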

import re
thefile = raw_input('Filename to split: ')
name_occur = [] 
full_file = []
pattern = re.compile("^[A-Z]{4,}")

with open (thefile, 'rt') as in_file:
    for line in in_file:
        full_file.append(line)
        if pattern.search(line):
            name_occur.append(line) 

totalFiles = len(name_occur)
letters = 1
thefile = re.sub("(.txt)","",thefile)

while letters <= totalFiles:
    f1 = open(thefile + '-' + str(letters) + ".txt", "a")
    doIHaveToCopyTheLine = False
    ignoreLines = False
    for line in full_file:
        if not ignoreLines:
            f1.write(line)
            full_file.remove(line)
        if pattern.search(line):
            doIHaveToCopyTheLine = True
            ignoreLines = True
    letters += 1
    f1.close()

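I'm willing to scrap this approach entirely and do it another way (but still in Python). Any help or advice would be greatly appreciated. Please assume I'm an inexperienced novice if you are kind enough to take the time to help me.

3 answers:

Answer 0 (score: 1)

While the other answer is suitable, you may still be curious about using a regular expression to split the file.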

import re

# A signature line starts with a run of capitals, whitespace, and periods
# (the writer's name) that ends at a word boundary.
signature = re.compile(r'^([A-Z\s\.]+\b)')

buf = ""
with open('input_file.txt', 'rt') as f:
    for line in f:
        buf += line
        match = signature.search(line)
        if match is not None:
            # The signature closes the current letter: write everything
            # buffered so far to a file named after the writer.
            smallfile_name = '{}.txt'.format(match.group(1).strip())
            with open(smallfile_name, 'w') as smallfile:
                smallfile.write(buf)
            buf = ""
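
To see which lines this pattern treats as the end of a letter, here is a minimal check against a few lines from the sample. The `writer_name` helper and the `sample_lines` list are names made up for this illustration and are not part of the code above.

```python
import re

# Same pattern as above: a leading run of capitals, whitespace, and periods.
signature = re.compile(r'^([A-Z\s\.]+\b)')

def writer_name(line):
    """Return the all-caps name at the start of a line, or None if there is none."""
    match = signature.search(line)
    return match.group(1).strip() if match else None

# Hypothetical test data: two signature lines, one paragraph, one blank line.
sample_lines = [
    "For once, before offering such generous but the unasked for advice...\n",
    "PAUL STONEHILL Los Angeles\n",
    "MARJORIE L. SCHWARTZ Los Angeles\n",
    "\n",
]

for line in sample_lines:
    print(repr(writer_name(line)))
```

One caveat: a paragraph that begins with a one-letter capitalized word such as "A" or "I" also matches, so it may be worth requiring a minimum name length if your letters contain such lines.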

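Answer 1 (score: 1)

Answer 2 (score: 0)

I took a simpler approach and avoided regular expressions. The strategy is basically to count the capital letters in the first three words of a line and check that they pass some logic. I went with the first word being uppercase and either the second or third word being uppercase as well, but you can adjust this as needed. Each letter is then written to a new file with the same name as the original file, with an incrementing integer appended (note: this assumes your file has an extension such as .txt). Give it a try and see how it works for you.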

import string

def split_letters(fullpath):
    current_letter = []
    letter_index = 1
    fullpath_base, fullpath_ext = fullpath.rsplit('.', 1)

    with open(fullpath, 'r') as letters_file:
        letters = letters_file.readlines()
    for line in letters:
        words = line.split()
        upper_words = []
        for word in words:
            upper_word = ''.join(
                c for c in word if c in string.ascii_uppercase)
            upper_words.append(upper_word)

        len_upper_words = len(upper_words)
        first_word_upper = len_upper_words and len(upper_words[0]) > 1
        second_word_upper = len_upper_words > 1 and len(upper_words[1]) > 1
        third_word_upper = len_upper_words > 2 and len(upper_words[2]) > 1
        if first_word_upper and (second_word_upper or third_word_upper):
            current_letter.append(line)
            new_filename = '{0}{1}.{2}'.format(
                fullpath_base, letter_index, fullpath_ext)
            with open(new_filename, 'w') as new_letter:
                new_letter.writelines(current_letter)
            current_letter = []
            letter_index += 1

        else:
            current_letter.append(line)

I tested it on your sample input and it worked fine.
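
If you want to tune the rule, it may help to pull the check out of `split_letters` and try it on single lines. The sketch below applies the same test; `looks_like_signature` and its inner `is_upper` are hypothetical names introduced only for this example.

```python
import string

def looks_like_signature(line):
    """Apply the same uppercase test as split_letters to a single line."""
    # Keep only the uppercase ASCII letters of each word.
    upper_words = [
        ''.join(c for c in word if c in string.ascii_uppercase)
        for word in line.split()
    ]

    def is_upper(i):
        # A word counts as uppercase if it has more than one capital letter.
        return len(upper_words) > i and len(upper_words[i]) > 1

    # First word uppercase, plus either the second or the third.
    return is_upper(0) and (is_upper(1) or is_upper(2))

print(looks_like_signature("MARJORIE L. SCHWARTZ Los Angeles"))
print(looks_like_signature("Who has Israel to talk to?"))
```

Checking the third word as well is what lets a name with a middle initial, such as "MARJORIE L. SCHWARTZ", pass even though "L." contains only one capital letter.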