Question

I have a UTF-8 .txt file: greek.txt

Blessed is a Man

1. μακάριος
ανήρ
2. ότι
γινώσκει
κύριος

I would like to get: greek_r.txt

Blessed is a Man

1. μακάριος ανήρ
2. ότι γινώσκει κύριος

I tried:

# -*- coding: utf-8 -*-
import io
import re
f1 = io.open('greek.txt','r',encoding='utf8')
f2 = io.open('greek_r.txt','w',encoding='utf8')

for line in f1:
    f2.write(re.sub(r'\n((?=^[^\d]))', r'\1', line))

f1.close()
f2.close()

But it does not work. Any ideas?

Answer 1

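You are reading the input file line by line, so your regex cannot "see" across lines: \n is the last character in each line. The lookahead (?=^[^\d]) also makes no sense here, because it requires the start of the string, followed by a character other than a digit, at a position that comes right after \n.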
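To illustrate the point above, here is a minimal Python 3 sketch. The `sample` string is a hypothetical in-memory stand-in for a few lines of greek.txt, and the variable names are made up for this example; the pattern is the one from the question.

```python
import re

# Hypothetical in-memory stand-in for a few lines of greek.txt.
sample = "1. μακάριος\nανήρ\n2. ότι\n"

# The pattern from the question.
pattern = r'\n((?=^[^\d]))'

# Iterating over a file yields one line at a time, each ending in "\n".
lines = sample.splitlines(True)

# Per line, "\n" is the last character, so nothing can follow it.
unchanged_per_line = all(re.sub(pattern, r'\1', line) == line for line in lines)

# Even on the whole text, "^" (without re.M) only matches at position 0,
# so the lookahead can never succeed right after a "\n".
unchanged_whole = re.sub(pattern, r'\1', sample) == sample

print(unchanged_per_line, unchanged_whole)
```

Both checks come out true: the substitution changes nothing, whether it is applied per line or to the whole text.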

Use something like:

import re, io
with io.open('greek.txt','r',encoding='utf8') as f1:
    with io.open('greek_r.txt','w',encoding='utf8') as f2:
        f2.write(re.sub(r'\r?\n(\D)', r' \1', f1.read()))

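The \r? is added to match an optional CR symbol (if the line breaks are Windows style). r'\r?\n(\D)' can be replaced with r'(?u)\r?\n([^\W\d_])' to match only line breaks that are followed by a letter ([^\W\d_] matches any character other than non-word characters, digits and _, that is, any letter). (?u) is the inline version of the re.U modifier, used to match any Unicode letter in Python 2.x (in Python 3, this is the default). Note that \D also matches a newline, so the first pattern joins a blank line to the line before it as well; the letter-only pattern leaves blank lines alone.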

Output:

Blessed is a Man

1. μακάριος ανήρ
2. ότι γινώσκει κύριος
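The two patterns mentioned in this answer can be compared without touching any files. The following is a minimal Python 3 sketch; `sample` is an in-memory copy of the text from the question (assuming it ends with a single newline), and the helper names `join_non_digit` and `join_letters` are hypothetical, introduced only for this comparison.

```python
import re

# In-memory copy of the sample text (assumed to end with one newline).
sample = "Blessed is a Man\n\n1. μακάριος\nανήρ\n2. ότι\nγινώσκει\nκύριος\n"

def join_non_digit(text):
    # Replace a line break with a space when any non-digit follows.
    # "\D" also matches "\n", so the first break of a blank line is replaced too.
    return re.sub(r'\r?\n(\D)', r' \1', text)

def join_letters(text):
    # Replace a line break with a space only when a letter follows.
    return re.sub(r'\r?\n([^\W\d_])', r' \1', text)

print(join_non_digit(sample))
print(join_letters(sample))
```

With this sample, `join_letters` keeps the blank line after the heading, while `join_non_digit` turns the first of the two line breaks into a space, so the heading ends in a trailing space and the blank line disappears.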

Removing line breaks in a Unicode file except before digits
