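I am trying to get the start and end offsets of the paragraphs in a text file in Python. I tried the code below; it gives me start and end offsets, but if a paragraph begins with a space or a tab, it is not treated as a paragraph.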
import re

# textFile is expected to hold the file contents as a single string
paraStartOffset = []
paraEndOffset = []
for match in re.finditer(r'(?s)((?:[^\n]?)+)', textFile):
    paraStartOffset.append(match.start())
    paraEndOffset.append(match.end())
print "start Offset --> ", paraStartOffset
print "end Offset --> ", paraEndOffset
Can someone point out what I am missing? Thanks.
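Answer 0 (score: 0)
I think this question / answer covers pretty much what you are looking for. The code (taken from the answer) works well enough when I test it with leading whitespace at the start of a paragraph.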
for match in re.finditer(r'(?s)((?:[^\n][\n]?)+)', DATA):
    print match.start(), match.end()
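When I run it on my test text (taken from Bram Stoker's Dracula), it returns the following. The first paragraph is a standard one. The second starts with SPACES. The third starts with a TAB.
Results: (start and end offset of each paragraph)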
0 630
631 1029
1030 1125
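For anyone who wants to try this on Python 3, here is a minimal sketch of the same idea that collects the offsets into a list. It uses the core of the pattern above without the `(?s)` flag and the outer group, since the pattern contains no `.` and the whole match is used anyway. The names `PARAGRAPH_RE`, `paragraph_offsets` and `sample` are made up for illustration, and the sample text is hypothetical; it also assumes paragraphs are separated by truly empty lines.

```python
import re

# Core of the pattern from the answer: a paragraph is a run of non-newline
# characters, each optionally followed by one newline, so an empty line
# (two newlines in a row) ends the match.
PARAGRAPH_RE = re.compile(r'(?:[^\n][\n]?)+')


def paragraph_offsets(text):
    """Return a list of (start, end) character offsets, one per paragraph."""
    return [(m.start(), m.end()) for m in PARAGRAPH_RE.finditer(text)]


# Hypothetical sample: three paragraphs separated by empty lines.
# The second starts with spaces, the third with a tab.
sample = (
    "First paragraph,\nsecond line.\n"
    "\n"
    "   Starts with spaces.\n"
    "\n"
    "\tStarts with a tab."
)

offsets = paragraph_offsets(sample)
for start, end in offsets:
    # Slicing with the offsets shows exactly what each match covers.
    print(start, end, repr(sample[start:end]))
```

Slicing `sample[start:end]` is a quick way to check that leading spaces and tabs are kept inside the match. Note that each end offset includes the paragraph's own trailing newline, and that these are character offsets in the string, not byte offsets in the file.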
Test text: (I could not preserve the original formatting exactly, but anyway...)
_3 May. Bistritz._--Left Munich at 8:35 P. M., on 1st May, arriving at
Vienna early next morning; should have arrived at 6:46, but train was an
hour late. Buda-Pesth seems a wonderful place, from the glimpse which I
got of it from the train and the little I could walk through the
streets. I feared to go very far from the station, as we had arrived
late and would start as near the correct time as possible. The
impression I had was that we were leaving the West and entering the
East; the most western of splendid bridges over the Danube, which is
here of noble width and depth, took us among the traditions of Turkish
rule.
"My Friend.--Welcome to the Carpathians. I am anxiously expecting
you. Sleep well to-night. At three to-morrow the diligence will
start for Bukovina; a place on it is kept for you. At the Borgo
Pass my carriage will await you and will bring you to me. I trust
that your journey from London has been a happy one, and that you
will enjoy your stay in my beautiful land.
Just before I was leaving, the old lady came up to my room and said in a
very hysterical way: