我使用Python脚本导入了纯文本版本的PDF,但它有一堆我不关心的垃圾工件。
我唯一关心的空白是(1)单个空格,以及(2) double \ n's。
单一空格,原因很明显,在字边界之间。 Double \ n's,在段落之间划分界限。
它包含的垃圾空格如下所示:
[\ \n\t]+ all jumbled together
这引出了另一个问题,有时段落由
划分[\n][\s]+[\n]
我对正则表达式的经验不足以使它忽略两个\n
之间的内部空白。作为一个业余的RegExer,我的问题是\s
包括\n
。
如果没有 - 我认为这将是一个非常容易解决的问题。
所有其他空白区域都是无关紧要的,我正在尝试的任何东西都无法正常工作。
非常感谢任何建议。
Summary: The Department of Environment in Bangladesh seized 265 sacks of poultry feed
tainted with tannery waste and various chemicals.
Synthesis/Analysis: The Department of Environment seized the tainted poultry feed on
28 March from a house in the city of Adabar located in Dhaka province. Workers were
found in the house, which was used as an illegal factory, producing the tainted feed. The
Bangladesh Environment Conservation Act allowed for a case to be filed against the
factory’s manager, Mahmud Hossain, and the owner, who was not named.
It was reported that the Department of Environment had also closed three other factories
in Hazaribag a month prior to this instance for the same charges. The Bangladesh Council of
Scientific and Industrial Research found that samples from the feed taken from these
factories had “dangerous levels of chromium…” The news report also stated that “poultry
6
and eggs became poisonous” from consuming the tainted feed, which would also cause
health concerns for consumers.
这只是引导我进行更多修复...要删除所有页码,并随机加倍\ n。
答案 0 :(得分:3)
您可以使用断言使\s
排除换行符:
((?!\n)\s){2,}
要将换行符与其间的\n\s+\n
个空格合并,您可以使用类似的构造来代替\s+
。但为简单起见,我只使用两个preg_match
es并首先合并换行符,然后清理双倍空格。
答案 1 :(得分:1)
我认为这会奏效:
将“\ s * \ n \ s * \ n \ s *”替换为“\ n \ n”。
这将标准化段落分隔符。
将“\ s * \ s *”替换为“”。
这将使单词分隔符标准化。
答案 2 :(得分:0)
s/(\h)+/$1/g;
s/\n(\s*\n)+/\n\n/g;
以上将删除水平空格(我假设您不担心垂直制表符或回车符)和所有换行符都超过双倍空格。
答案 3 :(得分:0)
曾经如此轻微的hacky,但我认为这应该可以解决问题:
preg_replace('/\t|( ){2,}|(\n)\s+(\n)/', '\1\2\3', $data);
奖励:不需要对字符串进行两次传递。