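Unable to search for keywords in free text with special characters and blank lines between records

I have a large text file with 3 columns (each column is separated by "|"). Each record seems to end with a } symbol, and there is a blank line between rows or records. The file is about 100 MB or more. My goal is to search for multiple keywords and get the surrounding words before and after each keyword. With help from Stack Overflow I am using this code, but I get a Unicode error. Please help.
1. I only want to see positive results. If a search has no match, I don't want to see any data.
2. Is it possible to see the first 4 columns of each record alongside every match? These four columns are fixed-length and have the same layout in every record.
Sample of my file: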
00010007308000003161|730100039|2007-11-27 09:54:17.000|ACCG| {\rtf1\ansi\deflang1033\ftnbj\uc1
{\fonttbl{\f0 \froman \fcharset0 Times New Roman;}{\f1 \fswiss \fcharset0 Arial;}}
{\colortbl ;\red255\green255\blue255 ;\red0\green0\blue0 ;\red255\green0\blue0
;}
{\stylesheet{\fs20\cf2\cb1 Normal;}{\cs1\cf2\cb1 Default Paragraph Font;}{\s2\cf0\cb1
;}}
\par\par\par\b
FOLLOW-According to the United States Census Bureau, the township has a total area of 15.1 square miles
(39 km2), of which, 14.6 square miles (38 km2) of it is land and 0.5 square miles (1.3 km2) of it
(3.58%) is water. It is drained by the Lehigh River on its western \clvertalt\cellx4320
\pard\intbl\s0\ql\widctlpar\plain\f1\fs20\lang4105\f1\fs16 3.87 10^6/uL \cell
\pard\s0\ql\widctlpar\plain\f1\fs20\par\par\b ASSESSMENT:\plain\f1\fs20 Perfect
As of the census[1] of 2000, there were 4,243 people, 1,671 households, and 1,256 families residing in
the township. The population cc:\tab Dhar xdfsd, MD\par\par\par\par\pard\s0\ql\par}
00010007308000003141|730100040|2007-11-27 10:05:09.000|ACCG| {\rtf1\ansi\deflang1033\ftnbj\uc1
{\fonttbl{\f0 \froman \fcharset0 Times New Roman;}{\f1 \fswiss \fcharset0 Arial;}}
{\colortbl ;\red255\green255\blue255 ;\red0\green0\blue0 ;\red255\green0\blue0
;}
{\stylesheet{\fs20\cf0\cb1 Normal;}{\cs1\cf0\cb1 Default Paragraph Font;} {\s2\cf2\cb1
;}{\s3\f1\fs22\cf2\cb1\tqc\tx4320\tqr\tx8640 header;} {\s4\fs20\cf2\cb1\tqc\tx4320\tqr\tx8640
footer;}}
\pgwsxn12240\pghsxn15840\marglsxn864\margrsxn864\margtsxn1440\margbsxn864\headery1440\footery864\sbkpage
\pgncont\pgndec
\plain\plain\f1\fs20\pard\par\pard\s3\tqc\tx4320\tqr\tx8640\qc\widctlpar\f0\fs28 \caps
There were 1,671 households out of which 28.8% had children under the age of 18 living with them, 64.0%
were married couples living together, 6.9% had a female householder with no husband present, census
24.8% were non-families. 19.5% of all households were made up of
30094 - (770) 761-7260 - FAX (678) 413 -1818\par\lang1024\f0\fs20\par\pard\plain\f1\fs20\par\ql\par\par
}
00010007308000003141|730100036|2007-11-19 12:36:28.000|ACCG| {\rtf1\ansi\deflang1033\ftnbj\uc1
{\fonttbl{\f0 \froman \fcharset0 Times New Roman;}{\f1 \fswiss \fcharset0 Arial;}}
{\colortbl ;\red255\green255\blue255 ;\red255\green0\blue0 ;}
{\stylesheet{\fs20\cf0\cb1 Normal;}{\cs1\cf0\cb1 Default Paragraph Font;}}
\paperw12240\paperh15840\margl864\margr864\margt1440\margb864\headery1440\footer y864\deftab720\formshade
\aendnotes\aftnnrlc\pgbrdrhead\pgbrdrfoot
\sectd
\pgwsxn12240\pghsxn15840\marglsxn864\margrsxn864\margtsxn1440\margbsxn864\headery1440\footery864\sbkpage
\pgncont\pgndec
\plain\plain\f1\fs20\lang1033\f1 Home Care Note: CMN received from Home Medical
In the township the population was spread out with 21.4% under the age of 18, 6.5% from 18 to 24, 29.9%
from 25 to 44, 27.7% from 45 to 64, and 14.6% who were 65 years of age or older. The median age was 40
years. For every 100 females there were 101.1 males. For every 100 females age 18 and over, there were
98.5 males
on RA on the 18th of Oct. Cont. O2 at 2L/N/C was ordered. \plain\f1\fs20\par}
00010007308000003141|730100037|2007-11-15 12:05:02.000|ACCG|Clear Document - Certificate
00010007308000003141|730100038|2007-11-28 08:35:18.000|ACCG {\rtf1\ansi\deflang1033\ftnbj\uc1
{\fonttbl{\f0 \census \fcharset0 Times New Roman;}{\f1 \fswiss \fcharset0 Arial;}}
{\colortbl ;\red255\green255\blue255 ;\red0\green0\blue0 ;\red255\green0\blue0
;}
{\stylesheet{\fs20\cf2\cb1 Normal;}{\cs1\cf2\cb1 Default Paragraph Font;}}
\paperw12240\paperh15840\margl864\margr864\margt1440\margb864\headery1440\footery864\deftab720\formshade
\aendnotes\aftnnrlc\pgbrdrhead\pgbrdrfoot
called and faxed to Mike.\plain\f1\fs20\par}
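In the file above I am searching for "census" (case-insensitive), and I find matches in 4 places (twice in the first record, and twice more in 2 different records).
The desired output is below: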
00010007308000003161|730100039|2007-11-27 09:54:17.000|ACCG|United States Census Bureau, the t
00010007308000003161|730100039|2007-11-27 09:54:17.000|ACCG|of the census[1] of 2000
00010007308000003141|730100040|2007-11-27 10:05:09.000|ACCG|husband present, census 24.8% were
00010007308000003141|730100038|2007-11-28 08:35:18.000|ACCG|fonttbl{\f0 \census \fcharset0 Times
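In the desired output above I chose to show only two words before and after "census". It would be great to have the flexibility to choose more than 2 words, for example 10 words before and 15 words after.
I am also reading this from a text file. It would be great if you could give me the commands to read from and write back to a text file. Sorry, I am new to Python, but I love how powerful it is.
Thank you very much for your help.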
Answer 0 (score: 0)
s = r"""4200011|4200002|2006-12-28 10:28:42.000|{\rtf1\ansi
^^
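Try this: note the r prefix, which makes the literal a raw string. Otherwise you have to double every backslash in sequences such as \plain\f1\fs20\par.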
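To make the difference concrete, here is a small sketch of raw versus escaped literals. It assumes Python 3, where a sequence such as \uc1 in a non-raw literal is read as a truncated \uXXXX escape; that is one plausible cause of the Unicode error in the question, not a confirmed diagnosis. The variable names are illustrative only.

```python
# Raw string: backslashes are kept literally, so RTF control words survive.
raw = r"{\rtf1\ansi\deflang1033\ftnbj\uc1"

# Equivalent non-raw literal: every backslash has to be doubled.
escaped = "{\\rtf1\\ansi\\deflang1033\\ftnbj\\uc1"

print(raw == escaped)  # True

# A non-raw literal with single backslashes does not even compile in
# Python 3, because \uc1 looks like an incomplete \uXXXX escape.
# compile() is used here so that this demo file itself stays valid.
failed = False
try:
    compile('s = "{\\rtf1\\uc1"', "<demo>", "exec")
except SyntaxError as exc:
    failed = True
    print("SyntaxError:", exc.msg)
```

This only concerns string literals typed into source code or the interpreter. Text read from a file with open() is never escape-processed, so backslashes in the data file need no special handling.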
Answer 1 (score: 0)
You can use the following regex.
>>> import re
>>> s = r"""4200011|4200002|2006-12-28 10:28:42.000|{\rtf1\ansi
\deflang1033\ftnbj\uc1
{\fonttbl{\f0 \froman \fcharset0 Times New Roman;}{\f1 \fswiss
\fcharset0 Arial;}}
{\colortbl ;\red255\green255\blue255 ;\red0\green0\blue0 ;}
{\stylesheet{\fs20\cf2\cb1 Normal;}{\cs1\cf2\cb1 Default Paragraph
Font;}}
\paperw12240\paperh15840\margl864\margr864\margt1152\margb720\head
ery1152\footery720\deftab720\formshade\aendnotes\aftnnrlc
Called Brian with mike
\pgbrdrhead
12/27/06 fax 293-4812\plain\f1\fs20\par}
4200011|4200007|2010-11-29 12:49:42.000|{\rtf1\ansi
\deflang1033\ftnbj\uc1
{\fonttbl{\f0 \froman \fcharset0 Times New Roman;}{\f1 \fswiss
\fcharset0 Arial;}}
{\colortbl ;\red255\green255\blue255 ;\red255\green0\blue0 ;}
{\stylesheet{\fs20\cf0\cb1 Normal;}{\cs1\cf0\cb1 Default Paragraph
Font;}}
\paperw12240\paperh15840\margl864\margr864\margt1007\margb576\head
ery1007\footery576\deftab720\formshade\aendnotes\aftnnrlc
\pgbrdrhead them numbers and they pt
minutes\plain\f1\fs20\par}"""
>>> ls = re.findall(r'^(\d+\|\d+)\|(?:(?!\n\n)[\s\S])*?(\S+\s+\S+\s+mike\s+\S+\s+\S+)', s)
>>> print(('|'.join([j for i in ls for j in i])).replace('\n',' '))
4200011|4200002|Brian with mike \pgbrdrhead 12/27/06
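The regex above hard-codes two words on each side of the keyword. A rough sketch of a more flexible variant follows, under several assumptions: every record starts with a header line shaped like the question's sample (digits|digits|timestamp|code), "words" are simply whitespace-separated tokens (so RTF control words count as words), and the file fits in memory. The names HEADER, iter_records, find_context and search_file are made up for this illustration, and the utf-8 encoding is a guess about the data.

```python
import re

# Assumed header layout, taken from the question's sample:
# digits|digits|YYYY-MM-DD hh:mm:ss.fff|CODE, with an optional trailing "|".
HEADER = re.compile(
    r"^(\d+\|\d+\|\d{4}-\d\d-\d\d \d\d:\d\d:\d\d\.\d+\|[A-Z]+)\|?",
    re.MULTILINE,
)

def iter_records(text):
    """Yield (header, body) pairs; a body runs until the next header line."""
    matches = list(HEADER.finditer(text))
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        yield m.group(1), text[m.end():end]

def find_context(text, keyword, before=2, after=2):
    """Yield 'header|context' for every token containing keyword (any case)."""
    needle = keyword.lower()
    for header, body in iter_records(text):
        words = body.split()
        for i, word in enumerate(words):
            if needle in word.lower():
                context = words[max(0, i - before): i + after + 1]
                yield header + "|" + " ".join(context)

def search_file(src, dst, keyword, before=2, after=2):
    """Read src, write one line per hit to dst; nothing is written if no hit."""
    # errors="replace" avoids decode failures on unexpected bytes (assumption:
    # the file is mostly utf-8 compatible text).
    with open(src, encoding="utf-8", errors="replace") as fin:
        text = fin.read()
    with open(dst, "w", encoding="utf-8") as fout:
        for line in find_context(text, keyword, before, after):
            fout.write(line + "\n")

sample = (
    "0001|730100039|2007-11-27 09:54:17.000|ACCG| {\\rtf1\n"
    "to the United States Census Bureau, the township\n"
    "As of the census[1] of 2000}\n"
    "0002|730100037|2007-11-15 12:05:02.000|ACCG|Clear Document\n"
)
hits = list(find_context(sample, "census", before=2, after=2))
for hit in hits:
    print(hit)
```

For a real file, a call such as search_file("notes.txt", "hits.txt", "census", before=10, after=15) would write the matches to a second file; both file names are placeholders. Records without a match produce no output lines.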