Large free-text data with special characters: searching with Python gives a Unicode error

Time: 2015-02-10 06:27:15

Tags: regex python-3.x unicode

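I have free text with special characters and blank lines between records, and I cannot search it for keywords. It is a large text file with 3 columns (each column separated by "|"). Each record seems to end with a } symbol. There is a blank line between each line or record. The file is about 100 MB or more. My goal is to search for multiple keywords and get the surrounding words before and after each keyword. With help from Stack Overflow I am using this code, but I get a Unicode error. Please help.

1. I only want to see positive results. In other words, if the search does not match, I don't want to see any data.

2. Is it possible to see the first 4 columns for each hit along with the result? These four columns are fixed length and the same in every record.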
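To attach the leading columns to every hit, one option is to group the lines into records first, treating each line that looks like a header as the start of a new record. The sketch below is only an illustration: the `HEADER` pattern is an assumption inferred from the sample in this question (two numeric fields, a timestamp, a short uppercase code), the pipe after the code is treated as optional because one sample record lacks it, and `iter_records` and the shortened `sample` lines are hypothetical.

```python
import re

# Assumed header shape: digits|digits|timestamp|CODE, then the free text.
# The pipe after CODE is optional because one sample record omits it.
HEADER = re.compile(
    r"^(\d+\|\d+\|\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+\|[A-Z]+)\|?\s*(.*)$"
)

def iter_records(lines):
    """Yield (header, body) pairs; a new record starts at each header line."""
    header, body = None, []
    for line in lines:
        line = line.rstrip("\n")
        match = HEADER.match(line)
        if match:
            if header is not None:
                yield header, "\n".join(body)
            header, body = match.group(1), [match.group(2)]
        elif header is not None:
            body.append(line)  # blank lines inside a record are kept
    if header is not None:
        yield header, "\n".join(body)

# Shortened stand-in for the real file (any iterable of lines works,
# including an open file object).
sample = [
    "0001|7301|2007-11-27 09:54:17.000|ACCG| {\\rtf1\\ansi\n",
    "United States Census Bureau\n",
    "\n",
    "of the census[1] of 2000\\par}\n",
    "\n",
    "0001|7302|2007-11-15 12:05:02.000|ACCG|Clear Document - Certificate\n",
]
records = list(iter_records(sample))
for header, body in records:
    print(header, "->", len(body), "characters of text")
```

Because the function is a generator over lines, it can be fed an open file directly, so a 100 MB file does not have to be loaded into memory at once.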

A sample of my file:

00010007308000003161|730100039|2007-11-27 09:54:17.000|ACCG| {\rtf1\ansi\deflang1033\ftnbj\uc1
{\fonttbl{\f0 \froman \fcharset0 Times New Roman;}{\f1 \fswiss \fcharset0  Arial;}}
{\colortbl ;\red255\green255\blue255 ;\red0\green0\blue0 ;\red255\green0\blue0 
;}
{\stylesheet{\fs20\cf2\cb1 Normal;}{\cs1\cf2\cb1 Default Paragraph Font;}{\s2\cf0\cb1 
;}}
\par\par\par\b 
FOLLOW-According to the United States Census Bureau, the township has a    total area of 15.1 square miles 

(39 km2), of which, 14.6 square miles (38 km2) of it is land and 0.5 square   miles (1.3 km2) of it 

(3.58%) is water. It is drained by the Lehigh River on its western   \clvertalt\cellx4320
\pard\intbl\s0\ql\widctlpar\plain\f1\fs20\lang4105\f1\fs16 3.87 10^6/uL  \cell
\pard\s0\ql\widctlpar\plain\f1\fs20\par\par\b ASSESSMENT:\plain\f1\fs20    Perfect 
As of the census[1] of 2000, there were 4,243 people, 1,671 households, and  1,256 families residing in 

the township. The population cc:\tab Dhar xdfsd,  MD\par\par\par\par\pard\s0\ql\par}

00010007308000003141|730100040|2007-11-27 10:05:09.000|ACCG|   {\rtf1\ansi\deflang1033\ftnbj\uc1
{\fonttbl{\f0 \froman \fcharset0 Times New Roman;}{\f1 \fswiss \fcharset0  Arial;}}
{\colortbl ;\red255\green255\blue255 ;\red0\green0\blue0  ;\red255\green0\blue0 
;}
{\stylesheet{\fs20\cf0\cb1 Normal;}{\cs1\cf0\cb1 Default Paragraph Font;} {\s2\cf2\cb1 
;}{\s3\f1\fs22\cf2\cb1\tqc\tx4320\tqr\tx8640 header;}   {\s4\fs20\cf2\cb1\tqc\tx4320\tqr\tx8640 
footer;}}
   \pgwsxn12240\pghsxn15840\marglsxn864\margrsxn864\margtsxn1440\margbsxn864\headery1440\footery864\sbkpage

\pgncont\pgndec
\plain\plain\f1\fs20\pard\par\pard\s3\tqc\tx4320\tqr\tx8640\qc\widctlpar\f0\fs28    \caps 
There were 1,671 households out of which 28.8% had children under the age of  18 living with them, 64.0% 

were married couples living together, 6.9% had a female householder with no  husband present, census 

24.8% were non-families. 19.5% of all households were made up of 
30094 - (770) 761-7260 - FAX (678) 413   -1818\par\lang1024\f0\fs20\par\pard\plain\f1\fs20\par\ql\par\par
}

00010007308000003141|730100036|2007-11-19 12:36:28.000|ACCG| {\rtf1\ansi\deflang1033\ftnbj\uc1
{\fonttbl{\f0 \froman \fcharset0 Times New Roman;}{\f1 \fswiss \fcharset0 Arial;}}
{\colortbl ;\red255\green255\blue255 ;\red255\green0\blue0 ;}
{\stylesheet{\fs20\cf0\cb1 Normal;}{\cs1\cf0\cb1 Default Paragraph Font;}}
\paperw12240\paperh15840\margl864\margr864\margt1440\margb864\headery1440\footer y864\deftab720\formshade

\aendnotes\aftnnrlc\pgbrdrhead\pgbrdrfoot
\sectd

\pgwsxn12240\pghsxn15840\marglsxn864\margrsxn864\margtsxn1440\margbsxn864\headery1440\footery864\sbkpage

\pgncont\pgndec
\plain\plain\f1\fs20\lang1033\f1 Home Care Note:  CMN received from Home  Medical 
In the township the population was spread out with 21.4% under the age of  18, 6.5% from 18 to 24, 29.9% 

from 25 to 44, 27.7% from 45 to 64, and 14.6% who were 65 years of age or  older. The median age was 40 

years. For every 100 females there were 101.1 males. For every 100 females  age 18 and over, there were 

98.5 males
on RA on the 18th of Oct.  Cont. O2 at 2L/N/C was ordered.   \plain\f1\fs20\par}

00010007308000003141|730100037|2007-11-15 12:05:02.000|ACCG|Clear Document - Certificate

00010007308000003141|730100038|2007-11-28 08:35:18.000|ACCG {\rtf1\ansi\deflang1033\ftnbj\uc1
{\fonttbl{\f0 \census \fcharset0 Times New Roman;}{\f1 \fswiss \fcharset0 Arial;}}
{\colortbl ;\red255\green255\blue255 ;\red0\green0\blue0 ;\red255\green0\blue0 
;}
{\stylesheet{\fs20\cf2\cb1 Normal;}{\cs1\cf2\cb1 Default Paragraph Font;}}
 \paperw12240\paperh15840\margl864\margr864\margt1440\margb864\headery1440\footery864\deftab720\formshade

\aendnotes\aftnnrlc\pgbrdrhead\pgbrdrfoot
called and faxed to Mike.\plain\f1\fs20\par}

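In the file above, I am searching for "census" (case-insensitive), and I find matches in 4 places (twice in the first record, and twice more across two other records).

The desired output is below: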

00010007308000003161|730100039|2007-11-27 09:54:17.000|ACCG|United States Census Bureau, the t
00010007308000003161|730100039|2007-11-27 09:54:17.000|ACCG|of the census[1] of 2000
00010007308000003141|730100040|2007-11-27 10:05:09.000|ACCG|husband present, census 24.8% were 
00010007308000003141|730100038|2007-11-28 08:35:18.000|ACCG|fonttbl{\f0 \census \fcharset0 Times

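In the desired output above, I chose to show only two words before and after "census". It would be great if I had the flexibility to choose more than 2 words, for example 10 words before and 15 words after.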
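A configurable window of words can be sketched without a regex by splitting the text on whitespace and slicing around each hit. This is a minimal illustration, not a definitive solution: `keyword_in_context` and its `before`/`after` parameters are hypothetical names, a "word" is assumed to be any whitespace-separated token, and the keyword is matched as a case-insensitive substring of a token (so tokens such as `census[1]` also count).

```python
def keyword_in_context(text, keyword, before=2, after=2):
    """Return snippets with up to `before`/`after` words around each hit."""
    words = text.split()  # splits on any whitespace, including blank lines
    needle = keyword.lower()
    snippets = []
    for i, word in enumerate(words):
        if needle in word.lower():
            start = max(0, i - before)
            snippets.append(" ".join(words[start:i + after + 1]))
    return snippets

text = (
    "husband present, census\n\n24.8% were non-families. "
    "As of the census[1] of 2000, there were"
)
print(keyword_in_context(text, "census"))
# ['husband present, census 24.8% were', 'of the census[1] of 2000,']
print(keyword_in_context(text, "CENSUS", before=1, after=3))
```

An empty list comes back when nothing matches, which makes it easy to print only positive results.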

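I am also reading this from a text file. It would be great if you could give me the commands to read from and write back to a text file. Sorry, I am new to Python, but I like how powerful it is.

Thank you very much for your help.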
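For the reading and writing part, a small sketch follows. It assumes the Unicode error comes from bytes in the file that are not valid in the chosen encoding; the question does not say which encoding the file uses, so `utf-8` here is a guess and `errors="replace"` is one way to keep going past bad bytes. The function name `filter_lines` and the file names are hypothetical.

```python
import os
import tempfile

def filter_lines(src_path, dst_path, keyword, encoding="utf-8"):
    """Copy lines containing `keyword` (case-insensitive) to dst; return the count."""
    needle = keyword.lower()
    count = 0
    # errors="replace" turns undecodable bytes into U+FFFD
    # instead of raising UnicodeDecodeError.
    with open(src_path, encoding=encoding, errors="replace") as src, \
         open(dst_path, "w", encoding=encoding) as dst:
        for line in src:  # iterating the file object streams it line by line
            if needle in line.lower():
                dst.write(line)
                count += 1
    return count

# Demo on a throwaway file that contains one byte (0x92) that is invalid UTF-8.
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "notes.txt")     # hypothetical file names
dst_path = os.path.join(workdir, "matches.txt")
with open(src_path, "wb") as f:
    f.write(b"United States Census Bureau\nthe patient\x92s chart\nof the census[1] of 2000\n")

matched = filter_lines(src_path, dst_path, "census")
with open(dst_path, encoding="utf-8") as f:
    result = f.read()
print(matched, "matching lines written")
```

If the real encoding is known (for example a Windows code page), passing it as `encoding` is preferable to replacing bytes, since replacement loses the original characters.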

2 answers:

Answer 0: (score: 0)

s = r"""4200011|4200002|2006-12-28 10:28:42.000|{\rtf1\ansi

   ^^

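Try this (note the r prefix marked above, which makes the literal a raw string). Otherwise you would have to double-escape every backslash, as in \plain\f1\fs20\par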
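To illustrate the point about raw strings: in a normal Python 3 string literal, `\u` starts a Unicode escape, so RTF control words such as `\uc1` make the literal fail to compile with a "(unicode error)" message. The sketch below uses a shortened RTF fragment and compiles the non-raw form from source text so the error can be shown without breaking the script. It only concerns text pasted into source code; text read from a file is not processed for escapes.

```python
# A raw literal and a double-escaped literal produce the same text.
raw = r"{\rtf1\ansi\deflang1033\ftnbj\uc1"
escaped = "{\\rtf1\\ansi\\deflang1033\\ftnbj\\uc1"
same = raw == escaped
print(same)

# The text of `source` is:  s = "{\rtf1\ansi\uc1"
# i.e. a normal (non-raw) literal containing \u followed by non-hex characters.
source = 's = "{\\rtf1\\ansi\\uc1"'
try:
    compile(source, "<demo>", "exec")
    error_message = ""
except SyntaxError as exc:
    error_message = str(exc)
print(error_message)
```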

Answer 1: (score: 0)

You can use the following regex.

>>> import re
>>> s = r"""4200011|4200002|2006-12-28 10:28:42.000|{\rtf1\ansi 
\deflang1033\ftnbj\uc1
{\fonttbl{\f0 \froman \fcharset0 Times New Roman;}{\f1 \fswiss  
\fcharset0 Arial;}}
{\colortbl ;\red255\green255\blue255 ;\red0\green0\blue0 ;}
{\stylesheet{\fs20\cf2\cb1 Normal;}{\cs1\cf2\cb1 Default Paragraph  
Font;}}
\paperw12240\paperh15840\margl864\margr864\margt1152\margb720\head 
ery1152\footery720\deftab720\formshade\aendnotes\aftnnrlc
Called Brian with mike 
\pgbrdrhead
12/27/06 fax 293-4812\plain\f1\fs20\par}

4200011|4200007|2010-11-29 12:49:42.000|{\rtf1\ansi 
\deflang1033\ftnbj\uc1
{\fonttbl{\f0 \froman \fcharset0 Times New Roman;}{\f1 \fswiss  
\fcharset0 Arial;}}
{\colortbl ;\red255\green255\blue255 ;\red255\green0\blue0 ;}
{\stylesheet{\fs20\cf0\cb1 Normal;}{\cs1\cf0\cb1 Default Paragraph  
Font;}}
\paperw12240\paperh15840\margl864\margr864\margt1007\margb576\head 
ery1007\footery576\deftab720\formshade\aendnotes\aftnnrlc 
\pgbrdrhead them  numbers and  they pt
minutes\plain\f1\fs20\par}"""
>>> ls = re.findall(r'^(\d+\|\d+)\|(?:(?!\n\n)[\s\S])*?(\S+\s+\S+\s+mike\s+\S+\s+\S+)', s)
>>> print(('|'.join([j for i in ls for j in i])).replace('\n',' '))
4200011|4200002|Brian with mike  \pgbrdrhead 12/27/06
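One thing to check when adapting this regex to a whole file: without `re.MULTILINE`, `^` matches only at the very start of the string, so only the first record can produce a hit, and without `re.IGNORECASE` the keyword is matched case-sensitively. The sketch below reuses the pattern from this answer on a shortened two-record string; the text of the second record is made up for illustration.

```python
import re

s = r"""4200011|4200002|2006-12-28 10:28:42.000|{\rtf1\ansi
Called Brian with mike
12/27/06 fax 293-4812\par}

4200011|4200007|2010-11-29 12:49:42.000|{\rtf1\ansi
spoke to Mike about numbers\par}"""

pattern = r'^(\d+\|\d+)\|(?:(?!\n\n)[\s\S])*?(\S+\s+\S+\s+mike\s+\S+\s+\S+)'

# Default flags: ^ anchors only at position 0, and "mike" is case-sensitive.
first_only = re.findall(pattern, s)

# MULTILINE lets ^ match at every line start; IGNORECASE also accepts "Mike".
all_records = re.findall(pattern, s, flags=re.MULTILINE | re.IGNORECASE)

print(len(first_only), len(all_records))
for ids, snippet in all_records:
    print(ids + "|" + snippet.replace("\n", " "))
```

Because every match has to begin at a header line, the pattern as written reports the first hit in each record rather than every occurrence of the keyword.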