Parsing text with Python: unstructured files holding similar information in different formats

Time: 2011-04-09 22:53:09

Tags: python parsing text-parsing

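I am trying to use Python to parse several thousand spec-sheet text files (material safety data sheets, specifically) that contain company, material, and chemical-property information, among other things. The files hold similar information in loosely structured formats, so they are human-readable but unstructured and hard to parse (they are not XML or CSV, for example). In short, the data is all over the place.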

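Originally, the data was entered by hand by different people working at different companies. Another group of people then transcribed the information into these text files (converting it to txt files via OCR).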

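Are there parsing libraries or patterns for extracting this kind of information? (This seems like a "common" data-entry problem.) Regular expressions will be used, of course. I have no experience with natural language processing libraries. Would they even be suitable for this problem?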

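My initial thought was to group the files into different categories and then write a set of parsing functions for each format. Unfortunately, this solves only a small part of the problem, and the number of distinct cases can quickly get out of hand.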

Since the question is rather general, I will provide a few examples that illustrate the problem.

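Address information
Every file contains company information, such as the company name and address. The information may or may not have field identifiers, it may or may not be on a single line, and so on. In short, every combination seems to occur.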

Example (with field info):

MANUFACTURER: Foo Bar Inc.  
ADDRESS: 123 Foo St.  
Bar, CA 90012

Example (without field info):

Foo Bar Inc.  
123 Foo St.  
Bar, CA 90012

Example (sometimes there are extra blank lines between the pieces of information):

FOO BAR INC.

123 FOO ST.

BAR, CA 90012
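
One possible way to handle the unlabeled variants is to anchor on the most recognizable line, the "City, ST 12345" line, and take the two non-blank lines before it as name and street. This is only a sketch: the function name `find_unlabeled_address` is hypothetical, and it assumes a two-letter state code, a five-digit ZIP (optionally ZIP+4), and exactly one name line and one street line.

```python
import re

# Assumed shape of the last address line: "City, ST 12345" or "City, ST 12345-6789".
CITY_STATE_ZIP = re.compile(
    r"^(?P<city>[A-Za-z .'-]+),\s*(?P<state>[A-Za-z]{2})\s+(?P<zip>\d{5}(?:-\d{4})?)$"
)

def find_unlabeled_address(lines):
    """Return (name, street, city, state, zip) or None if no address is found."""
    # Dropping blank lines makes the spaced-out variant look like the compact one.
    nonblank = [ln.strip() for ln in lines if ln.strip()]
    for i, ln in enumerate(nonblank):
        m = CITY_STATE_ZIP.match(ln)
        if m and i >= 2:
            return (
                nonblank[i - 2],           # assumed to be the company name
                nonblank[i - 1],           # assumed to be the street line
                m.group("city"),
                m.group("state").upper(),
                m.group("zip"),
            )
    return None
```

Spelled-out state names such as "CALIFORNIA" would not match this pattern and would need a lookup table of state names.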

Example (inconsistent field names):

MANUFACTURER'S NAME: FOO BAR INC.  
CREATIVE DIVISION  
ADDRESS: 123 FOO ST.  
CITY, STATE & ZIP: BAR, CALIFORNIA 90012  
PHONE NUMBER: 310-111-2222
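
Inconsistent field names like these could be mapped onto one canonical name per field through an alias table. The sketch below is an assumption-laden illustration: the `ALIASES` table, the canonical names, and the helpers `normalize_label` and `split_field` are all hypothetical, and the table would have to grow as new label variants turn up in the files.

```python
import re

# Hypothetical alias table: label as seen in the files -> canonical field name.
ALIASES = {
    "MANUFACTURER": "manufacturer",
    "MANUFACTURER'S NAME": "manufacturer",
    "ADDRESS": "address",
    "CITY, STATE & ZIP": "city_state_zip",
    "PHONE NUMBER": "phone",
}

def normalize_label(raw):
    """Collapse whitespace, upper-case, drop a trailing colon, then look up the alias."""
    key = re.sub(r"\s+", " ", raw).strip().upper().rstrip(":").strip()
    return ALIASES.get(key)  # None for labels that are not in the table

def split_field(line):
    """Return (canonical_name, value); canonical_name is None for unlabeled lines."""
    label, sep, value = line.partition(":")
    name = normalize_label(label) if sep else None
    if name is None:
        return None, line.strip()
    return name, value.strip()
```

Lines that come back with `None` as the name (such as "CREATIVE DIVISION" or a bare "Bar, CA 90012") are continuation lines and still need to be attached to a neighboring field by some other rule.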

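Section information
The spec sheets also have similar sections, but the order, headings, numeral type, and delimiters are inconsistent.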

Example:

========================================
SECTION 1 -- MATERIALS
========================================

Example:

Section I. Materials
------------------------------------------

Example:

----- Section 3       Materials
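
A single tolerant regular expression might cover header variants like the ones above. The following is a sketch rather than a complete solution: the pattern `HEADER` and the function `parse_header` are hypothetical names, and the pattern assumes a header either contains the word "Section" followed by an arabic or roman numeral, or starts with a number and a period.

```python
import re

HEADER = re.compile(
    r"""
    ^[\s=-]*                                    # optional leading rule characters
    (?:section\s+(?P<num>\d+|[ivxlc]+)(?![a-z]) # the word Section plus a numeral
      |(?P<num2>\d+)\.)                         # or a bare number followed by a period
    [\s.:-]*                                    # separators between number and title
    (?P<title>[a-z].*?)                         # title text, starting with a letter
    [\s=-]*$                                    # optional trailing rule characters
    """,
    re.IGNORECASE | re.VERBOSE,
)

def parse_header(line):
    """Return (section_number, title) for a header line, otherwise None."""
    m = HEADER.match(line)
    if m is None:
        return None
    number = (m.group("num") or m.group("num2")).upper()
    return number, m.group("title").strip()
```

The section number is returned as written (for example "3" or "VIII"), so mapping roman numerals to integers and matching titles across files would be separate steps.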

Sometimes the width of the file has been changed, which produces line wraps like the following.

Example:

===================================================
1.    Materials
===================================================

becomes:

=========================================
==========
1.    Materials
=========================================
==========
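
Wrapped separator lines of this kind could be stitched back together in a preprocessing pass before any header detection. The sketch below assumes that a rule line consists only of `=` or `-`, that an unwrapped rule is at least 10 characters long (the `min_len` default is a guess), and that two directly adjacent rule lines of the same character are always one wrapped rule. The names `is_rule` and `merge_wrapped_rules` are hypothetical.

```python
import re

RULE = re.compile(r"^([=-])\1*$")  # a line made of one repeated rule character

def is_rule(text, min_len=1):
    """True if text (ignoring surrounding whitespace) is a rule of at least min_len characters."""
    text = text.strip()
    return len(text) >= min_len and RULE.match(text) is not None

def merge_wrapped_rules(lines, min_len=10):
    """Join a rule line with the following line when that line continues the same rule."""
    merged = []
    for line in lines:
        line = line.rstrip("\n")
        if (merged and is_rule(merged[-1], min_len) and is_rule(line)
                and merged[-1].strip()[0] == line.strip()[0]):
            merged[-1] = merged[-1].strip() + line.strip()
        else:
            merged.append(line)
    return merged
```

This only repairs separators. Wrapped text lines, such as a value that spills onto the next line, cannot be detected this way and need context from the surrounding fields.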

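Here is a complete example:
Hopefully this clarifies the problem of parsing the files. You will notice line wraps, information split across different lines, and so on. Not all files have this exact structure; some are formatted differently, with the information in different places. Here is a link to a paper hard copy.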

MATERIAL SAFETY DATA SHEET

=================================================================
=========
SECTION I-PRODUCT AND PREPARATION INFORMATION
=================================================================
=========

MANUFACTURER:         Some Company Inc     EMERGENCY AND
INFORMATION
TELEPHONE
(111)222-3333
ADDRESS:              Some Road
City, ST
12346

IDENTITY (AS USED ON
LABEL AND LIST):      Some Identity

PREPARATION DATE:     Some Date

=================================================================
=========
SECTION II-HAZARDOUS INGREDIENTS/IDENTITY INFORMATION
=================================================================
=========

OSHA
ACGIH
HAZARDOUS COMPONENTS             CAS#       PEL   TWA        TLV
%
(SPECIFIC CHEMICAL IDENTITY;
COMMON NAME(S)
-----------------------------------------------------------------
---------

Some Chemical             111-22-3   15    10         10
12.34


=================================================================
=========
SECTION III-PHYSICAL/CHEMICAL CHARACTERISTICS
=================================================================
=========

Boiling Point:              N/A  Specific Gravity (H20=1):   N/A
Vapor Pressure (mm Hg):     N/A  Melting Point:              N/A
Vapor Density (AIR=1)       N/A  Evaporation Rate
(Butyl Acetate=1)           N/A
Solubility in Water:        None

Appearance:  Solid, various colors, may have slight
odor.

N/A = Not applicable

=================================================================
=========
SECTION IV-FIRE AND EXPLOSION HAZARD DATA
=================================================================
=========

FLASH POINT (METHOD USED):  None
FLAMMABLE LIMITS:  None          LEL:  N/A        UEL:  N/A
EXTINGUISHING MEDIA:  None
SPECIAL FIRE FIGHTING PROCEDURES:  None required.
UNUSUAL FIRE AND EXPLOSION HAZARDS:  None.

=================================================================
=========
SECTION V-REACTIVITY DATA
=================================================================
=========

STABILITY:  Stable
CONDITIONS TO AVOID:  None
INCOMPATIBILITY (MATERIALS TO AVOID):  None
HAZARDOUS POLYMERIZATION:  Will not occur

=================================================================
=========
SECTION VI-HEALTH HAZARD DATA
=================================================================
=========

ROUTES OF ENTRY:

INHALATION:  Yes
SKIN:  Possibly
INGESTION:  Possibly
EYES:  Possibly

HEALTH HAZARDS (ACUTE AND CHRONIC):  Pneumoconiosis, silicosis,
emphysema,
nose and throat irritation, eye irritation, skin irritation in
some.

CARCINOGENICITY:  No applicable information found.

SIGNS AND SYMPTOMS OF EXPOSURE:  Coughing, sneezing; irritation
of the
mucous membranes; eye irritation; skin irritation or rash, dry
throat.

MEDICAL CONDITIONS GENERALLY AGGRAVATED BY EXPOSURE:  Nasal,
bronchial or
pulmonary conditions which tend to restrict breathing, skin
abrasions.

EMERGENCY AND FIRST AID PROCEDURES:  Remove to fresh air,
irrigate eyes,
wash with soap and water, contact physician if necessary.

=================================================================
=========
SECTION VII-PRECAUTIONS FOR SAFE HANDLING AND USE
=================================================================
=========

STEPS TO BE TAKEN IN CASE MATERIAL IS RELEASED OR SPILLED:
Normal clean-up
procedures.

WASTE DISPOSAL METHOD:  Standard landfill methods consistent with
applicable state and federal regulations.

PRECAUTIONS TO BE TAKEN IN HANDLING AND STORING:  Use caution not
to drop,
crush, break or chip.

OTHER PRECAUTIONS:  Do not use at speeds greater than the
not-to-exceed
speed printed on the hub assembly.

=================================================================
=========
SECTION VIII-CONTROL MEASURES
=================================================================
=========

RESPIRATORY PROTECTION (SPECIFY TYPE):  OSHA or NIOSH approved
respirators
may be required.

VENTILATION:  Local exhaust recommended.  Special:  N/A.
Mechanical:  Useful.  Other:  N/A.

PROTECTIVE GLOVES:  May be useful.

EYE PROTECTION:  Recommended.

OTHER PROTECTIVE CLOTHING OR EQUIPMENT:  Not required.

WORK/HYGIENIC PRACTICES:  Keep clothing and area clean.  Wash to
remove

1 answer:

Answer 0 (score: 2)

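I would write a for loop with plenty of state variables, process the file line by line, and use the state variables to keep track of what is going on. The conditionals (ifs) inside the for loop ask the same "questions" a human would have to ask when parsing the file by hand.

    def parse_fields(lines, field_order):
        # field_order: an assumed fixed sequence of normalized field names
        position = -1  # state variable: index of the previous field in field_order
        for line in map(str.strip, lines):
            if not line:
                continue
            if ":" in line:  # Is there a colon in the line?
                label, _, data = line.partition(":")
                field_name = " ".join(label.upper().split())  # normalize the label
                if field_name in field_order:
                    position = field_order.index(field_name)
            else:  # no label: assume the field that follows the previous one
                position = (position + 1) % len(field_order)
                field_name, data = field_order[position], line
            yield field_name, data.strip()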

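And so on. What I cannot tell from the examples is whether you at least have a fixed order for the fields, and either a maximum number of fields per record or distinct record separators. Without those, I think this will be harder to write.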
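
One complication for the colon test is that some lines in the sample sheet carry several label/value pairs, for example `FLAMMABLE LIMITS:  None          LEL:  N/A        UEL:  N/A`. A possible way to split such lines is sketched below. It assumes that a label either starts the line or follows at least two spaces, that words inside a label are separated by single spaces, and that a label ends at a colon; `LABEL` and `split_pairs` are hypothetical names, and values that themselves contain a colon after a double space would be split incorrectly.

```python
import re

# A label: at line start or after two whitespace characters, words joined by
# single spaces, terminated by a colon.
LABEL = re.compile(r"(?:^|(?<=\s\s))([^\s:]+(?: [^\s:]+)*):")

def split_pairs(line):
    """Return a list of (label, value) pairs found on one line (empty if none)."""
    line = line.strip()
    matches = list(LABEL.finditer(line))
    pairs = []
    for i, m in enumerate(matches):
        # The value runs from the end of this label to the start of the next one.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(line)
        pairs.append((m.group(1), line[m.end():end].strip()))
    return pairs
```

A line with no colon at all returns an empty list, so the caller can fall back to treating it as a continuation of the previous field.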