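I'm trying to use Python to parse thousands of spec-sheet text files (material safety data sheets, specifically) that contain companies, materials, chemical properties, and so on. The files hold similar information in a loosely structured format, so they are human-readable but unstructured and not easy to parse (e.g., not XML or CSV). In short, the data is all over the place.
Originally, the data was entered by hand by different people working at different companies. Another group of people then transcribed the information into these text files (OCR'd it into txt files).
Are there parsing libraries or patterns for extracting this kind of information? (This seems like a "common" data-entry problem.) Regular expressions will be used, of course. I have no experience with natural language processing libraries. Would they even be appropriate for this problem?
My initial thought is to group the files into different categories and then create a set of parsing functions for each format. Unfortunately, this only solves a small part of the problem, and the number of distinct cases can quickly get out of hand.
Since the question is general, I'll provide some examples that illustrate the problem.
Address Information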
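Each file contains company information, such as the company name and address. The information may or may not have field identifiers, it may or may not be on a single line, and so on. In short, every combination seems to occur.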
Ex. (with field info):
MANUFACTURER: Foo Bar Inc.
ADDRESS: 123 Foo St.
Bar, CA 90012
Ex. (w/o field info):
Foo Bar Inc.
123 Foo St.
Bar, CA 90012
Ex. (sometimes there are extra lines between the information):
FOO BAR INC.
123 FOO ST.
BAR, CA 90012
Ex. (inconsistent field names):
MANUFACTURER'S NAME: FOO BAR INC.
CREATIVE DIVISION
ADDRESS: 123 FOO ST.
CITY, STATE & ZIP: BAR, CALIFORNIA 90012
PHONE NUMBER: 310-111-2222
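For illustration only, here is a minimal sketch of how the inconsistent labels above might be mapped to canonical field names. The alias table and the function name `split_labeled_line` are hypothetical; a real table would have to be built up from the actual files.

```python
import re

# Hypothetical alias table: label variants seen in the examples above,
# mapped to canonical field names. A real table would be much larger.
FIELD_ALIASES = {
    "MANUFACTURER": "manufacturer",
    "MANUFACTURER'S NAME": "manufacturer",
    "ADDRESS": "address",
    "CITY, STATE & ZIP": "city_state_zip",
    "PHONE NUMBER": "phone",
}

# The label is everything before the first colon; the value is the rest.
LABELED = re.compile(r"^\s*([^:]+?)\s*:\s*(.*)$")


def split_labeled_line(line):
    """Return (canonical_name, value) for a known label, else (None, stripped line)."""
    match = LABELED.match(line)
    if match:
        # Upper-case and collapse whitespace so spelling variants compare equal.
        label = " ".join(match.group(1).upper().split())
        if label in FIELD_ALIASES:
            return FIELD_ALIASES[label], match.group(2).strip()
    return None, line.strip()
```

Lines that come back with `None` as the name (such as `CREATIVE DIVISION` or an unlabeled street address) still need positional or contextual handling, which is the hard part of the problem.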
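Section Information
The spec sheets also have similar sections, but the order, titles, numbering style, and separators are inconsistent.
Ex: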
========================================
SECTION 1 -- MATERIALS
========================================
Ex:
Section I. Materials
------------------------------------------
Ex:
----- Section 3 Materials
Sometimes the width of the file has been changed, which causes line wraps like the following.
Ex:
===================================================
1. Materials
===================================================
becomes:
=========================================
==========
1. Materials
=========================================
==========
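As a rough sketch of one way to cope with these header variants: treat any line made up only of `=` or `-` as a separator (which also swallows the wrapped remainder of a rule), then look for either the word "Section" followed by a Roman or Arabic numeral, or a bare "1." style number. The pattern names and `find_sections` are made up for illustration, and the patterns only cover the variants shown above.

```python
import re

# A rule line consists only of '=' or '-' characters (three or more).
RULE = re.compile(r"^\s*[=\-]{3,}\s*$")

# "SECTION 1 -- MATERIALS", "Section I. Materials", "----- Section 3 Materials"
WITH_KEYWORD = re.compile(
    r"^\W*section\s+([ivxlc]+|\d+)\b\W*(.*?)\s*$", re.IGNORECASE
)

# "1. Materials" (no keyword, so a dot after the number is required)
NUMBER_ONLY = re.compile(r"^\s*(\d+)\.\s+(\S.*?)\s*$")


def find_sections(lines):
    """Yield (number, title) for each line that looks like a section header."""
    for line in lines:
        if RULE.match(line):
            continue  # separator, including the wrapped tail of a long rule
        match = WITH_KEYWORD.match(line) or NUMBER_ONLY.match(line)
        if match:
            yield match.group(1).upper(), match.group(2)
```

Requiring either the keyword or the trailing dot keeps lines such as `123 Foo St.` from being mistaken for headers, at the cost of missing header styles not listed here.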
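Here's a full example:
Hopefully this clarifies the problem of parsing the files. You'll notice line wraps, information split across different lines, and so on. Not all files have this exact structure; some are formatted differently, with information in different places. Here's a link to a paper hard copy.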
MATERIAL SAFETY DATA SHEET
=================================================================
=========
SECTION I-PRODUCT AND PREPARATION INFORMATION
=================================================================
=========
MANUFACTURER: Some Company Inc EMERGENCY AND
INFORMATION
TELEPHONE
(111)222-3333
ADDRESS: Some Road
City, ST
12346
IDENTITY (AS USED ON
LABEL AND LIST): Some Identity
PREPARATION DATE: Some Date
=================================================================
=========
SECTION II-HAZARDOUS INGREDIENTS/IDENTITY INFORMATION
=================================================================
=========
OSHA
ACGIH
HAZARDOUS COMPONENTS CAS# PEL TWA TLV
%
(SPECIFIC CHEMICAL IDENTITY;
COMMON NAME(S)
-----------------------------------------------------------------
---------
Some Chemical 111-22-3 15 10 10
12.34
=================================================================
=========
SECTION III-PHYSICAL/CHEMICAL CHARACTERISTICS
=================================================================
=========
Boiling Point: N/A Specific Gravity (H20=1): N/A
Vapor Pressure (mm Hg): N/A Melting Point: N/A
Vapor Density (AIR=1) N/A Evaporation Rate
(Butyl Acetate=1) N/A
Solubility in Water: None
Appearance: Solid, various colors, may have slight
odor.
N/A = Not applicable
=================================================================
=========
SECTION IV-FIRE AND EXPLOSION HAZARD DATA
=================================================================
=========
FLASH POINT (METHOD USED): None
FLAMMABLE LIMITS: None LEL: N/A UEL: N/A
EXTINGUISHING MEDIA: None
SPECIAL FIRE FIGHTING PROCEDURES: None required.
UNUSUAL FIRE AND EXPLOSION HAZARDS: None.
=================================================================
=========
SECTION V-REACTIVITY DATA
=================================================================
=========
STABILITY: Stable
CONDITIONS TO AVOID: None
INCOMPATIBILITY (MATERIALS TO AVOID): None
HAZARDOUS POLYMERIZATION: Will not occur
=================================================================
=========
SECTION VI-HEALTH HAZARD DATA
=================================================================
=========
ROUTES OF ENTRY:
INHALATION: Yes
SKIN: Possibly
INGESTION: Possibly
EYES: Possibly
HEALTH HAZARDS (ACUTE AND CHRONIC): Pneumoconiosis, silicosis,
emphysema,
nose and throat irritation, eye irritation, skin irritation in
some.
CARCINOGENICITY: No applicable information found.
SIGNS AND SYMPTOMS OF EXPOSURE: Coughing, sneezing; irritation
of the
mucous membranes; eye irritation; skin irritation or rash, dry
throat.
MEDICAL CONDITIONS GENERALLY AGGRAVATED BY EXPOSURE: Nasal,
bronchial or
pulmonary conditions which tend to restrict breathing, skin
abrasions.
EMERGENCY AND FIRST AID PROCEDURES: Remove to fresh air,
irrigate eyes,
wash with soap and water, contact physician if necessary.
=================================================================
=========
SECTION VII-PRECAUTIONS FOR SAFE HANDLING AND USE
=================================================================
=========
STEPS TO BE TAKEN IN CASE MATERIAL IS RELEASED OR SPILLED:
Normal clean-up
procedures.
WASTE DISPOSAL METHOD: Standard landfill methods consistent with
applicable state and federal regulations.
PRECAUTIONS TO BE TAKEN IN HANDLING AND STORING: Use caution not
to drop,
crush, break or chip.
OTHER PRECAUTIONS: Do not use at speeds greater than the
not-to-exceed
speed printed on the hub assembly.
=================================================================
=========
SECTION VIII-CONTROL MEASURES
=================================================================
=========
RESPIRATORY PROTECTION (SPECIFY TYPE): OSHA or NIOSH approved
respirators
may be required.
VENTILATION: Local exhaust recommended. Special: N/A.
Mechanical: Useful. Other: N/A.
PROTECTIVE GLOVES: May be useful.
EYE PROTECTION: Recommended.
OTHER PROTECTIVE CLOTHING OR EQUIPMENT: Not required.
WORK/HYGIENIC PRACTICES: Keep clothing and area clean. Wash to
remove
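Answer 0 (score: 2)
I would write a for loop with a lot of state variables, process each line, and use the state variables to keep track of what is going on. The conditionals (if) in the for loop would ask the "questions" that a human would have to ask when parsing the file by hand.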
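    previous_field = None
    for line in file:
        if ":" in line:  # is there a colon in the line?
            label, _, data = line.partition(":")
            field_name = normalize(label)  # information before the colon
        else:
            field_name = next_field_in_list(previous_field)
            data = line  # no label, so the whole line is data
        previous_field = field_name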
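And so on. I can't tell from the examples whether you at least have a fixed order for the fields, a maximum number of fields per record, or a distinct record separator. Without those, I think this will be harder to write.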
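Here is a runnable variant of the loop above, as a sketch only. It keeps a single state variable (the current field) and, instead of `next_field_in_list`, assumes that an unlabeled line is a wrapped continuation of the previous field, which matches the wrapped lines in the full example but not the unlabeled address blocks. The function name `parse_fields` and the `known_fields` set are hypothetical.

```python
def parse_fields(lines, known_fields):
    """Collect {field_name: value}; unlabeled lines continue the current field."""
    record = {}
    current = None  # state: the field that continuation lines belong to
    for raw in lines:
        line = raw.strip()
        if not line or set(line) <= set("=-"):
            continue  # blank line or separator rule
        label, colon, rest = line.partition(":")
        # Upper-case and collapse whitespace so label variants compare equal.
        name = " ".join(label.upper().split())
        if colon and name in known_fields:
            current = name
            record[current] = rest.strip()
        elif current is not None:
            # Assumption: no recognized label means the previous value wrapped.
            record[current] = (record[current] + " " + line).strip()
    return record
```

For example, with `known_fields = {"STABILITY", "CONDITIONS TO AVOID"}`, the lines from Section V yield a dict keyed by those labels. Checking the label against a known set avoids splitting on colons that occur inside values, but it means every label variant has to be listed in advance.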