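I am working on a project where I have to separate proper sentences from a text corpus. I tried the NLTK sentence tokenizer, but it seems to split sentences purely on the period (".").
So I was wondering whether there is a way to separate tabular data and short phrases from the running text in the file.
Here is a sample text file. The content I am referring to is under the TEXT tag.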
<?xml version='1.0' encoding='UTF-8'?>
<root>
<TEXT><![CDATA[
Record date: 2078-09-07
RYBURY HOSPITAL INTERN ADMISSION NOTE
Name: Goldberg, Joel
MR #: 0370149
Date of admission: 9/6/2078
Resident: Lange/Bailey
Attending: Schmidt MD
PCP: Odom, Kacie MD
CC: L foot pain
HPI: The patient is a 48 yo gentleman with a hx of DM2, peripheral neuropathy and PVD with multiple admissions for LE cellulitis in the setting of gangrenous toes in the past 5 years, last one in July. He now presents with acute on chronic LLE sweeling that began this morning after he got up walked around his home for about 2-3 hrs and then suddenly felt an acute pain shooting up his leg, with a severity of 10/10, he knew right away this was similar to the pain he had felt before on prior admissions for cellulitis so he called 911. On arrival to the ED his temp was 98.1, 112, 145/79, 20, 99%RA and was started on antibiotic treatment with Unasyn for cellulitis.
ROS: Per HPI. No F/C/NS. No CP/Palps. No Orthopnea. No SOB/cough/hemoptysis/wheezing/sore throat/. No hematochezia/melena. No delta MS/LOC. No slurring of speech, unilateral weakness. No dysuria. No chills or fevers, no lightheadedness.
PMH:
1. DM2 diagnosed in 2075, says peripheral neuropathy was diagnosed around the same time, denies any retinopathy or nephropathy.
2. Peripheral vascular disease with the following surgeries performed:
Right 5th toe amputation 2/2 osteomyelitis 12/14/76
Right 4th toe amputation 2/2 wet gangrene 9/03/76
Angioplasty and stenting of the distal LEFT superficial femoral artery 11/6/76
Angioplasty and stenting of the distal RIGHT superficial femoral artery 7/20/76
I&D of right thigh abscess 4/75
Medications on admission (confirmed with patient):
1. Glyburide 2.5mg BID
2. Glucopahge 500mg QD
3. Zestril 2.5mg QD
4. Percocet PRN
ALL: Codeine upsets his stomach
SH: Lives in Arroyo Grande apartment with friend, works occasionally as a copy editor but unemployed right now, has smoked 1/2ppd for 35 years, no ETOH, no drugs. Adequate diet.
FH: Many family members with DM.
Physical Exam:
V: 98.5, 149/84, 98, 18, 99%RA
Gen: NAD, conversant
HEENT: PERRL, EOMI.
Neck: Supple, no thyromegaly, no carotid bruits, JVP
Nodes: No cervical or supraclavicular LAN
Cor: RRR S1, S2 nl. No m/r/g. No S3, S4
Chest: CTAB
Abdomen: +BS Soft, NT, ND. No HSM, No CVA tenderness.
Ext: LLE with dorsal and medial erythema, extending from L 5th toe that has eschar on its side and is mildly tender, no secretions. L toe also tender. Pulse on LLE + and RLE ++.
Skin: No other rashes
Neuro: AO X 3. CN II-XII intact. Decreased sensation from LT up to knee on R and 4cm above ankle on Left..
Labs and Studies:
RSC
09/06/78
10:17
NA 137
K 4.5(T)
CL 106
CO2 28.4
BUN 25
CRE 0.9
GLU 266(H)
CA 9.5
PHOS 3.1
MG 1.7
CBC
WBC 12.7(H)
RBC 4.13(L)
HGB 13.0(L)
HCT 37.1(L)
MCV 90
MCH 31.5
MCHC 35.0
PLT 165
RDW 13.3
DIFFR Received
METHOD Auto
%NEUT 79(H)
%LYMPH 17(L)
%MONO 3(L)
%EOS 1
%BASO 0
ANEUT 10.02(H)
ALYMP 2.13
AMONS 0.44(H)
AEOSN 0.11
ABASOP 0.03
ANISO None
HYPO None
MACRO None
MICRO None
PT 11.9
PTT 25.0
LENIS: Negative for DVT, did not assess arteries.
FOOT ANKLE XR: There is a lytic lesion in the distal lateral aspect of the proximal phalanx of the fifth toe. This can be consitent with an area of infection/osteomyelitis.
Microbiology
21-Jul-2076 09:41
Specimen Type: WOUND
Specimen Comment: ULCER 4TH 5TH TOE
Wound Culture - Final Reported: 24-Jul-76 15:05
Moderate PROTEUS VULGARIS
RAPID METHOD
Antibiotic Interpretation
----------------------------------------------
Amikacin Susceptible
Ampicillin Resistant
Aztreonam Susceptible
Cefazolin Resistant
Cefepime Susceptible
Cefpodoxime Susceptible
Ceftriaxone Susceptible
Gentamicin Susceptible
Levofloxacin Susceptible
Piperacillin Susceptible
Trimethoprim/Sulfamethoxazole Susceptible
A/P: 48M with a hx of DM2, PVD and multiple admissions in the past for LE cellulitis in the setting of gangrene.
1. ID: Patient is now presenting with appears to be another episode of cellulitis but now probably coming from his L 5th Toe lesion. Surgery has debrided the wound, sending wound cultures as well as blood cultures. Acute OM would not be visible on XR changes and clinical picture is more consistent with acute than Chronic OM. Will consider further work up for OM if symptoms do not respond to treatment. Levo and flagyl were added to unasyn in accord to previous culture data.
2. PVD: Will need arterial LENIS to assess for vascular patency and flow. Continuing ACEI, and adding ASA and lipitor, will order lipid profile and smoking cessation consult.
3. DM2: Very poor control last admission, eventhough patient now says he takes medications and checks it up to QID. Will order HgbA1C and glucose monitoring.
_______________________________________________________________________
Name Ian Jurado MD
Pager # 14558
PGY-1
]]></TEXT>
<TAGS>
<MEDICATION id="DOC0" time="during DCT" type1="ACE inhibitor" type2="">
<MEDICATION id="M0" start="1782" end="1789" text="Zestril" time="during DCT" type1="ACE inhibitor" type2="" comment=""/>
<MEDICATION id="M1" start="1782" end="1789" text="Zestril" time="during DCT" type1="ACE inhibitor" type2="" comment=""/>
<MEDICATION id="M2" start="1782" end="1789" text="Zestril" time="during DCT" type1="ACE inhibitor" type2="" comment=""/>
</MEDICATION>
<MEDICATION id="DOC1" time="after DCT" type1="statin" type2="">
<MEDICATION id="M3" start="7111" end="7133" text="adding ASA and lipitor" time="after DCT" type1="statin" type2="" comment=""/>
<MEDICATION id="M4" start="7126" end="7133" text="lipitor" time="after DCT" type1="statin" type2="" comment=""/>
</MEDICATION>
<DIABETES id="DOC2" time="before DCT" indicator="mention">
<DIABETES id="D0" start="296" end="299" text="DM2" time="before DCT" indicator="mention" comment=""/>
<DIABETES id="D1" start="296" end="299" text="DM2" time="before DCT" indicator="mention" comment=""/>
<DIABETES id="D2" start="1180" end="1183" text="DM2" time="before DCT" indicator="mention" comment=""/>
<DIABETES id="D3" start="6444" end="6447" text="DM2" time="before DCT" indicator="mention" comment=""/>
<DIABETES id="D4" start="7195" end="7198" text="DM2" time="before DCT" indicator="mention" comment=""/>
<DIABETES id="D5" start="296" end="299" text="DM2" time="before DCT" indicator="mention" comment=""/>
</DIABETES>
<MEDICATION id="DOC3" time="after DCT" type1="sulfonylureas" type2="">
<MEDICATION id="M5" start="1734" end="1743" text="Glyburide" time="after DCT" type1="sulfonylureas" type2="" comment=""/>
<MEDICATION id="M6" start="1734" end="1743" text="Glyburide" time="after DCT" type1="sulfonylureas" type2="" comment=""/>
<MEDICATION id="M7" start="1734" end="1743" text="Glyburide" time="after DCT" type1="sulfonylureas" type2="" comment=""/>
</MEDICATION>
<MEDICATION id="DOC4" time="after DCT" type1="metformin" type2="">
<MEDICATION id="M8" start="1758" end="1768" text="Glucopahge" time="after DCT" type1="metformin" type2="" comment=""/>
<MEDICATION id="M9" start="1758" end="1768" text="Glucopahge" time="after DCT" type1="metformin" type2="" comment=""/>
<MEDICATION id="M10" start="1758" end="1768" text="Glucopahge" time="after DCT" type1="metformin" type2="" comment=""/>
</MEDICATION>
<MEDICATION id="DOC5" time="during DCT" type1="metformin" type2="">
<MEDICATION id="M11" start="1758" end="1768" text="Glucopahge" time="during DCT" type1="metformin" type2="" comment=""/>
<MEDICATION id="M12" start="1758" end="1768" text="Glucopahge" time="during DCT" type1="metformin" type2="" comment=""/>
<MEDICATION id="M13" start="1758" end="1768" text="Glucopahge" time="during DCT" type1="metformin" type2="" comment=""/>
</MEDICATION>
<HYPERTENSION id="DOC6" time="during DCT" indicator="high bp">
<HYPERTENSION id="H0" start="2100" end="2106" text="149/84" time="during DCT" indicator="high bp" comment=""/>
<HYPERTENSION id="H1" start="828" end="834" text="145/79" time="during DCT" indicator="high bp" comment=""/>
</HYPERTENSION>
<MEDICATION id="DOC7" time="before DCT" type1="ACE inhibitor" type2="">
<MEDICATION id="M14" start="1782" end="1789" text="Zestril" time="before DCT" type1="ACE inhibitor" type2="" comment=""/>
<MEDICATION id="M15" start="1782" end="1789" text="Zestril" time="before DCT" type1="ACE inhibitor" type2="" comment=""/>
<MEDICATION id="M16" start="1782" end="1789" text="Zestril" time="before DCT" type1="ACE inhibitor" type2="" comment=""/>
</MEDICATION>
<SMOKER id="DOC8" status="current">
<SMOKER id="S0" start="7163" end="7191" text=" smoking cessation consult. " status="current" comment=""/>
<SMOKER id="S1" start="1965" end="1995" text="has smoked 1/2ppd for 35 years" status="current" comment=""/>
<SMOKER id="S2" start="1969" end="1995" text="smoked 1/2ppd for 35 years" status="current" comment=""/>
</SMOKER>
<MEDICATION id="DOC9" time="before DCT" type1="metformin" type2="">
<MEDICATION id="M17" start="1758" end="1768" text="Glucopahge" time="before DCT" type1="metformin" type2="" comment=""/>
<MEDICATION id="M18" start="1758" end="1768" text="Glucopahge" time="before DCT" type1="metformin" type2="" comment=""/>
<MEDICATION id="M19" start="1758" end="1768" text="Glucopahge" time="before DCT" type1="metformin" type2="" comment=""/>
</MEDICATION>
<MEDICATION id="DOC10" time="after DCT" type1="ACE inhibitor" type2="">
<MEDICATION id="M20" start="1782" end="1789" text="Zestril" time="after DCT" type1="ACE inhibitor" type2="" comment=""/>
<MEDICATION id="M21" start="1782" end="1789" text="Zestril" time="after DCT" type1="ACE inhibitor" type2="" comment=""/>
<MEDICATION id="M22" start="1782" end="1789" text="Zestril" time="after DCT" type1="ACE inhibitor" type2="" comment=""/>
</MEDICATION>
<MEDICATION id="DOC11" time="during DCT" type1="sulfonylureas" type2="">
<MEDICATION id="M23" start="1734" end="1743" text="Glyburide" time="during DCT" type1="sulfonylureas" type2="" comment=""/>
<MEDICATION id="M24" start="1734" end="1743" text="Glyburide" time="during DCT" type1="sulfonylureas" type2="" comment=""/>
<MEDICATION id="M25" start="1734" end="1743" text="Glyburide" time="during DCT" type1="sulfonylureas" type2="" comment=""/>
</MEDICATION>
<DIABETES id="DOC12" time="during DCT" indicator="mention">
<DIABETES id="D6" start="296" end="299" text="DM2" time="during DCT" indicator="mention" comment=""/>
<DIABETES id="D7" start="296" end="299" text="DM2" time="during DCT" indicator="mention" comment=""/>
<DIABETES id="D8" start="1180" end="1183" text="DM2" time="during DCT" indicator="mention" comment=""/>
<DIABETES id="D9" start="6444" end="6447" text="DM2" time="during DCT" indicator="mention" comment=""/>
<DIABETES id="D10" start="7195" end="7198" text="DM2" time="during DCT" indicator="mention" comment=""/>
<DIABETES id="D11" start="296" end="299" text="DM2" time="during DCT" indicator="mention" comment=""/>
</DIABETES>
<MEDICATION id="DOC13" time="after DCT" type1="aspirin" type2="">
<MEDICATION id="M26" start="7111" end="7133" text="adding ASA and lipitor" time="after DCT" type1="aspirin" type2="" comment=""/>
<MEDICATION id="M27" start="7118" end="7121" text="ASA" time="after DCT" type1="aspirin" type2="" comment=""/>
</MEDICATION>
<FAMILY_HIST id="DOC14" indicator="not present">
<FAMILY_HIST id="F0" indicator="not present"/>
<FAMILY_HIST id="F1" indicator="not present"/>
<FAMILY_HIST id="F2" indicator="not present"/>
</FAMILY_HIST>
<DIABETES id="DOC15" time="after DCT" indicator="mention">
<DIABETES id="D12" start="296" end="299" text="DM2" time="after DCT" indicator="mention" comment=""/>
<DIABETES id="D13" start="296" end="299" text="DM2" time="after DCT" indicator="mention" comment=""/>
<DIABETES id="D14" start="1180" end="1183" text="DM2" time="after DCT" indicator="mention" comment=""/>
<DIABETES id="D15" start="6444" end="6447" text="DM2" time="after DCT" indicator="mention" comment=""/>
<DIABETES id="D16" start="7195" end="7198" text="DM2" time="after DCT" indicator="mention" comment=""/>
<DIABETES id="D17" start="296" end="299" text="DM2" time="after DCT" indicator="mention" comment=""/>
</DIABETES>
<MEDICATION id="DOC16" time="before DCT" type1="sulfonylureas" type2="">
<MEDICATION id="M28" start="1734" end="1743" text="Glyburide" time="before DCT" type1="sulfonylureas" type2="" comment=""/>
<MEDICATION id="M29" start="1734" end="1743" text="Glyburide" time="before DCT" type1="sulfonylureas" type2="" comment=""/>
<MEDICATION id="M30" start="1734" end="1743" text="Glyburide" time="before DCT" type1="sulfonylureas" type2="" comment=""/>
</MEDICATION>
<PHI id="P0" start="16" end="26" text="2078-09-07" TYPE="DATE"/>
<PHI id="P1" start="39" end="54" text="RYBURY HOSPITAL" TYPE="HOSPITAL"/>
<PHI id="P2" start="88" end="102" text="Goldberg, Joel" TYPE="PATIENT"/>
<PHI id="P3" start="110" end="117" text="0370149" TYPE="MEDICALRECORD"/>
<PHI id="P4" start="139" end="147" text="9/6/2078" TYPE="DATE"/>
<PHI id="P5" start="159" end="164" text="Lange" TYPE="DOCTOR"/>
<PHI id="P6" start="165" end="171" text="Bailey" TYPE="DOCTOR"/>
<PHI id="P7" start="184" end="191" text="Schmidt" TYPE="DOCTOR"/>
<PHI id="P8" start="201" end="212" text="Odom, Kacie" TYPE="DOCTOR"/>
<PHI id="P9" start="267" end="269" text="48" TYPE="AGE"/>
<PHI id="P10" start="441" end="445" text="July" TYPE="DATE"/>
<PHI id="P11" start="1197" end="1201" text="2075" TYPE="DATE"/>
<PHI id="P12" start="1422" end="1430" text="12/14/76" TYPE="DATE"/>
<PHI id="P13" start="1474" end="1481" text="9/03/76" TYPE="DATE"/>
<PHI id="P14" start="1554" end="1561" text="11/6/76" TYPE="DATE"/>
<PHI id="P15" start="1635" end="1642" text="7/20/76" TYPE="DATE"/>
<PHI id="P16" start="1671" end="1675" text="4/75" TYPE="DATE"/>
<PHI id="P17" start="1866" end="1879" text="Arroyo Grande" TYPE="CITY"/>
<PHI id="P18" start="1927" end="1938" text="copy editor" TYPE="PROFESSION"/>
<PHI id="P19" start="2717" end="2720" text="RSC" TYPE="HOSPITAL"/>
<PHI id="P20" start="2740" end="2748" text="09/06/78" TYPE="DATE"/>
<PHI id="P21" start="5510" end="5521" text="21-Jul-2076" TYPE="DATE"/>
<PHI id="P22" start="5638" end="5647" text="24-Jul-76" TYPE="DATE"/>
<PHI id="P23" start="6427" end="6429" text="48" TYPE="AGE"/>
<PHI id="P24" start="7431" end="7441" text="Ian Jurado" TYPE="DOCTOR"/>
<PHI id="P25" start="7485" end="7490" text="14558" TYPE="PHONE"/>
</TAGS>
</root>
Whenever I try to tokenize the text above into sentences, NLTK gets confused and returns a lone period (".") as a sentence of its own.
Answer 0 (score: 0)
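Some lines in this file (really paragraphs) contain more than one sentence. Split the file into lines, then apply the sentence tokenizer to each line separately. This prevents text from different lines being merged into one sentence, and it will give better results than rolling your own regex-based sentence splitter. For example: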
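import nltk  # sent_tokenize needs the Punkt models, e.g. nltk.download("punkt")
# `file` is assumed to be an already-open text file object for the document
text = file.read()
lines = text.splitlines()
# Tokenize each line on its own so sentences never span a line break
sentences = [s for line in lines for s in nltk.sent_tokenize(line)]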
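
The per-line approach above still passes table rows and header fields such as `NA 137` or `Name: Goldberg, Joel` to the tokenizer. A minimal sketch of one way to pull the note out of the TEXT element and skip such lines first is shown below. The function names (`extract_note_text`, `looks_like_prose`, `sentences_per_line`) and the `min_words` threshold are hypothetical and not part of NLTK. The filter is only a rough heuristic that assumes prose lines are reasonably long and end in sentence punctuation, so it will also drop long phrases that lack a final period.

```python
import xml.etree.ElementTree as ET
from typing import Callable, List


def extract_note_text(xml_source: str) -> str:
    """Return the content of the first <TEXT> child of the root element."""
    root = ET.fromstring(xml_source)
    return root.findtext("TEXT", default="")


def looks_like_prose(line: str, min_words: int = 6) -> bool:
    """Heuristic (an assumption, tune for your corpus): a prose line has at
    least `min_words` whitespace-separated tokens and ends with '.', '?' or '!'."""
    stripped = line.strip()
    return len(stripped.split()) >= min_words and stripped.endswith((".", "?", "!"))


def sentences_per_line(
    text: str,
    tokenize: Callable[[str], List[str]],
    keep: Callable[[str], bool] = looks_like_prose,
) -> List[str]:
    """Apply `tokenize` to each kept line separately, so that sentences
    never merge across line breaks."""
    sentences: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if line and keep(line):
            sentences.extend(tokenize(line))
    return sentences
```

With NLTK installed, this could be called as `sentences_per_line(extract_note_text(xml_source), nltk.sent_tokenize)`. Passing the tokenizer in as an argument keeps the filtering step independent of NLTK, so a different sentence splitter can be swapped in later.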