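Below is a small sample of my Arabic XML. I want to stem everything in the XML except the <en> tags, and I want the words to be changed in the original XML file.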
<?xml version='1.0' encoding='UTF-8' ?>
<TEXT>
<PHRASE>
<PSEUDO-V>ان</PSEUDO-V>
<N>وزير</N>
<N>الخارجية</N>
<en x='PERS'>فرانك فالتر شتاينماير</en>
<V y='0'>سيتوجه</V>
<N>السبت</N>
<PREP>إلى</PREP>
<en x='LOC'>الشرق الأوسط</en>
</PHRASE>
<PHRASE>
<V>علم</V>
<N>الأهل</N>
<PREP>ب</PREP>
<N y='1'>مغادرت</N>
<en x='PERS'>البابا</en>
<PREP y='1'>إلى</PREP>
<en x='LOC'>المدينة مكة</en>
</PHRASE>
<PHRASE>
</TEXT>
I tried the following, but for some reason it does not work.
Note: the x attribute of the <en> tag takes one of the values LOC, PERS, DATE, ORG.
import re
import xml.etree.ElementTree as ET
from nltk.stem.isri import ISRIStemmer

st = ISRIStemmer()
tree2 = ET.parse('TrainBaseEnglishcopy.xml')
root2 = tree2.getroot()
for phrase in root2.findall('./PHRASE'):
    ens = {en.get('x'): en.text for en in phrase.findall('en')}
    if not ('ORG' in ens and 'PERS' in ens and 'LOC' in ens and 'DATE' in ens):
        phrase = st.stem(phrase)
I get this error:
Traceback (most recent call last):
  File "20Dec.py", line 475, in <module>
    phrase=st.stem(phrase)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/stem/isri.py", line 153, in stem
    token = self.norm(token, 1)   # remove diacritics which representing Arabic short vowels
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/stem/isri.py", line 186, in norm
    word = self.re_short_vowels.sub('', word)
TypeError: expected string or bytes-like object
Note: stemming a plain string on its own works fine. For example, this runs without errors:
w = 'يعمل'
print(st.stem(w))
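--- Update --- I got it to work the following way, but I have to repeat it for every tag, and it still does not change the words in the original XML file. Any ideas?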
for phrase in root2.findall('./PHRASE/N'):
    ens = {en.get('x'): en.text for en in phrase.findall('en')}
    if not ('ORG' in ens and 'PERS' in ens and 'LOC' in ens and 'DATE' in ens):
        phrase.text = st.stem(phrase.text)
        print(phrase.text)
Answer 0 (score: 1)
To modify the XML file, you should commit your changes with a final tree.write call:
tree2 = ET.parse('TrainBaseEnglishcopy.xml')
root2 = tree2.getroot()
# ...manipulate tree...
tree2.write("out.xml", encoding="UTF-8")
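The TypeError in the question appears to come from passing an Element to st.stem, which expects a string, so the stemmer has to be applied to each element's .text instead. A minimal sketch of one way to do that for every tag except <en>, without repeating the loop per tag, is shown here. The names stem_phrases and toy_stem are hypothetical and not part of ElementTree or NLTK; toy_stem is only a stand-in so the snippet runs without NLTK installed, and in practice you would presumably pass ISRIStemmer().stem as the stem argument.

```python
import xml.etree.ElementTree as ET


def stem_phrases(root, stem, skip_tags=("en",)):
    """Stem the text of each direct child of every PHRASE, except skip_tags."""
    for phrase in root.findall("./PHRASE"):
        for child in phrase:  # direct children: N, V, PREP, en, ...
            if child.tag in skip_tags or not child.text:
                continue  # leave named entities and empty elements untouched
            child.text = stem(child.text)  # pass the string, not the Element
    return root


def toy_stem(word):
    """Hypothetical stand-in stemmer: drops a leading definite article."""
    if word.startswith("ال") and len(word) > 3:
        return word[2:]
    return word


# Small inline sample in the same shape as the XML in the question.
sample = (
    "<TEXT><PHRASE>"
    "<N>الخارجية</N>"
    "<en x='LOC'>الشرق الأوسط</en>"
    "<PREP>إلى</PREP>"
    "</PHRASE></TEXT>"
)

root = ET.fromstring(sample)
stem_phrases(root, toy_stem)  # with NLTK: stem_phrases(root, ISRIStemmer().stem)

# Serialize the modified tree to a string to inspect the result.
xml_out = ET.tostring(root, encoding="unicode")
print(xml_out)
```

With a tree loaded through ET.parse, the same function can be called on tree2.getroot(), followed by the tree2.write call shown above. Passing the original filename to tree2.write overwrites the original file, so writing to a separate file first is the safer choice.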