Question:
I am trying to extract a series of proper-noun phrases from job descriptions, as in the example below.
text = "Civil, Mechanical, and Industrial Engineering majors are preferred."
I would like to extract the following from this text:
Civil Engineering
Mechanical Engineering
Industrial Engineering
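This is only one instance of the problem, so application-specific information cannot be used. For example, I cannot keep a list of major names and then check whether parts of those names appear in a sentence together with the word "major", because I need this to work for other kinds of sentences as well.
Attempts:
1. I looked into spacy dependency-parsing, but the parent-child relationship does not show up between each engineering type (Civil, Mechanical, Industrial) and the word Engineering.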
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u"Civil, Mechanical, and Industrial Engineering majors are preferred.")
print( "%-15s%-15s%-15s%-15s%-30s" % ( "TEXT","DEP","HEAD TEXT","HEAD POS","CHILDREN" ) )
for token in doc:
    if token.text not in ( ',','.' ):
        print( "%-15s%-15s%-15s%-15s%-30s" %
            (
                token.text
                ,token.dep_
                ,token.head.text
                ,token.head.pos_
                ,','.join( str(c) for c in token.children )
            ) )
...output...
TEXT           DEP            HEAD TEXT      HEAD POS       CHILDREN
Civil          amod           majors         NOUN           ,,Mechanical
Mechanical     conj           Civil          ADJ            ,,and
and            cc             Mechanical     PROPN
Industrial     compound       Engineering    PROPN
Engineering    compound       majors         NOUN           Industrial
majors         nsubjpass      preferred      VERB           Civil,Engineering
are            auxpass        preferred      VERB
preferred      ROOT           preferred      VERB           majors,are,.
2. I also tried nltk pos tagging, but I get the following...
import nltk
nltk.pos_tag(nltk.word_tokenize('Civil, Mechanical, and Industrial Engineering majors are preferred.'))
[('Civil', 'NNP'), (',', ','), ('Mechanical', 'NNP'), (',', ','), ('and', 'CC'), ('Industrial', 'NNP'), ('Engineering', 'NNP'), ('majors', 'NNS'), ('are', 'VBP'), ('preferred', 'VBN'), ('.', '.')]
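The engineering types and the word Engineering are all tagged NNP (proper noun), so every RegexpParser pattern I can think of fails to pair each type with Engineering.
Question:
Does anyone know of a way, in Python 3, to extract these noun-phrase pairings?
EDIT: Additional examples
The following examples are similar to the first one, except that they are verb-noun / verb-proper-noun versions.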
text="Experience with testing and automating API’s/GUI’s for desktop and native iOS/Android"
Extract:
testing API’s/GUI’s
automation API’s/GUI’s

text="Design, build, test, deploy and maintain effective test automation solutions"
Extract:
Design test automation solutions
build test automation solutions
test test automation solutions
deploy test automation solutions
maintain test automation solutions
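Answer 0: (score: 0)
Without any external imports, and assuming the list is always comma-separated with an optional "and" before the last item, you can write a regex and do some string manipulation to get the desired output: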
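import re
test_string = "Civil, Mechanical, and Industrial Engineering majors are preferred."
# group 1: the whole list including the shared head word; group 2: the head word
result = re.search(r"((?:[A-Z][a-z]+, )+(?:and )?[A-Z][a-z]+ ([A-Z][a-z]+))", test_string)
group_type = result.group(2)
# slice the head word off the end (str.rstrip removes a set of characters, not a suffix)
string_list = result.group(1)[:-len(group_type)].strip()
# drop a leading "and " as a prefix (str.strip('and ') would also eat trailing a/n/d letters)
items = [re.sub(r"^and\s+", "", i.strip()) + ' ' + group_type for i in string_list.split(',')]
print(items) # ['Civil Engineering', 'Mechanical Engineering', 'Industrial Engineering']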
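Again, all of this rests on a narrow assumption about the list format. If more variations are possible, the regex pattern will need to be adjusted.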
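If it helps, the same idea can be wrapped in a small reusable function. This is only a sketch under the same narrow assumptions (capitalized single-word items, comma-separated, a single-word shared head); the names `LIST_PATTERN` and `expand_shared_head` are made up for illustration. It removes a leading "and " as a prefix rather than with `str.strip('and ')`, because `str.strip` removes a set of characters and would, for example, turn "Urban" into "Urb".

```python
import re

# Hypothetical helper names, chosen for this sketch only.
LIST_PATTERN = re.compile(
    r"((?:[A-Z][a-z]+, )+"   # one or more "Word, " items
    r"(?:and )?"             # optional "and " before the last item
    r"[A-Z][a-z]+) "         # the last item in the list
    r"([A-Z][a-z]+)"         # the shared head word, e.g. "Engineering"
)

def expand_shared_head(text):
    """Return every 'item head' pair found in text, or [] if no list matches."""
    pairs = []
    for match in LIST_PATTERN.finditer(text):
        modifiers, head = match.group(1), match.group(2)
        for part in modifiers.split(","):
            # Remove only a leading "and ", never characters inside the word.
            modifier = re.sub(r"^and\s+", "", part.strip())
            if modifier:
                pairs.append(modifier + " " + head)
    return pairs

print(expand_shared_head("Civil, Mechanical, and Industrial Engineering majors are preferred."))
print(expand_shared_head("Urban, Rural, and Coastal Planning teams"))
```

It will not handle two-item lists without a comma ("Civil and Mechanical Engineering"), lowercase verb lists, or multi-word heads such as "test automation solutions" from the edited question; those would need a different pattern or a parser-based approach.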