Named entity recognition for pinyin

Time: 2021-01-14 14:36:16

Tags: pandas nlp named-entity-recognition named-entity-extraction deeppavlov

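I am trying to do named entity recognition, that is, to extract people, places, and so on, from pinyin (the romanization of Chinese characters).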

For example (from Wikipedia):

 "Jiang Zemin, Li Peng and Zhu Rongji led the nation in the 1990s. Under their administration, China's economic performance pulled an estimated 150 million peasants out of poverty and sustained an average annual gross domestic product growth rate of 11.2%.[125][better source needed][126][better source needed] The country joined the World Trade Organization in 2001, and maintained its high rate of economic growth under Hu Jintao and Wen Jiabao's leadership in the 2000s. However, the growth also severely impacted the country's resources and environment,[127][128] and caused major social displacement.[129][130]
Chinese Communist Party general secretary Xi Jinping has ruled since 2012 and has pursued large-scale efforts to reform China's economy [131][132] (which has suffered from structural instabilities and slowing growth),[133][134][135] and has also reformed the one-child policy and prison system,[136] as well as instituting a vast anti corruption crackdown.[137] In 2013, China initiated the Belt and Road Initiative, a global infrastructure investment project.[138] The COVID-19 pandemic broke out in Wuhan, Hubei in 2019.[139][140]"

From the text above, I would like to extract entities such as:

Jiang Zemin
Li Peng
Zhu Rongji
Hu Jintao
Wuhan
Hubei
etc...

NER on Chinese characters is complicated enough, but I am not aware of any method for extracting entities written in pinyin.

My current plan is to build every one- and two-syllable combination of the 1,300+ Chinese syllables, as follows:

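import numpy as np
import pandas as pd

# import data: one "pinyin character" pair per line, separated by a space
data = pd.read_csv('chinese_tones.txt', sep=" ", header=None)
data.columns = ["pinyin", "character"]

# convert: strip the tone digits (regex=True is required for '\d+' to be
# treated as a pattern in recent pandas versions)
data['pinyin'] = data['pinyin'].str.replace(r'\d+', '', regex=True)  # data doesn't have tones, which makes this harder
s = data['pinyin'].drop_duplicates().to_numpy()

# every ordered pair of syllables, concatenated (len(s) ** 2 strings)
combos = pd.Series(np.add.outer(s, s).ravel())

# combine to giant list: single syllables plus all two-syllable strings
all_pinyin = pd.Series(s.tolist() + combos.tolist())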

Then I plan to use something like .isin() to compare the text data against the pinyin list.
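A minimal sketch of what that .isin() comparison might look like. The toy syllable list, the sample sentence, and the capitalisation filter are assumptions made for illustration only; the real list would come from chinese_tones.txt.

```python
import re

import numpy as np
import pandas as pd

# Hypothetical toy syllable list standing in for the real 1,300+ syllables.
syllables = np.array(
    ["jiang", "ze", "min", "li", "peng", "wu", "han", "hu", "bei"], dtype=object
)

# One- and two-syllable strings, built the same way as in the plan above.
all_pinyin = pd.Series(
    syllables.tolist() + np.add.outer(syllables, syllables).ravel().tolist()
)

text = (
    "Jiang Zemin and Li Peng led the nation. "
    "The pandemic broke out in Wuhan, Hubei."
)

# Split the text into alphabetic tokens and keep only capitalised ones
# (an assumed heuristic: names are capitalised in running English text).
tokens = pd.Series(re.findall(r"[A-Za-z]+", text))
capitalised = tokens[tokens.str[0].str.isupper()]

# Compare the lower-cased tokens against the pinyin list.
matches = capitalised[capitalised.str.lower().isin(all_pinyin)].tolist()
print(matches)
```

This only flags candidate tokens; it does not join "Jiang" and "Zemin" into one name, and with the full syllable list ordinary English words that happen to be valid pinyin (for example "China" as chi + na, or "She") would also match, so some further filtering would be needed.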

Does anyone know of a better way to extract entities written in pinyin?

0 answers:

No answers