How can I use Pandas to replace a word in one column with its POS tag, based on another column?
For example, I have:
col1 col2
aaa1 AAa1 is a great friend
abb2 abb2 is a very good friend
I want this output:
NNP is a great friend
NN is a very good friend
I tried:
from nltk import pos_tag
columns = ['col1', 'col2']
data = pd.read_csv('data.csv', delimiter='\t', names=columns)
data["col2"] = data.apply(lambda x: x["col1"].replace(x["col1"], pos_tag([x["col2"], x["col1"]])[1][1]), axis=1)
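But it does not work, and it does not ignore case. My col1 is lowercase only, while col2 contains both lowercase and uppercase words. How can I apply re.sub here? I want to run this on every row (about 4 million rows).
Edit: If I try: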
data["col2"] = data.apply(lambda x: re.sub(x["col1"], pos_tag([x["col1"].lower(), x["col1"].lower()]), x["col2"], flags=re.I), axis=1)
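it does not work either, because I want the output to keep the original upper and lower case. My actual goal is to replace the real strings with NNP and NN for an SVM classifier.
Do you know how to do this?
Answer 0 (score: 1)
A simple function can do this. You can do the following: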
## import libraries
from nltk import word_tokenize, pos_tag, pos_tag_sents
## tokenize and tag each sentence; every cell becomes a list of (word, tag) pairs
df['col2'] = df['col2'].apply(word_tokenize).apply(pos_tag)
## replace the first token with its POS tag and keep the other tokens as words
def get_vals(lst):
    op = []
    for i, v in enumerate(lst):
        if i == 0:
            op.append(v[1])
        else:
            op.append(v[0])
    return ' '.join(op)
## apply the function
df['col2'] = df['col2'].apply(get_vals)
print(df)
col1 col2
0 aaa1 NNP is a great friend
1 abb2 NN is a very good friend
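On the re.sub part of the question: the attempt in the edit passes the return value of pos_tag (a list of tuples) as the replacement, while re.sub expects a string or a callable there. Below is a minimal sketch of a case-insensitive, whole-word replacement that leaves the rest of the sentence in its original case. The function name replace_word_with_tag is hypothetical, and the tag is passed in as a plain string, assuming it has already been looked up (for example from a pos_tag result).

```python
import re

def replace_word_with_tag(sentence, word, tag):
    """Replace whole-word, case-insensitive matches of `word` with `tag`."""
    # re.escape protects against regex metacharacters inside the word;
    # the lookarounds stop 'abb2' from matching inside 'xabb2y'.
    pattern = r"(?<!\w)" + re.escape(word) + r"(?!\w)"
    # A callable replacement inserts the tag literally (no backslash handling).
    return re.sub(pattern, lambda m: tag, sentence, flags=re.IGNORECASE)

print(replace_word_with_tag("AAa1 is a great friend", "aaa1", "NNP"))
print(replace_word_with_tag("Your MmM2 is my sentence", "mmm2", "NN"))
```

Only the matched word changes, so the surrounding words keep their upper and lower case.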
Updated solution:
This solution replaces the string with its POS tag at any position in the sentence.
import pandas as pd
df = pd.DataFrame({'col1':['aaa1','abb2','mtmb2','mmm2','bb2'],
                   'col2':['AAa1 is a great friend','abb2 is a very good friend','MTMB2 is a my sentence','Your MmM2 is my sentence','Your sentence is bb2']})
## import libraries
from nltk import word_tokenize, pos_tag, pos_tag_sents
## lowercase so tokens can be compared with col1 (the output is lowercased too), then tag the sentence
df['col2'] = df['col2'].str.lower().apply(word_tokenize).apply(pos_tag)
## all col1 values, used for the membership test below
vals = df['col1'].tolist()
## replace any token found in col1 with its POS tag and keep the other tokens as words
def get_vals(lst):
    op = []
    for i, v in enumerate(lst):
        if v[0] in vals:
            op.append(v[1])
        else:
            op.append(v[0])
    return ' '.join(op)
## apply the function
df['col3'] = df['col2'].apply(get_vals)
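The code above lowercases col2 before tagging, so the result loses the original case, and it checks each token against every value in col1 rather than only the value in the same row. Here is a minimal sketch of a per-row variant that compares in lowercase but keeps the original tokens. The name replace_row_word is hypothetical, and the (token, tag) pairs are hand-written stand-ins for pos_tag output so that the example runs without NLTK data; the tags a real tagger returns may differ.

```python
def replace_row_word(tagged, target):
    """Swap the token matching `target` (case-insensitive) for its POS tag.

    tagged: list of (token, tag) pairs, shaped like pos_tag output.
    target: the col1 value for the same row.
    """
    target = target.lower()
    out = []
    for token, tag in tagged:
        # Compare in lowercase, but emit the untouched original token otherwise.
        out.append(tag if token.lower() == target else token)
    return " ".join(out)

# Hand-written pairs (illustrative tags, not real tagger output).
tagged = [("Your", "PRP$"), ("MmM2", "NNP"), ("is", "VBZ"),
          ("my", "PRP$"), ("sentence", "NN")]
print(replace_row_word(tagged, "mmm2"))  # Your NNP is my sentence
```

With a DataFrame this could be applied row-wise, for instance df.apply(lambda r: replace_row_word(pos_tag(word_tokenize(r['col2'])), r['col1']), axis=1), although on about 4 million rows the tagging itself is likely to dominate the run time.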