Question

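How can I use Pandas to replace a word in one column with its POS tag, where the word to replace comes from another column?

For example, I have: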

col1      col2
aaa1      AAa1 is a great friend
abb2      abb2 is a very good friend

I want the output:

NNP is a great friend
NN is a very good friend

I tried:

from nltk import pos_tag
columns = ['col1', 'col2']
data = pd.read_csv('data.csv', delimiter='\t', names=columns)
data["col2"] = data.apply(lambda x: x["col1"].replace(x["col1"], pos_tag([x["col2"], x["col1"]])[1][1]), axis=1)

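But it doesn't work, and it doesn't ignore case. My col1 is lowercase only, while col2 contains both lowercase and uppercase words. How can I apply re.sub here? I want to use it on every row (about 4 million rows).

Edit: if I try: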

data["col2"] = data.apply(lambda x: re.sub(x["col1"], pos_tag([x["col1"].lower(), x["col1"].lower()]), x["col2"], flags=re.I), axis=1)

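It doesn't work, because I want the tag in the output to reflect the original upper/lowercase letters. My goal is this: I want to replace the actual strings with NNP and NN, for use in an SVM classifier.

Do you know how to do this?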

Answer 1

A simple function can do the job. You can do it like this:

## import libraries
from nltk import word_tokenize, pos_tag, pos_tag_sents

## tag the sentence
df['col2'] = df['col2'].apply(word_tokenize).apply(pos_tag)

## this function does the magic 
def get_vals(lst):
    op = [] 
    for i, v in enumerate(lst):
        if i == 0:
            op.append(v[1])
        else:
            op.append(v[0])
    return ' '.join(op)

## apply the function
df['col2'] = df['col2'].apply(get_vals)

print(df)

   col1                      col2
0  aaa1     NNP is a great friend
1  abb2  NN is a very good friend

Updated solution:

This solution replaces the string with its POS tag at any position in the sentence.

## import libraries
import pandas as pd
from nltk import word_tokenize, pos_tag

df = pd.DataFrame({'col1':['aaa1','abb2','mtmb2','mmm2','bb2'],
                   'col2':['AAa1 is a great friend','abb2 is a very good friend','MTMB2 is a my sentence','Your MmM2 is my sentence','Your sentence is bb2']})

## tag the sentence (lowercased so the tokens can be compared with col1)
df['col2'] = df['col2'].str.lower().apply(word_tokenize).apply(pos_tag)
vals = set(df['col1'])  ## a set keeps the membership test fast on millions of rows

## this function does the magic 
def get_vals(lst):
    op = [] 
    for i, v in enumerate(lst):
        if v[0] in vals:
            op.append(v[1])
        else:
            op.append(v[0])

    return ' '.join(op)

## apply the function
df['col3'] = df['col2'].apply(get_vals)
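
Note that the updated solution lowercases col2 before tagging and matches against every value in col1, not only the one on the same row. If the tag should reflect the original casing, as the question asks, one possible variation is to tag the untouched tokens and lowercase them only for the comparison. The sketch below is an illustration under assumptions, not part of the original answer: the helper names `replace_key_with_tag` and `replace_column` are hypothetical, the tagger is passed in as a parameter, and tokens are split on whitespace for simplicity.

```python
def replace_key_with_tag(key, sentence, tagger):
    """Replace tokens equal to `key` (ignoring case) with their POS tag.

    `tagger` is assumed to be a callable shaped like nltk.pos_tag: it takes
    a list of tokens and returns a list of (token, tag) pairs.
    """
    tokens = sentence.split()   # whitespace split; word_tokenize could be used instead
    tagged = tagger(tokens)     # tag the original casing, not a lowercased copy
    key = key.lower()
    # compare case-insensitively, but keep the other tokens exactly as written
    return " ".join(tag if tok.lower() == key else tok for tok, tag in tagged)


def replace_column(keys, sentences, tagger):
    """Row-wise version: each key is paired with the sentence on the same row."""
    return [replace_key_with_tag(k, s, tagger) for k, s in zip(keys, sentences)]
```

With NLTK and its tagger data installed, this could be used as `df['col3'] = replace_column(df['col1'], df['col2'], pos_tag)` after `from nltk import pos_tag`. The tags themselves depend on NLTK's model, so they may differ from the output shown earlier.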
