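Abbreviation: pick the first character of each non-stopword

Time: 2018-12-20 18:43:34

Tags: python python-3.x string pandas nltk

Given a list of stopwords and a DataFrame whose first column holds the full forms, as shown: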

import pandas as pd

stopwords = ['of', 'and', '&', 'com', 'org']
df = pd.DataFrame({'Full form': ['World health organization', 'Intellectual property', 'royal bank of canada']})
df

+---+---------------------------+
|   |         Full form         |
+---+---------------------------+
| 0 | World health organization |
| 1 | Intellectual property     |
| 2 | royal bank of canada      |
+---+---------------------------+

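I am looking for a way to add an adjacent column that holds the abbreviation of each full form, ignoring stopwords (if any).

Expected output: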

+---+---------------------------+----------------+
|   |         Full form         |   Abbreviation |
+---+---------------------------+----------------+
| 0 | World health organization |   WHO          |
| 1 | Intellectual property     |   IP           |
| 2 | royal bank of canada      |   RBC          |
+---+---------------------------+----------------+

3 answers:

Answer 0: (score: 2)

This should do it:

import pandas as pd

stopwords = ['of', 'and', '&', 'com', 'org']
df = pd.DataFrame({'Full form': ['World health organization', 'Intellectual property', 'royal bank of canada']})


def abbrev(t, stopwords=stopwords):
    return ''.join(u[0] for u in t.split() if u not in stopwords).upper()


df['Abbreviation'] = df['Full form'].apply(abbrev)

print(df)

Output

                   Full form Abbreviation
0  World health organization          WHO
1      Intellectual property           IP
2       royal bank of canada          RBC
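
Note that the membership test in `abbrev` is case-sensitive, so a capitalized stopword such as 'Of' would not be filtered out. Below is a minimal sketch of a case-insensitive variant; the names `abbrev_ci` and `STOPWORDS` are illustrative and do not come from the answer above.

```python
# Hypothetical variant of abbrev(): lowercases each word before the stopword check.
STOPWORDS = frozenset({'of', 'and', '&', 'com', 'org'})


def abbrev_ci(text, stopwords=STOPWORDS):
    # Split on whitespace, drop stopwords regardless of case,
    # then join the first character of every remaining word.
    return ''.join(w[0] for w in text.split() if w.lower() not in stopwords).upper()


print(abbrev_ci('Royal Bank Of Canada'))
```

It can be applied to the column the same way, for example `df['Full form'].apply(abbrev_ci)`.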

Answer 1: (score: 1)

Another approach:

print(y)

Answer 2: (score: 1)

Here is a regex solution:

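import re

stopwords = ['of', 'and', '&', 'com', 'org']
# Negative lookahead: reject positions where a whole stopword starts
stopwords_re = r"(?!" + r"\b|".join(stopwords) + r"\b)"
# Word boundary, not followed by a stopword, then a single word character
abbv_re = r"\b{}\w".format(stopwords_re)

def abbrv(s):
    return "".join(re.findall(abbv_re, s)).upper()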

[Output]:

>>> abbrv('royal bank of scotland')
'RBS'

Using it with pandas:

df['Abbreviation'] = df['Full form'].apply(abbrv)
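
As an alternative to `apply`, the same pattern can probably be run through the pandas string accessor. This is a sketch under the assumption that the column contains only strings; it rebuilds the pattern locally so that it is self-contained.

```python
import pandas as pd

stopwords = ['of', 'and', '&', 'com', 'org']
# Same pattern as in the answer: boundary, no stopword ahead, one word character
stopwords_re = r"(?!" + r"\b|".join(stopwords) + r"\b)"
abbv_re = r"\b{}\w".format(stopwords_re)

df = pd.DataFrame({'Full form': ['World health organization',
                                 'Intellectual property',
                                 'royal bank of canada']})

# findall yields a list of initials per row; join and upper-case them
df['Abbreviation'] = df['Full form'].str.findall(abbv_re).str.join('').str.upper()
print(df)
```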

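For a full explanation of the regex, see: https://regex101.com/r/3Q0XXF/1

In short,

  • \b{}\w: matches the single word character that follows a word boundary, i.e. the first character of each word
  • (?!of\b|and\b|&\b): a negative lookahead that skips the match when the word is in the stopword list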
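
The two pieces above can be checked in isolation. The following sketch makes two changes that are assumptions rather than part of the answer: it escapes each stopword with `re.escape` and groups the alternatives so that a single trailing `\b` applies to all of them. The name `abbrv_escaped` is illustrative.

```python
import re

stopwords = ['of', 'and', '&', 'com', 'org']

# (?!(?:of|and|...)\b) : fail when a complete stopword starts here.
# The trailing \b keeps "org" from blocking "organization".
lookahead = r"(?!(?:" + "|".join(re.escape(w) for w in stopwords) + r")\b)"
# \b ... \w : the first word character after each word boundary
pattern = re.compile(r"\b" + lookahead + r"\w")


def abbrv_escaped(s):
    # Collect the first letter of every non-stopword and upper-case the result
    return "".join(pattern.findall(s)).upper()


print(abbrv_escaped('royal bank of scotland'))
print(abbrv_escaped('World health organization'))
```

One difference from the `split()`-based answer: because `\b` also occurs inside hyphenated words, a string such as 'e-commerce' contributes two letters here rather than one.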