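python - Searching for and counting bigrams in a string (counting substring occurrences in a string)?

Time: 2017-03-15 02:37:34

Tags: python string text-mining n-gram

The goal is to get the number of bigram occurrences in a string. In other words, how do I get counts of substrings within a larger string?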

import pandas as pd

# Sample data with text
hi = {1: "My name is Lance John",
  2: "Am working at Savings Limited in Germany",
  3: "Have invested in mutual funds",
  4: "Savings Limited accepts mutual funds as investment option",
  5: "Savings Limited also accepts other investment option"}

# list(...) is needed on Python 3, where dict.items() returns a view
hi = pd.DataFrame(list(hi.items()), columns = ['id', 'notes'])
# have two categories with pre-defined words
name = ['Lance John', 'Germany']
finance = ['Savings Limited', 'investment option', 'mutual funds']

# want count of bigrams in each category for each record
# the output should look like this

ID name finance  
1    1    0  
2    1    2
3    0    1
4    0    3
5    0    2

2 answers:

Answer 0 (score: 0)

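This can be done with regular expressions. We often think of regular expressions as "magic" because they can do everything in a single function call.

I don't know whether a regular expression that looks for different words in different groups is more efficient than a more manual search, but it is certainly more efficient than a manual search in pure Python code, because the search runs as highly optimized bytecode in a tight loop.

So, if you had just one group, all you would need is a regular expression whose pattern is the words joined by the "or" (|) regexp operator; it will match any one of the words. You can use the regexp "finditer" method together with the collections.Counter data structure to tally the occurrences of each word: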

In [56]: test = "parrot parrot bicycle parrot inquisition bicycle parrot"

In [57]: expression = re.compile("parrot|bicycle|inquisition")

In [58]: Counter(match.group() for match in expression.finditer(test))
Out[58]: Counter({'parrot': 4, 'bicycle': 2, 'inquisition': 1})

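Now extend this concept: put each set of related terms in a regexp named group (a sub-pattern enclosed in parentheses, with a ?P<groupname> prefix right after the opening parenthesis; the angle brackets < > are literal and delimit the group name). Each group body is a sequence of words like the one above, and each group name is the name of your set. So: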

 expression = re.compile(r'(?P<finance>Savings\ Limited|investment\ option|mutual\ funds)|(?P<name>Lance\ John|Germany)')

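For the example you gave, this produces matches in the groups named finance and name. To tally them with Counter, we need to know which named group each match belongs to. The match object's groupdict method returns an entry for every named group, with None for the ones that did not match, so taking the first key of that dictionary is not reliable. The match object's lastgroup attribute gives the name of the group that actually matched (hi here is the original dict, not the DataFrame):

In [65]: Counter(m.lastgroup for m in expression.finditer(hi[1]))
Out[65]: Counter({'name': 1})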
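
As a self-contained illustration of the named-group idea, here is a minimal sketch for Python 3. It reuses the hard-coded pattern from above and record 2 from the question; reading the category from `m.lastgroup` is my choice here, since `groupdict()` holds a key for every named group whether or not it matched.

```python
import re
from collections import Counter

# Hard-coded pattern from the answer: one named group per category.
expression = re.compile(
    r"(?P<finance>Savings\ Limited|investment\ option|mutual\ funds)"
    r"|(?P<name>Lance\ John|Germany)"
)

text = "Am working at Savings Limited in Germany"

# groupdict() has an entry for every named group; unmatched ones are None.
first = expression.search(text)
print(first.groupdict())  # {'finance': 'Savings Limited', 'name': None}

# lastgroup is the name of the group that actually matched, so it can be
# fed straight into a Counter to get one count per category.
counts = Counter(m.lastgroup for m in expression.finditer(text))
print(counts)
```

Because the two groups are alternatives at the top level of the pattern, each match belongs to exactly one of them, which is what makes `lastgroup` usable as a category label.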

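Now all that is left is to build the expression programmatically instead of hard-coding it. This can be done with two nested "join" operations: an outer one that joins the groups, and an inner one that joins the terms within each group.

It would be more elegant to keep your terms in a dictionary rather than naming each set as an isolated variable, so that you have: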

 domains = {'finance': [...], 'names': [...]} 

The regular expression above can then be built like this:

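groups = []
# Iterate over (group name, list of terms) pairs
for groupname, terms in domains.items():
    # Inner join: alternatives within one group, escaped so they match literally
    term_group = "|".join(re.escape(term) for term in terms)
    groups.append(r"(?P<{}>{})".format(groupname, term_group))
# Outer join: alternatives between the named groups
expression = re.compile("|".join(groups))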

Then it is just a matter of running it over your data:

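data = []
# hi is the original dict of id -> text here, not the DataFrame
for key, textline in hi.items():
    # m.lastgroup is the name of the group that matched
    data.append((key, Counter(m.lastgroup for m in expression.finditer(textline))))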
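
Putting the steps above together, the following is a sketch of the whole approach on the question's data, under a few assumptions: the records are kept as a plain dict called `notes` (a name introduced here), the categories live in a `domains` dict as suggested above, and `build_expression` is a hypothetical helper, not part of any library.

```python
import re
from collections import Counter

# Records and categories from the question, kept as plain Python objects.
notes = {
    1: "My name is Lance John",
    2: "Am working at Savings Limited in Germany",
    3: "Have invested in mutual funds",
    4: "Savings Limited accepts mutual funds as investment option",
    5: "Savings Limited also accepts other investment option",
}
domains = {
    "name": ["Lance John", "Germany"],
    "finance": ["Savings Limited", "investment option", "mutual funds"],
}

def build_expression(domains):
    """Build one regex with a named group per category (hypothetical helper)."""
    groups = []
    for groupname, terms in domains.items():
        term_group = "|".join(re.escape(term) for term in terms)
        groups.append("(?P<{}>{})".format(groupname, term_group))
    return re.compile("|".join(groups))

expression = build_expression(domains)

# One row per record: (id, name count, finance count).
# Counter returns 0 for categories that never matched.
rows = []
for key, textline in notes.items():
    counts = Counter(m.lastgroup for m in expression.finditer(textline))
    rows.append((key, counts["name"], counts["finance"]))

print("ID name finance")
for row in rows:
    print("{}    {}    {}".format(*row))
```

For record 2 this gives a finance count of 1, since only `Savings Limited` occurs in that sentence; the table in the question lists 2 for that row. The other rows agree with the question's expected output.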

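(And, as a side note, see how unreadable it becomes if you try to build the regular expression with nested generator expressions instead:)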

 expression = re.compile("|".join("(?P<{0}>{1})".format(
      groupname,
      "|".join(
          "{}".format(
                  re.escape(term)) for term in domains[groupname]
           )
       ) for groupname in domains.keys() )
 )

Answer 1 (score: 0)

import numpy as np
import pandas as pd

hi = {1: "My name is Lance John. Lance John is senior marketing analyst",
      2: "Am working at Savings Limited in Germany",
      3: "Have invested in mutual funds",
      4: "Savings Limited accepts mutual funds as investment option",
      5: "Savings Limited also accepts other investment option"}

# list(...) is needed on Python 3, where dict.items() returns a view
hi = pd.DataFrame(list(hi.items()), columns = ['id', 'notes'])
name = ['Lance John', 'Germany', 'senior', 'working']
finance = ['Savings Limited', 'investment option', 'mutual funds']

def f(cell_value):
    # Occurrences of each term of the module-level `search` list in one cell
    return [cell_value.count(s) for s in search]

# f reads the global `search`, so it is rebound before each apply
search = name
df = hi['notes'].apply(f)

search = finance
df1 = hi['notes'].apply(f)

# count_nonzero counts how many distinct terms appear, not total occurrences
df2 = pd.DataFrame({'name': df.apply(np.count_nonzero), 'finance': df1.apply(np.count_nonzero), 'text': hi['notes']})

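I was able to solve the problem with the help of this link: "Counting appearances of multiple substrings in a cell pandas". I only modified the code to use count_nonzero instead of a direct sum, so that it counts unique appearances.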
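
To make the difference between the two counts concrete, here is a small sketch on record 1 from this answer. It uses plain Python as a stand-in for `np.count_nonzero`; the variable names are introduced here for illustration only.

```python
# Text of record 1 and the `name` terms from this answer.
text = "My name is Lance John. Lance John is senior marketing analyst"
name = ["Lance John", "Germany", "senior", "working"]

# str.count returns the number of non-overlapping occurrences of each term.
per_term = [text.count(term) for term in name]

total_occurrences = sum(per_term)               # every occurrence counts
distinct_terms = sum(1 for c in per_term if c)  # each term counts at most once

print(per_term)           # [2, 0, 1, 0]
print(total_occurrences)  # 3
print(distinct_terms)     # 2
```

A direct sum reports 3 because "Lance John" appears twice, whereas the non-zero count reports 2, the number of distinct terms that appear at least once.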