如何在python中给定文本中的给定单词之前和之后找到最常用的单词?

时间:2013-11-16 16:00:30

标签: python-2.7 nlp nltk n-gram

我有一个大文本,我试图在本文中给定单词之前和之后最常出现单词出现。

例如:

我想知道“湖”之后最常出现的词是什么。 Idealy会得到类似的东西:(单词1,#occurrence),(单词2,#occurrence),...

以前的话语也是如此......

我尝试了NLTK bigran,但它似乎只能找到最常见的n-grans ...是否有可能以某种方式修复其中一个单词并找到基于固定单词的最常见的n-grans?

感谢您的帮助!!

2 个答案:

答案 0 :(得分:2)

你在找这样的东西吗?

text = """
A lake is a body of relatively still water of considerable size, localized in a basin, that is surrounded by land apart from a river, stream, or other form of moving water that serves to feed or drain the lake. Lakes are inland and not part of the ocean and therefore are distinct from lagoons, and are larger and deeper than ponds.[1][2] Lakes can be contrasted with rivers or streams, which are usually flowing. However most lakes are fed and drained by rivers and streams.
Natural lakes are generally found in mountainous areas, rift zones, and areas with ongoing glaciation. Other lakes are found in endorheic basins or along the courses of mature rivers. In some parts of the world there are many lakes because of chaotic drainage patterns left over from the last Ice Age. All lakes are temporary over geologic time scales, as they will slowly fill in with sediments or spill out of the basin containing them.
Many lakes are artificial and are constructed for industrial or agricultural use, for hydro-electric power generation or domestic water supply, or for aesthetic or recreational purposes.
Etymology, meaning, and usage of "lake"[edit]
Oeschinen Lake in the Swiss Alps
Lake Tahoe on the border of California and Nevada
The Caspian Sea is either the world's largest lake or a full-fledged sea.[3]
The word lake comes from Middle English lake ("lake, pond, waterway"), from Old English lacu ("pond, pool, stream"), from Proto-Germanic *lakō ("pond, ditch, slow moving stream"), from the Proto-Indo-European root *leǵ- ("to leak, drain"). Cognates include Dutch laak ("lake, pond, ditch"), Middle Low German lāke ("water pooled in a riverbed, puddle"), German Lache ("pool, puddle"), and Icelandic lækur ("slow flowing stream"). Also related are the English words leak and leach.
There is considerable uncertainty about defining the difference between lakes and ponds, and no current internationally accepted definition of either term across scientific disciplines or political boundaries exists.[4] For example, limnologists have defined lakes as water bodies which are simply a larger version of a pond, which can have wave action on the shoreline or where wind-induced turbulence plays a major role in mixing the water column. None of these definitions completely excludes ponds and all are difficult to measure. For this reason there has been increasing use made of simple size-based definitions to separate ponds and lakes. One definition of lake is a body of water of 2 hectares (5 acres) or more in area;[5]:331[6] however, others[who?] have defined lakes as waterbodies of 5 hectares (12 acres) and above,[citation needed] or 8 hectares (20 acres) and above[citation needed] (see also the definition of "pond"). Charles Elton, one of the founders of ecology, regarded lakes as waterbodies of 40 hectares (99 acres) or more.[7] The term lake is also used to describe a feature such as Lake Eyre, which is a dry basin most of the time but may become filled under seasonal conditions of heavy rainfall. In common usage many lakes bear names ending with the word pond, and a lesser number of names ending with lake are in quasi-technical fact, ponds. One textbook illustrates this point with the following: "In Newfoundland, for example, almost every lake is called a pond, whereas in Wisconsin, almost every pond is called a lake."[8]
One hydrology book proposes to define it as a body of water with the following five chacteristics:[4]
it partially or totally fills one or several basins connected by straits[4]
has essentially the same water level in all parts (except for relatively short-lived variations caused by wind, varying ice cover, large inflows, etc.)[4]
it does not have regular intrusion of sea water[4]
a considerable portion of the sediment suspended in the water is captured by the basins (for this to happen they need to have a sufficiently small inflow-to-volume ratio)[4]
the area measured at the mean water level exceeds an arbitrarily chosen threshold (for instance, one hectare)[4]
With the exception of the sea water intrusion criterion, the other ones have been accepted or elaborated upon by other hydrology publications.[9][10]
""".split()

from nltk import bigrams

bgs = bigrams(text)
lake_bgs = filter(lambda item: item[0] == 'lake', bgs)

from collections import Counter
c = Counter(map(lambda item: item[1], lake_bgs))
print c.most_common()

哪个输出:

[('is', 4), ('("lake,', 1), ('or', 1), ('comes', 1), ('are', 1)]

请注意,如果您的文字很长,则可能需要使用ifilter, imap, etc...

编辑:以下是'lake'之后之前的代码。

from nltk import trigrams

tgs = trigrams(text)
lake_tgs = filter(lambda item: item[1] == 'lake', tgs)

from collections import Counter

before_lake = map(lambda item: item[0], lake_tgs)
after_lake = map(lambda item: item[2], lake_tgs)

c = Counter(before_lake + after_lake)
print c.most_common()

请注意,这也可以使用bigrams完成:)

答案 1 :(得分:2)

只是为了添加@Ohad的答案,这是NLTK中的ngram实现,具有一定的可扩展性。

#-*- coding: utf8 -*-

import string
from nltk import ngrams
from itertools import chain
from collections import Counter

text = """
A lake is a body of relatively still water of considerable size, localized in a basin, that is surrounded by land apart from a river, stream, or other form of moving water that serves to feed or drain the lake. Lakes are inland and not part of the ocean and therefore are distinct from lagoons, and are larger and deeper than ponds.[1][2] Lakes can be contrasted with rivers or streams, which are usually flowing. However most lakes are fed and drained by rivers and streams.
Natural lakes are generally found in mountainous areas, rift zones, and areas with ongoing glaciation. Other lakes are found in endorheic basins or along the courses of mature rivers. In some parts of the world there are many lakes because of chaotic drainage patterns left over from the last Ice Age. All lakes are temporary over geologic time scales, as they will slowly fill in with sediments or spill out of the basin containing them.
Many lakes are artificial and are constructed for industrial or agricultural use, for hydro-electric power generation or domestic water supply, or for aesthetic or recreational purposes.
Etymology, meaning, and usage of "lake"[edit]
Oeschinen Lake in the Swiss Alps
Lake Tahoe on the border of California and Nevada
The Caspian Sea is either the world's largest lake or a full-fledged sea.[3]
The word lake comes from Middle English lake ("lake, pond, waterway"), from Old English lacu ("pond, pool, stream"), from Proto-Germanic *lakō ("pond, ditch, slow moving stream"), from the Proto-Indo-European root *leǵ- ("to leak, drain"). Cognates include Dutch laak ("lake, pond, ditch"), Middle Low German lāke ("water pooled in a riverbed, puddle"), German Lache ("pool, puddle"), and Icelandic lækur ("slow flowing stream"). Also related are the English words leak and leach.
There is considerable uncertainty about defining the difference between lakes and ponds, and no current internationally accepted definition of either term across scientific disciplines or political boundaries exists.[4] For example, limnologists have defined lakes as water bodies which are simply a larger version of a pond, which can have wave action on the shoreline or where wind-induced turbulence plays a major role in mixing the water column. None of these definitions completely excludes ponds and all are difficult to measure. For this reason there has been increasing use made of simple size-based definitions to separate ponds and lakes. One definition of lake is a body of water of 2 hectares (5 acres) or more in area;[5]:331[6] however, others[who?] have defined lakes as waterbodies of 5 hectares (12 acres) and above,[citation needed] or 8 hectares (20 acres) and above[citation needed] (see also the definition of "pond"). Charles Elton, one of the founders of ecology, regarded lakes as waterbodies of 40 hectares (99 acres) or more.[7] The term lake is also used to describe a feature such as Lake Eyre, which is a dry basin most of the time but may become filled under seasonal conditions of heavy rainfall. In common usage many lakes bear names ending with the word pond, and a lesser number of names ending with lake are in quasi-technical fact, ponds. One textbook illustrates this point with the following: "In Newfoundland, for example, almost every lake is called a pond, whereas in Wisconsin, almost every pond is called a lake."[8]
One hydrology book proposes to define it as a body of water with the following five chacteristics:[4]
it partially or totally fills one or several basins connected by straits[4]
has essentially the same water level in all parts (except for relatively short-lived variations caused by wind, varying ice cover, large inflows, etc.)[4]
it does not have regular intrusion of sea water[4]
a considerable portion of the sediment suspended in the water is captured by the basins (for this to happen they need to have a sufficiently small inflow-to-volume ratio)[4]
the area measured at the mean water level exceeds an arbitrarily chosen threshold (for instance, one hectare)[4]
With the exception of the sea water intrusion criterion, the other ones have been accepted or elaborated upon by other hydrology publications.[9][10]
"""

def ngrammer(txt, n):
    # Removes punctuations and numbers.
    sentences = "".join([i for i in txt if i not in string.punctuation and not i.isdigit()]).split('\n')
    return list(chain(*[ngrams(i.split(), n) for i in sentences]))

def before_after(ngs, word):
    word_grams = filter(lambda item: item[1] == word, ngs)
    before = map(lambda item: item[0], ngs)
    after = map(lambda item: item[2], ngs)
    return before, after

bgs = ngrammer(text,2) # bigrams
tgs = ngrammer(text,3) # trigrams
xgs = ngrammer(text,10) # 10grams

focus = 'lake'
bf, af = before_after(xgs, focus)
c = Counter(bf+af)

# Most common word before and after 'lake' from the 10grams.
print c.most_common()[0]