Question

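I am new to Python (2.7), so please forgive me if this is a very basic question. I would like to (i) extract all words ending in -ing from a text that has been tokenized with the NLTK library, and (ii) extract the 10 words preceding each word extracted this way. I would then like to (iii) save the result to a file as a two-column data frame, perhaps something like: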

Word        PreviousContext 
starting    stood a moment, as if in a troubled reverie; then
seeming     of it retraced our steps. But Elijah passed on, without
purchasing  a sharp look-out upon the hands: Bildad did all the

I know how to do (i), but I am not sure how to go about (ii) and (iii). Any help would be greatly appreciated and acknowledged. So far I have:

>>> import bs4 
>>> import nltk
>>> from nltk import word_tokenize
>>> url = "http://www.gutenberg.org/files/766/766-h/766-h.htm"
>>> import urllib
>>> response = urllib.urlopen(url)
>>> raw = response.read().decode('utf8')
>>> tokens = word_tokenize(raw)
>>> for w in tokens:
...     if w.endswith("ing"):
...             print(w)
... 
padding
padding
encoding
having
heading
wearying
dismissing
going
nothing
reading etc etc etc..

Answer 1

After the line:

>>> tokens = word_tokenize(raw)

use the following code to generate the words with their context:

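>>> context={}
>>> for i,w in enumerate(tokens):
...      if w.endswith("ing"):
...         # Slicing never raises, so no try/except is needed; near the end
...         # of the text the slice is simply shorter than 10 tokens.
...         # Note: tokens[i:i+10] is the word itself plus the 9 tokens that
...         # follow it; use tokens[max(0, i-10):i] for the 10 preceding tokens.
...         context[w]=tokens[i:i+10]
... 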
>>> fp=open('dataframes','w')   # save results in this file
>>> fp.write('Word'+'\t\t'+'PreviousContext\n')
>>> for word in context:
...    fp.write(word+'\t\t'+' '.join(context[word])+'\n')
... 
>>> fp.close()
>>> fp=open('dataframes','r')  
>>> for line in fp.readlines()[:10]: # first 10 lines of generated file
...    print line
... 
Word                PreviousContext
raining             raining , and I saw more fog and mud in
bidding             bidding him good night , if he were yet sitting
growling            growling old Scotch Croesus with great flaps of ears ?
bright-looking      bright-looking bride , I believe ( as I could not
hanging             hanging up in the shop&mdash ; went down to look
scheming            scheming and devising opportunities of being alone with her .
muffling            muffling her hands in it , in an unsettled and
bestowing           bestowing them on Mrs. Gummidge. She was with him all
adorning            adorning , the perfect simplicity of his manner , brought

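Two points to note:

NLTK treats punctuation marks as separate tokens, so each punctuation mark counts as a word of its own.
I used a dictionary to store the words with their context, so the order of the words is not preserved, but every distinct word is guaranteed to be present with a context. A word that occurs more than once keeps only the context of its last occurrence, because later entries overwrite earlier ones.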
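
Both caveats, and the direction of the context, can be addressed with a small variation. The following is a minimal sketch rather than the code that produced the output above: it stores (word, context) pairs in a list, so order and repeated words are kept, and it slices the tokens that come before each match. The function name `ing_contexts`, the `width` parameter, and the sample sentence are assumptions made for illustration.

```python
def ing_contexts(tokens, width=10):
    """Return (word, preceding_tokens) pairs for every token ending in 'ing'."""
    rows = []
    for i, w in enumerate(tokens):
        if w.endswith("ing"):
            # max(0, ...) keeps the slice from wrapping around when the word
            # appears fewer than `width` tokens into the text.
            start = max(0, i - width)
            rows.append((w, tokens[start:i]))
    return rows


# Hypothetical sample; in the question this would be word_tokenize(raw).
sample = "he stood a moment , as if in a troubled reverie ; then starting".split()
rows = ing_contexts(sample)
for word, context in rows:
    print(word + "\t" + " ".join(context))
```

Because the pairs live in a list, they can be written to the file in text order with the same `fp.write` loop used above.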

Answer 2

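If you are asking how to do this algorithmically, I would start by always keeping a queue of the previous 10 words, together with a data frame whose first column holds the words ending in 'ing' and whose second column holds the 10 words preceding the corresponding word in the first column.

At the start of the program the queue is empty, and for the first 10 words you simply enqueue each word. After that, each time before moving forward in the loop, enqueue the current word and dequeue the oldest one, so that the queue always has size 10.

At each iteration, check whether the current word ends in 'ing'. If it does, add a row to the data frame in which the word is the first item and the current state of the queue is the second item.

In the end you should have a data frame whose first column contains the words ending in 'ing' and whose second column contains the 10 words preceding each of them.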
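
The queue idea can be sketched with `collections.deque`, whose `maxlen` argument discards the oldest item automatically once the queue is full. This is one possible reading of the description, not a definitive implementation; the function name `ing_with_queue` and the sample tokens are assumptions made for illustration.

```python
from collections import deque


def ing_with_queue(tokens, width=10):
    """Collect (word, previous_words) pairs using a fixed-size queue."""
    previous = deque(maxlen=width)  # holds at most `width` most recent tokens
    rows = []
    for w in tokens:
        if w.endswith("ing"):
            # Snapshot the queue before enqueuing w, so the context holds
            # only the words that precede w.
            rows.append((w, list(previous)))
        previous.append(w)  # once full, this also drops the oldest token
    return rows


# Hypothetical sample: twelve filler tokens followed by one match.
tokens = ["w%d" % i for i in range(12)] + ["sing"]
rows = ing_with_queue(tokens)
```

The list of pairs can then be turned into a data frame, for example by joining each context into a string and passing the rows to `pandas.DataFrame(rows, columns=['Word', 'PreviousContext'])`.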

Answer 3

Assuming you have all the words in a list called words:

>>> words
['abc', 'def', 'gdi', 'asd', 'ew', 'd', 'ew', 'fdsa', 'dsa', 'aing', 'e', 'f', 'dsa', 'fe', 'dfa', 'e', 'd', 'fe', 'asd', 'fe', 'ting']

I would put them into a Series and grab the indices of the relevant words:

import pandas
words = pandas.Series(words)
# index labels of the entries that end in 'ing'
idx = pandas.Series(words[words.apply(lambda x: x.endswith('ing'))].index)
>>> idx
0     9
1    20
dtype: int64

Now the values of idx are the indices of the words ending in 'ing' in our original Series. Next, we need to turn these values into ranges:

starts = idx - 10
ends = idx

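Now we can use these ranges to index into the original Series (first clipping the lower bound to 0, in case an 'ing' word appears fewer than 10 words into the list):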

starts = starts.clip(0)  # never start before the beginning of the list
df = pandas.DataFrame([{
    'word': words[e],
    'Previous': ' '.join(words[s:e])} for s, e in zip(starts, ends)])
>>> df
                           Previous  word
0  abc def gdi asd ew d ew fdsa dsa  aing
1      e f dsa fe dfa e d fe asd fe  ting

Not exactly a one-liner, but it does work.

Note that 'aing' only has 9 words in its Previous column because it appears too early in the fake list I made up.
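
The steps above can also be gathered into one function. The sketch below is an illustration under assumptions, not the answer's original code: the name `ing_frame`, the `width` parameter, and the column names (taken from the question's desired layout) are hypothetical choices, and it uses positional `.iloc` slicing on the assumption that the Series has the default 0-based index.

```python
import pandas as pd


def ing_frame(words, width=10):
    """Build a two-column frame of 'ing' words and their preceding words."""
    s = pd.Series(words)
    hits = s[s.str.endswith("ing")]  # index labels double as positions here
    rows = []
    for pos, word in hits.items():
        start = max(0, pos - width)  # clip at the start of the list
        rows.append({"Word": word,
                     "PreviousContext": " ".join(s.iloc[start:pos])})
    # Passing `columns` fixes the column order explicitly.
    return pd.DataFrame(rows, columns=["Word", "PreviousContext"])


words = ['abc', 'def', 'gdi', 'asd', 'ew', 'd', 'ew', 'fdsa', 'dsa', 'aing',
         'e', 'f', 'dsa', 'fe', 'dfa', 'e', 'd', 'fe', 'asd', 'fe', 'ting']
df = ing_frame(words)
```

To cover step (iii) of the question, the frame could then be saved with something like `df.to_csv('contexts.tsv', sep='\t', index=False)`, where the file name is only a placeholder.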

Extracting a word and its preceding 10-word context into a data frame in Python
