Extracting a word and its 10 preceding words of context into a data frame in Python

Time: 2014-11-14 18:13:34

Tags: python extract

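I am new to Python (2.7), so please forgive me if this is a very straightforward question. I want to (i) extract all words ending in -ing from a text that has been tokenized with the NLTK library, and (ii) extract the 10 words preceding each word extracted this way. I then want to (iii) save the results as a two-column data frame, which might look something like this: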

Word        PreviousContext 
starting    stood a moment, as if in a troubled reverie; then
seeming     of it retraced our steps. But Elijah passed on, without
purchasing  a sharp look-out upon the hands: Bildad did all the

I know how to do (i), but not how to go about (ii) and (iii). Any help would be greatly appreciated and acknowledged. So far I have:

>>> import bs4 
>>> import nltk
>>> from nltk import word_tokenize
>>> url = "http://www.gutenberg.org/files/766/766-h/766-h.htm"
>>> import urllib
>>> response = urllib.urlopen(url)
>>> raw = response.read().decode('utf8')
>>> tokens = word_tokenize(raw)
>>> for w in tokens:
...     if w.endswith("ing"):
...             print(w)
... 
padding
padding
encoding
having
heading
wearying
dismissing
going
nothing
reading etc etc etc.. 

3 answers:

Answer 0 (score: 1)

After the line:

>>> tokens = word_tokenize(raw)

use the following code to generate the words with their context:

>>> context={}
>>> for i,w in enumerate(tokens):
...      if w.endswith("ing"):
...         # This slice is the word itself plus the 9 tokens after it.
...         # Slicing never raises near the end of the list, so no try/except is needed.
...         context[w]=tokens[i:i+10]
... 
>>> fp=open('dataframes','w')   # save results in this file
>>> fp.write('Word'+'\t\t'+'PreviousContext\n')
>>> for word in context:
...    fp.write(word+'\t\t'+' '.join(context[word])+'\n')
... 
>>> fp.close()
>>> fp=open('dataframes','r')  
>>> for line in fp.readlines()[:10]: # first 10 lines of generated file
...    print line
... 
Word                PreviousContext
raining             raining , and I saw more fog and mud in
bidding             bidding him good night , if he were yet sitting
growling            growling old Scotch Croesus with great flaps of ears ?
bright-looking      bright-looking bride , I believe ( as I could not
hanging             hanging up in the shop&mdash ; went down to look
scheming            scheming and devising opportunities of being alone with her .
muffling            muffling her hands in it , in an unsettled and
bestowing           bestowing them on Mrs. Gummidge. She was with him all
adorning            adorning , the perfect simplicity of his manner , brought

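Two points to note:

  1. nltk treats punctuation marks as separate tokens, so they are counted as separate words.
  2. I used a dictionary to store the words with their context, so the order of the words is not preserved. Every distinct word is guaranteed to be present with a context, although a word that occurs more than once keeps only the context of its last occurrence.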
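
The slice tokens[i:i+10] in the code above takes the matched word and the tokens that follow it. If the goal is the ten tokens before each match, with repeated words and document order kept, a list of rows may fit better than a dictionary. The following is a minimal sketch rather than the answerer's code: preceding_context and its parameters are hypothetical names, and the sample tokens stand in for the output of word_tokenize.

```python
def preceding_context(tokens, suffix="ing", window=10):
    """Return (word, context) rows for every token ending in `suffix`.

    `context` is the up-to-`window` tokens immediately before the word,
    joined with spaces. Function and parameter names are illustrative.
    """
    rows = []
    for i, tok in enumerate(tokens):
        if tok.endswith(suffix):
            start = max(0, i - window)  # clamp so early matches get a shorter context
            rows.append((tok, " ".join(tokens[start:i])))
    return rows


# Stand-in for word_tokenize(raw); any list of strings works.
sample_tokens = "But Elijah passed on , without seeming to notice us".split()
rows = preceding_context(sample_tokens, window=3)
for word, context in rows:
    print(word + "\t" + context)
```

The rows can then be passed to pandas.DataFrame(rows, columns=['Word', 'PreviousContext']) to get the two-column frame described in the question.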

Answer 1 (score: 0)

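If you are asking how to do this algorithmically, I would start by maintaining a queue of the previous 10 words, together with a data frame whose first column holds the words ending in 'ing' and whose second column holds the 10 words preceding the corresponding word in the first column.

At the start of the program the queue is empty, and for the first 10 words you simply enqueue each word. After that, before each step forward in the loop, enqueue the current word and dequeue one word, so that the queue stays at size 10.

This way, at each iteration you check whether the word ends in 'ing'. If it does, add a row to the data frame with the word as the first item and the current state of the queue as the second.

At the end you should have a data frame whose first column holds the words ending in 'ing' and whose second column holds the 10 words preceding each of them.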
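
The queue described above can be sketched with collections.deque, whose maxlen argument drops the oldest item automatically once the queue is full. This is only an illustration under assumptions: ing_rows_with_queue is a hypothetical name, and the rows are collected in a plain list instead of a data frame.

```python
from collections import deque


def ing_rows_with_queue(tokens, window=10):
    """Scan tokens once, keeping the last `window` tokens in a bounded queue."""
    recent = deque(maxlen=window)  # oldest token is discarded when the queue is full
    rows = []
    for tok in tokens:
        if tok.endswith("ing"):
            # Snapshot the queue before the current word is added to it.
            rows.append((tok, " ".join(recent)))
        recent.append(tok)
    return rows


demo = ing_rows_with_queue(
    "a sharp look-out upon the hands while purchasing".split(), window=4
)
print(demo)
```

Because the loop never indexes into tokens, the same function also works on a generator that yields tokens one at a time.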

Answer 2 (score: 0)

Suppose you have all your words in a list called words:

>>> words
['abc', 'def', 'gdi', 'asd', 'ew', 'd', 'ew', 'fdsa', 'dsa', 'aing', 'e', 'f', 'dsa', 'fe', 'dfa', 'e', 'd', 'fe', 'asd', 'fe', 'ting']

I would put them into a Series and grab the indices of the relevant words:

import pandas
words = pandas.Series(words)
idx = pandas.Series(words[words.apply(lambda x: x.endswith('ing'))].index)
>>> idx
0     9
1    20
dtype: int64

Now the values of idx are the indices of the words ending in 'ing' in our original Series. Next, we need to turn these values into ranges:

starts = idx - 10
ends = idx

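Now we can index the original Series with these ranges (first clipping the lower bound to 0, in case an 'ing' word appears fewer than 10 words into the list):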

starts = starts.clip(0)
df = pandas.DataFrame([{
    'word': words[e], 
    'Previous': ' '.join(words[s:e])} for s,e in zip(starts,ends)])
>>> df
                           Previous  word
0  abc def gdi asd ew d ew fdsa dsa  aing
1      e f dsa fe dfa e d fe asd fe  ting

Not exactly a one-liner, but it works.

Note that 'aing' has only 9 words in its column because it appears too early in the fake list I made up.