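I'm new to Python (2.7), so please forgive me if this is a very straightforward question. I would like to (i) extract all words ending in -ing from a text that has been tokenized with the NLTK library, and (ii) extract the 10 words preceding each word so extracted. I would then like to (iii) save these to a file as a two-column data frame, which might look something like: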
Word PreviousContext
starting stood a moment, as if in a troubled reverie; then
seeming of it retraced our steps. But Elijah passed on, without
purchasing a sharp look-out upon the hands: Bildad did all the
I know how to do (i), but I'm not sure how to go about (ii)-(iii). Any help would be greatly appreciated and acknowledged. So far I have:
>>> import bs4
>>> import nltk
>>> from nltk import word_tokenize
>>> url = "http://www.gutenberg.org/files/766/766-h/766-h.htm"
>>> import urllib
>>> response = urllib.urlopen(url)
>>> raw = response.read().decode('utf8')
>>> tokens = word_tokenize(raw)
>>> for w in tokens:
...     if w.endswith("ing"):
...         print(w)
...
padding
padding
encoding
having
heading
wearying
dismissing
going
nothing
reading etc etc etc..
Answer 0: (score: 1)
After the line:
>>> tokens = word_tokenize(raw)
use the following code to generate the words with their context:
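>>> context = {}
>>> for i, w in enumerate(tokens):
...     if w.endswith("ing"):
...         # Note: tokens[i:i+10] is the word itself plus the 9 tokens that follow it,
...         # not the 10 preceding tokens; tokens[max(0, i-10):i] would give those.
...         # Slicing never raises near the end of the list, so no try/except is needed.
...         # context is a dict, so a repeated word keeps only its last occurrence.
...         context[w] = tokens[i:i+10]
...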
>>> fp = open('dataframes', 'w')  # save results in this file
>>> fp.write('Word' + '\t\t' + 'PreviousContext\n')
>>> for word in context:
...     fp.write(word + '\t\t' + ' '.join(context[word]) + '\n')
...
>>> fp.close()
>>> fp = open('dataframes', 'r')
>>> for line in fp.readlines()[:10]:  # first 10 lines of generated file
...     print line
...
Word PreviousContext
raining raining , and I saw more fog and mud in
bidding bidding him good night , if he were yet sitting
growling growling old Scotch Croesus with great flaps of ears ?
bright-looking bright-looking bride , I believe ( as I could not
hanging hanging up in the shop&mdash ; went down to look
scheming scheming and devising opportunities of being alone with her .
muffling muffling her hands in it , in an unsettled and
bestowing bestowing them on Mrs. Gummidge. She was with him all
adorning adorning , the perfect simplicity of his manner , brought
There are two things to note:
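A possible variation, not taken from the answer itself: the loop in this answer stores `tokens[i:i+10]`, which is the matched word plus the tokens after it, and it keys a dict by word, so repeated words collapse into one entry. The sketch below collects the 10 tokens before each match instead and keeps every occurrence. The names `ing_contexts` and `to_tsv` are hypothetical helpers introduced here for illustration, and the sample sentence is made up from the question's first example row.

```python
def ing_contexts(tokens, width=10, suffix="ing"):
    """Return a (word, preceding_context) pair for every token ending in suffix."""
    rows = []
    for i, w in enumerate(tokens):
        if w.endswith(suffix):
            # max(0, ...) stops a negative start index from wrapping
            # around to the end of the list for early matches.
            start = max(0, i - width)
            rows.append((w, " ".join(tokens[start:i])))
    return rows


def to_tsv(rows):
    """Render the pairs as tab-separated text with a header line."""
    lines = ["Word\tPreviousContext"]
    lines.extend(word + "\t" + context for word, context in rows)
    return "\n".join(lines) + "\n"


# Made-up sample standing in for the output of word_tokenize(raw).
sample = "He stood a moment , as if in a troubled reverie ; then starting".split()
rows = ing_contexts(sample)
table = to_tsv(rows)
```

Writing `table` to a file covers step (iii); a list of pairs rather than a dict is used so that a word occurring several times gets one row per occurrence.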
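Answer 1: (score: 0)
If you're asking how to do this algorithmically, I would start by maintaining a queue of the previous 10 words at all times, along with a data frame whose first column holds the words ending in 'ing' and whose second column holds the 10 words preceding the corresponding word in the first column.
So at the start of the program the queue is empty, and for the first 10 words you simply enqueue each word. After that, each time before moving forward in the loop, enqueue the current word and dequeue one word (making sure the queue stays at size 10).
That way, on each iteration you check whether the word ends in 'ing'. If it does, add a row to the data frame whose first item is the word and whose second item is the current state of the queue.
At the end you should have a data frame whose first column contains the words ending in 'ing' and whose second column contains the 10 words preceding each of them.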
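The queue idea described above might be sketched as follows. This is only one way to read the description: it assumes `collections.deque` with `maxlen` as the fixed-size queue (so the dequeue step happens automatically), collects plain tuples instead of data frame rows, and uses a hypothetical function name, `ing_with_window`, plus a made-up sample sentence.

```python
from collections import deque


def ing_with_window(tokens, width=10, suffix="ing"):
    """Pair each token ending in suffix with the tokens seen just before it."""
    window = deque(maxlen=width)  # the oldest token is dropped automatically
    rows = []
    for w in tokens:
        if w.endswith(suffix):
            # Snapshot the queue before adding w, so a word is never
            # part of its own context.
            rows.append((w, list(window)))
        window.append(w)
    return rows


# Made-up sample with a window of 3 to keep the result easy to read.
sample = "the cat was sitting and purring".split()
rows = ing_with_window(sample, width=3)
```

Because the loop only needs the current token and the queue, it also works on a generator of tokens, not just a list. If a data frame is wanted at the end, the pairs could be passed to something like `pandas.DataFrame(rows, columns=['Word', 'PreviousContext'])`.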
Answer 2: (score: 0)
Assuming you have all your words in a list:
>>> words
['abc', 'def', 'gdi', 'asd', 'ew', 'd', 'ew', 'fdsa', 'dsa', 'aing', 'e', 'f', 'dsa', 'fe', 'dfa', 'e', 'd', 'fe', 'asd', 'fe', 'ting']
I would put them into a Series and grab the indices of the relevant words:
import pandas
words = pandas.Series(words)
idx = pandas.Series(words[words.apply(lambda x: x.endswith('ing'))].index)
>>> idx
0 9
1 20
dtype: int64
Now the values of idx are the indices of the words ending in 'ing' in our original Series. Next, we need to turn these values into ranges:
starts = idx - 10
ends = idx
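Now we can index the original Series with these ranges (first clipping the lower bound at 0, in case an 'ing' word appears fewer than 10 words into the list):
starts = starts.clip(0)
df = pandas.DataFrame([{
    'word': words[e],
    'Previous': ' '.join(words[s:e])} for s, e in zip(starts, ends)])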
>>> df
Previous word
0 abc def gdi asd ew d ew fdsa dsa aing
1 e f dsa fe dfa e d fe asd fe ting
Not exactly a one-liner, but it works.
Note that 'aing' has only 9 words in its corresponding column because it appears too early in the fake list I made up.