Python pandas: text in a DataFrame, getting the 3 words before and after a word

Time: 2016-12-13 17:39:18

Tags: python pandas text word

I am working in a Jupyter notebook and have a pandas DataFrame called "data":

Question_ID | Customer_ID | Answer
      1           234         Data is very important to use because ... 
      2           234         We value data since we need it ... 

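I want to go through the "Answer" column and get the three words before and after the word "data". So in this case I would get "is very important"; "We value"; "since we need".

Is there a good way to do this within a pandas DataFrame? So far I have only found solutions where "Answer" would be its own file processed by plain Python code (without a pandas DataFrame). I realize I may need the NLTK library, but I have not used it before, so I don't know what the best approach would be. (Here is a good example: Extracting a word and its prior 10 word context to a dataframe in Python)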

2 answers:

Answer 0: (score: 1)

This might work:

import pandas as pd
import re

df = pd.read_csv('data.csv')

for value in df.Answer.values:
    # split on the whole word "data" (any case); parts[i] is the text before the i-th match
    parts = re.split(r'\bdata\b', value, flags=re.IGNORECASE)
    result = []
    for i, part in enumerate(parts):
        words = part.split()
        if i > 0 and words:  # text that follows a match: keep its first three words
            result.append(' '.join(words[:3]))
        if i < len(parts) - 1 and words:  # text that precedes a match: keep its last three words
            result.append(' '.join(words[-3:]))
    print(result)
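
Since the question asks for something that stays inside the DataFrame, here is a minimal sketch of the same idea wrapped in a function and applied to the column. The helper name context_words, its keyword and n parameters, and the data_context column are made up for illustration, and the frame is rebuilt inline from the two example rows instead of being read from a CSV file.

```python
import re
import pandas as pd


def context_words(text, keyword="data", n=3):
    """Return a list of (before, after) strings, one per whole-word match of keyword."""
    tokens = re.findall(r"\w+", text)  # keep word characters only, drop punctuation
    contexts = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword:
            before = " ".join(tokens[max(0, i - n):i])   # up to n words before the match
            after = " ".join(tokens[i + 1:i + 1 + n])    # up to n words after the match
            contexts.append((before, after))
    return contexts


# The two example rows from the question (assumed column names).
df = pd.DataFrame({
    "Question_ID": [1, 2],
    "Customer_ID": [234, 234],
    "Answer": [
        "Data is very important to use because ...",
        "We value data since we need it ...",
    ],
})

# Store the contexts next to the original text.
df["data_context"] = df["Answer"].apply(context_words)
print(df["data_context"].tolist())
```

Each cell of the new column holds a list, so an answer that mentions "data" several times keeps one (before, after) pair per mention.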

Answer 1: (score: 0)

A solution using a generator expression together with the re.findall and itertools.chain.from_iterable functions:

import pandas as pd, re, itertools

data = pd.read_csv('test.csv')  # change with your current file path

data_adjacents = ((i for sublist in (list(filter(None,t))
                         for t in re.findall(r'(\w*?\s*\w*?\s*\w*?\s+)(?=\bdata\b)|(?<=\bdata\b)(\s+\w*\s*\w*\s*\w*)', l, re.I)) for i in sublist)
                            for l in data.Answer.tolist())

print(list(itertools.chain.from_iterable(data_adjacents)))

Output:

[' is very important', 'We value ', ' since we need']
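
A similar regex can also be run through pandas' own string accessor. The following is only a sketch, assuming the example rows from the question: it uses Series.str.extractall with two named groups (before and after are names chosen here, not part of the original answer), which returns one row per match, indexed by the original row and the match number.

```python
import re
import pandas as pd

# The two example rows from the question (assumed column names).
df = pd.DataFrame({
    "Answer": [
        "Data is very important to use because ...",
        "We value data since we need it ...",
    ],
})

# Up to three words before and up to three words after the whole word "data".
pattern = r"(?P<before>(?:\w+\s+){0,3})\bdata\b(?P<after>(?:\s+\w+){0,3})"

matches = df["Answer"].str.extractall(pattern, flags=re.IGNORECASE)

# Trim surrounding whitespace; fillna guards against groups that captured nothing.
matches = matches.apply(lambda col: col.str.strip()).fillna("")
print(matches)
```

Because the result keeps the original row index as the first index level, it can be joined back to the source frame if the other columns are needed.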