Python: extract sentences containing a specific word

Time: 2014-11-22 06:55:14

Tags: python regex nltk

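I have a JSON file containing the following text:

dr. goldberg offers everything. parking is good. he's nice and easy to talk

How can I extract only the sentence that contains the keyword "parking"? I don't need the other two sentences.

I tried: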

with open("test_data.json") as f:
    for line in f:
        if "parking" in line:
            print(line)

It prints the whole text instead of the specific sentence.

I even tried using a regular expression:

import re

with open("test_data.json") as f:
    for line in f:
        line = line.rstrip()
        if re.search('parking', line):
            print(line)

Even this shows the same result.

3 Answers:

Answer 0: (score: 4)

You can use nltk.tokenize:

from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
with open("test_data.json") as f:
    text = f.read()
sentences = sent_tokenize(text)
# Keep every sentence whose tokens include the target word.
my_sentence = [sent for sent in sentences if 'parking' in word_tokenize(sent)]

For a more complete approach, you can wrap this in a function:

>>> def sentence_finder(text,word):
...    sentences=sent_tokenize(text)
...    return [sent for sent in sentences if word in word_tokenize(sent)]

>>> s="dr. goldberg offers everything. parking is good. he's nice and easy to talk"
>>> sentence_finder(s,'parking')
['parking is good.']

Answer 1: (score: 0)

You can use the standard library re module:

import re

line = "dr. goldberg offers everything.parking is good.he's nice and easy to talk"
# Optional leading dot, then a run of non-dot characters that contains "parking".
res = re.search(r"\.?([^.]*parking[^.]*)", line)
if res is not None:
    print(res.group(1))

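It prints parking is good

The idea is simple: the search starts at an optional dot character ., then consumes any non-dot characters, the word parking, and the remaining non-dot characters.

The question mark handles the case where your sentence is at the start of the line.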
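If more than one sentence may mention the word, the same idea extends to collecting every match. The following is only a sketch: the helper name find_sentences and the extra "no parking here" sentence are made up for illustration, and it still treats every dot as a sentence boundary, so abbreviations such as "dr." are split as well. The \b anchors are an addition so that a longer word like "parkinglot" does not count as a match.

```python
import re

def find_sentences(line, word):
    # Hypothetical helper: return every dot-delimited fragment containing `word`.
    # re.escape guards against regex metacharacters in the search word.
    pattern = r"[^.]*\b" + re.escape(word) + r"\b[^.]*"
    return [m.group(0).strip() for m in re.finditer(pattern, line)]

# Sample text from the answer, with one extra sentence added for illustration.
line = ("dr. goldberg offers everything.parking is good."
        "he's nice and easy to talk.no parking here")
print(find_sentences(line, "parking"))
# ['parking is good', 'no parking here']
```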

Answer 2: (score: 0)

How about parsing the JSON and looking at the string values?

import json

def sen_or_none(string):
  # Return the fragment if it mentions "parking" (case-insensitive), else None.
  return string if "parking" in string.lower() else None

def walk(node):
  # Recursively search lists, dicts and strings; return the first match found.
  if isinstance(node, list):
    for item in node:
      v = walk(item)
      if v:
        return v
  elif isinstance(node, dict):
    for key, item in node.items():
      v = walk(item)
      if v:
        return v
  elif isinstance(node, str):  # Python 3; on Python 2 use basestring
    for item in node.split("."):
      v = sen_or_none(item)
      if v:
        return v
  return None

with open('data.json') as data_file:
  print(walk(json.load(data_file)))
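Note that walk stops at the first match. If the file may hold several matching sentences, a generator variant could collect them all. This is a rough sketch for Python 3: the name iter_sentences and the sample data are assumptions made for illustration, since the question does not show the real layout of the JSON file.

```python
import json

def iter_sentences(node, word="parking"):
    # Hypothetical helper: yield every '.'-delimited fragment containing `word`,
    # searching nested dicts, lists and strings.
    if isinstance(node, dict):
        for value in node.values():
            yield from iter_sentences(value, word)
    elif isinstance(node, list):
        for item in node:
            yield from iter_sentences(item, word)
    elif isinstance(node, str):
        for fragment in node.split("."):
            if word.lower() in fragment.lower():
                yield fragment.strip()

# Made-up sample data; the structure is an assumption, not the asker's file.
sample = json.loads("""
{"reviews": [
  {"text": "dr. goldberg offers everything. parking is good. he's nice and easy to talk"},
  {"text": "Free parking nearby. Friendly staff."}
]}
""")
print(list(iter_sentences(sample)))
# ['parking is good', 'Free parking nearby']
```

Like the original, this splits on every dot, so it is only as good as that rule for finding sentence boundaries.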