Extract a given number of words surrounding a given word in a long string (paragraph) in Python 2.7

Time: 2017-04-17 10:33:20

Tags: python python-2.7 numbers extract words

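I am trying to extract a chosen number of words surrounding a given word. Let me illustrate with an example:

string = "Education shall be directed to the full development of the human personality and to the strengthening of respect for human rights and fundamental freedoms."

1) The selected word is development, and I need the 6 words around it, which gives: [to, the, full, of, the, human]

2) But if the selected word is in the first or second position, I still need to get 6 words, for example:

The selected word is shall, and I should get: [Education, be, directed, to, the, full]

I am supposed to use the 're' module. What I have found so far is: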

import re

def search(text, n):
    '''Searches for text, and retrieves n words either side of the text, which are returned separately'''
    word = r"\W*([\w]+)"
    # Note: the target word ('place') is hard-coded in the pattern
    groups = re.search(r'{}\W*{}{}'.format(word*n, 'place', word*n), text).groups()
    return groups[:n], groups[n:]

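But it only helps with the first case. I would be very grateful if someone could help me solve this. Thanks in advance!

5 answers:

Answer 0: (score: 1)

There may be an off-by-one error somewhere, but I think this meets your spec. I have removed the punctuation; it is probably best to strip it before passing the string in for analysis. I assumed that case does not matter.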

test_str = "Education shall be directed to the full development of the human personality and to the strengthening of respect for human rights and fundamental freedoms"

def get_surrounding_words(search_word, s, n_words):
    words = s.lower().split(' ')
    try:
        i = words.index(search_word)
    except ValueError:
        return []
    # Word is near start
    if i < n_words // 2:
        words.pop(i)
        return words[:n_words]
    # Word is near end
    elif i >= len(words) - n_words // 2:
        words.pop(i)
        return words[-n_words:]
    # Word is in middle
    else:
        words.pop(i)
        return words[i - n_words // 2:i + n_words // 2]

def test(word):
    print('{}: {}'.format(word, get_surrounding_words(word, test_str, 6)))

test('notfound')
test('development')
test('shall')
test('education')
test('fundamental')
test('for')
test('freedoms')
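
The advice above about stripping punctuation and ignoring case before the lookup can be sketched as a small preprocessing step. This is only an illustration: the helper name `normalize` is hypothetical and not part of the answer, and it assumes that every character that is neither a word character nor whitespace can simply be dropped.

```python
import re

def normalize(s):
    """Lowercase s and drop punctuation (hypothetical helper, not from the answer)."""
    # Assumption: anything that is not a word character or whitespace is punctuation.
    return re.sub(r"[^\w\s]", "", s.lower())

cleaned = normalize("Rights and fundamental freedoms.")
print(cleaned.split())
```

Note that this also removes hyphens and apostrophes inside words, so it may need adjusting for text where those matter.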

Answer 1: (score: 1)

This extracts every occurrence of the target word in the text, together with its context:

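
As a hedged illustration of extracting every occurrence of a target word with its context (this is not the answerer's own code), one possible sketch tokenizes with the `re` module and slides a clamped window over the word list. The function name `contexts`, the case-insensitive matching, and the choice to shift the window near either end of the text are all assumptions.

```python
import re

def contexts(text, target, n):
    """Return, for every occurrence of target, up to n surrounding words."""
    # Assumption: a "word" is a run of word characters, so punctuation is dropped.
    words = re.findall(r"\w+", text)
    results = []
    for i, w in enumerate(words):
        if w.lower() != target.lower():
            continue
        # Clamp the window start so that n + 1 words fit whenever the text is long enough.
        start = max(0, min(i - n // 2, len(words) - (n + 1)))
        window = words[start:start + n + 1]
        # Drop the target itself, keeping only its neighbours.
        results.append(window[:i - start] + window[i - start + 1:])
    return results

sentence = ("Education shall be directed to the full development of the human "
            "personality and to the strengthening of respect for human rights "
            "and fundamental freedoms.")
for found in contexts(sentence, "human", 6):
    print(found)
```

A word that does not occur yields an empty list, and a word near the start or end still gets n neighbours as long as the text contains at least n + 1 words.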

Answer 2: (score: 0)

import sys

args = sys.argv[1:]
if len(args) != 2:
    sys.exit("Use with <string> <query>")
text = args[0]
query = args[1]
words = text.split()
op = []
left = 3
right = 3
try:
    index = words.index(query)
    if index <= left:
        start = 0
    else:
        start = index - left

    if start + left + right + 1 > len(words):
        start = len(words) - left - right - 1
        if start < 0:
            start = 0

    while len(op) < left + right and start < len(words):
        if start != index:
            op.append(words[start])
        start += 1
except ValueError:
    pass
print op
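  • How does this work?
    1. Find the word in the string
    2. See whether left + right words can be taken starting from its index
    3. Take the left/right number of words and save them in op
    4. print op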
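
The steps above can also be sketched as a reusable function instead of a command-line script. This is a minimal sketch under assumptions: the name `surrounding` and its default arguments are hypothetical, only the first occurrence is considered, and words are split on whitespace, so trailing punctuation stays attached to a word.

```python
def surrounding(text, query, left=3, right=3):
    """Return up to left + right words around the first occurrence of query."""
    words = text.split()
    try:
        index = words.index(query)  # step 1: find the word
    except ValueError:
        return []
    total = left + right
    # Step 2: start `left` words before the match, shifted to stay inside the list.
    start = max(0, min(index - left, len(words) - total - 1))
    # Step 3: take the neighbours plus the match, then drop the match itself.
    window = words[start:start + total + 1]
    window.pop(index - start)
    return window

text = "Education shall be directed to the full development of the human personality"
print(surrounding(text, "development"))
print(surrounding(text, "Education"))
```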

Answer 3: (score: 0)

A simple way to solve the problem: first split the text into words, then pick words from the left and from the right.

def custom_search(sentence, word, n):     
    given_string = sentence
    given_word = word
    total_required = n
    word_list = given_string.strip().split(" ")
    length_of_words = len(word_list)

    output_list = []
    given_word_position = word_list.index(given_word)
    word_from_left = 0
    word_from_right = 0

    if given_word_position + 1 > total_required // 2:
        word_from_left = total_required // 2
        if given_word_position + 1 + (total_required // 2) <= length_of_words:
            word_from_right = total_required // 2
        else:
            word_from_right = length_of_words - (given_word_position + 1)
            remaining_words = (total_required // 2) - word_from_right
            word_from_left += remaining_words

    else:
        word_from_right = total_required // 2
        word_from_left = given_word_position
        if word_from_left + word_from_right < total_required:
            remaining_words = (total_required // 2) - word_from_left
            word_from_right += remaining_words

    required_words = []
    for i in range(given_word_position - word_from_left, word_from_right + 
    given_word_position + 1):
        if i != given_word_position:
            required_words.append(word_list[i])
    return required_words


sentence = "Education shall be directed to the full development of the human personality and to the strengthening of respect for human rights and fundamental freedoms."
custom_search(sentence, "shall", 6)

>>['Education', 'be', 'directed', 'to', 'the', 'full']


custom_search(sentence, "development", 6)

>>['to', 'the', 'full', 'of', 'the', 'human'] 

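Answer 4: (score: 0)

I don't think a regular expression is needed here. Assuming the text is well formed, just split it into an array of words and write a few if-else statements to make sure it retrieves the required number of surrounding words: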

def search(text, word, n):
    # text is the string you are searching
    # word is the word you are looking for
    # n is the TOTAL number of words you want surrounding the word

    words    = text.split(" ")  # Create an array of words from the string
    position = words.index(word)   # Find the position of the desired word

    words_after = len(words) - position - 1  # How many words are after the word in the text

    if position < n // 2 + n % 2:  # If there aren't enough words before...
        return words[:position], words[position + 1:n + 1]

    elif words_after < n // 2:  # If there aren't enough words after...
        return words[max(0, position - (n - words_after)):position], words[position + 1:]

    else:  # Otherwise, split the words across both sides (the extra word comes from the left if n is odd)
        return words[position - n // 2 - n % 2:position], words[position + 1:position + 1 + n // 2]

string = "Education shall be directed to the full development of the human personality and to the \
strengthening of respect for human rights and fundamental freedoms."

print search(string, "shall", 6)
# >> (['Education'], ['be', 'directed', 'to', 'the', 'full'])

print search(string, "human", 5)
# >> (['development', 'of', 'the'], ['personality', 'and'])

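In your example you did not include the target word in the output, so I left it out as well. If you would prefer to have the target word included, just merge the two arrays that the function returns (joining them at position).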
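
Merging the two returned lists around the target word might look like the sketch below. The helper name `with_target` is hypothetical, and the two lists are copied from the example output shown above rather than computed here.

```python
def with_target(left_words, word, right_words):
    """Join the two context lists around the target word (hypothetical helper)."""
    return left_words + [word] + right_words

# The pair of lists that search(string, "shall", 6) is shown returning above.
left_words, right_words = ['Education'], ['be', 'directed', 'to', 'the', 'full']
print(" ".join(with_target(left_words, "shall", right_words)))
```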

Hope this helps!