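Grabbing keywords and the text between keywords in Python

Time: 2011-09-19 22:06:04

Tags: python search for-loop keyword

I'd like to start by saying that this place has helped me more than I could ever repay. I want to thank everyone who has helped me in the past :).

I am trying to pull some text out of a message with a particular layout. It is formatted as follows: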

DATA|1|TEXT1|STUFF: some random text|||||
DATA|2|TEXT1|THINGS: some random text and|||||
DATA|3|TEXT1|some more random text and stuff|||||
DATA|4|TEXT1|JUNK: crazy randomness|||||
DATA|5|TEXT1|CRAP: such random stuff I cant believe how random|||||

I have the code shown below, which combines the text, adding a space between the pieces, and stores it in a string called "TEXT", so it looks like this:

STUFF: some random text THINGS: some random text and some more random text and stuff JUNK: crazy randomness CRAP: such random stuff I cant believe how random
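
For readers without access to the segment API, the flattening step can be sketched on plain strings. This is a minimal sketch that assumes each message line is a pipe-delimited string whose fourth field holds the text; `RAW_LINES` and `flatten_segments` are illustrative names that do not appear in the original code.

```python
# Sample input copied from the message shown above.
RAW_LINES = [
    "DATA|1|TEXT1|STUFF: some random text|||||",
    "DATA|2|TEXT1|THINGS: some random text and|||||",
    "DATA|3|TEXT1|some more random text and stuff|||||",
    "DATA|4|TEXT1|JUNK: crazy randomness|||||",
    "DATA|5|TEXT1|CRAP: such random stuff I cant believe how random|||||",
]

def flatten_segments(lines):
    """Join the fourth pipe-delimited field of every DATA line with single spaces."""
    fields = []
    for line in lines:
        parts = line.split("|")
        # Assumption: the text always lives in field 4 (index 3) of a DATA line.
        if parts[0] == "DATA" and len(parts) > 3:
            fields.append(parts[3].strip())
    return " ".join(fields)

TEXT = flatten_segments(RAW_LINES)
print(TEXT)
```

Collecting the fields in a list and joining once also avoids the leading space that repeated `TEXT = TEXT + " " + ...` produces.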

I need it formatted like this:

DATA|1|TEXT1|STUFF: |||||
DATA|2|TEXT1|some random text|||||
DATA|3|TEXT1|THINGS: |||||
DATA|4|TEXT1|some random text and|||||
DATA|5|TEXT1|some more random text and stuff|||||
DATA|6|TEXT1|JUNK: |||||
DATA|7|TEXT1|crazy randomness|||||
DATA|8|NEWTEXT|CRAP: |||||
DATA|9|NEWTEXT|such random stuff I cant believe how random|||||

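The line numbers are easy; I already have those done, as well as the carriage returns. What I need is to grab "CRAP" and change the "TEXT1" part to "NEWTEXT".

My code scans the string looking for keywords, puts each keyword on its own line, adds the following text below it, then puts the next keyword on its own line, and so on. This is the code I have so far: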

#this combines all text to one line and adds to a string
while current_segment.move_next('DATA'):
    TEXT = TEXT + " " + current_segment.field(4).value

KEYWORD_LIST  = ['STUFF:', 'THINGS:', 'JUNK:']
KEYWORD_LIST1 = ['CRAP:']

#this splits the words up to search through
TEXT_list = TEXT.split(' ')

#this searches for the first few keywords then stops at the unwanted one
for word in TEXT_list:
    if word in KEYWORD_LIST:
        my_output = my_output + word
    elif word in KEYWORD_LIST1:
        break
    else:
        my_output = my_output + ' ' + word

#this searches for the unwanted keywords leaving the output blank until it reaches the wanted keyword
for word1 in TEXT_list:
    if word1 in KEYWORD_LIST:
        my_output1 = ''
    elif word1 in KEYWORD_LIST1:
        my_output1 = my_output1 + word1 + '\n'
    else:
        my_output1 = my_output1 + ' ' + word1

#my_output is formatted back the way I want, dividing up the text into lines of 65 characters or fewer

MAX_LENGTH = 65
my_wrapped_output  = wrap(my_output,MAX_LENGTH)
my_wrapped_output1 = wrap(my_output1,MAX_LENGTH)
my_output_list     = my_wrapped_output.split('\n')
my_output_list1    = my_wrapped_output1.split('\n')

for phrase in my_output_list:
     if phrase == "":
          SetID +=1
          output = output + "DATA|" + str(SetID) + "|TEXT| |||||"
     else:
          SetID +=1
          output = output + "DATA|" + str(SetID) + "|TEXT|" + phrase + "|||||"

for phrase2 in my_output_list1:
     if phrase2 == "":
          SetID +=1
          output = output + "DATA|" + str(SetID) + "|NEWTEXT| |||||"
     else:
          SetID +=1
          output = output + "DATA|" + str(SetID) + "|NEWTEXT|" + phrase2 + "|||||"

#this populates the fields I need
value = output

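I then format "my_output" and "my_output1", adding the word "NEWTEXT". This code goes through each line looking for a keyword, then puts that keyword and a carriage return on its own line. Once it reaches a keyword from "KEYWORD_LIST1", it stops and drops the rest of the text, then starts the next loop. My problem is that the code above gives me this: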

DATA|1|TEXT1|STUFF: |||||
DATA|2|TEXT1|some random text|||||
DATA|3|TEXT1|THINGS: |||||
DATA|4|TEXT1|some random text and|||||
DATA|5|TEXT1|some more random text and stuff|||||
DATA|6|TEXT1|JUNK: |||||
DATA|7|TEXT1|crazy randomness|||||
DATA|8|NEWTEXT|crazy randomness|||||
DATA|9|NEWTEXT|CRAP: |||||
DATA|10|NEWTEXT|such random stuff I cant believe how random|||||

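It grabs the text that comes before the "KEYWORD_LIST1" keyword and adds it to the NEWTEXT section. I know there is a way to make groups of a keyword and the text after it, but I don't understand how to implement it. Any help would be greatly appreciated.

Thanks.

This is what I had to do to get it working for me: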

KEYWORD_LIST  = ['STUFF:', 'THINGS:', 'JUNK:']
KEYWORD_LIST1 = ['CRAP:']

def text_to_message(text):
    result=[]
    for word in text.split():
        if word in KEYWORD_LIST or word in KEYWORD_LIST1:
            if result:
                yield ' '.join(result)
                result=[]
            yield word
        else:
            result.append(word)
    if result:
        yield ' '.join(result)

def format_messages(messages):
    title='TEXT1'
    for message in messages:
        if message in KEYWORD_LIST:
            title='TEXT1'
        elif message in KEYWORD_LIST1:
            title='NEWTEXT'
        my_wrapped_output  = wrap(message,MAX_LENGTH)
        my_output_list     = my_wrapped_output.split('\n')
        for line in my_output_list:
            if line == '':
                yield title + '|'
            else:
                yield title + '|' + line

for line in format_messages(text_to_message(TEXT)):
    if line == '':
        SetID +=1
        output = output + "DATA|" + str(SetID) + "|"
    else:
        SetID +=1
        output = output + "DATA|" + str(SetID) + "|" + line

#this is needed instead of print(line)
value = output 
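
The `wrap` helper used above is not shown in the post. Because its result is split on '\n', it presumably returns one newline-joined string; under that assumption, the standard library's `textwrap.fill` is a close stand-in. A minimal sketch (the body of `wrap` here is a guess, not the original helper, and `sample` is made-up input):

```python
import textwrap

MAX_LENGTH = 65

def wrap(text, width):
    """Hypothetical stand-in: wrap text to `width` columns, joined by newlines."""
    return textwrap.fill(text, width=width)

# Long enough to need more than one 65-character line.
sample = "such random stuff I cant believe how random " * 3
wrapped_lines = wrap(sample, MAX_LENGTH).split("\n")

for line in wrapped_lines:
    print(line)
```

With this stand-in, wrapping an empty string gives `''`, so splitting it yields a single empty line, which is the case the `line == ''` checks above are guarding against.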

1 Answer:

Answer 0 (score: 1)

  1. General tip: don't try to build up strings incrementally like this:

    my_output = my_output + ' ' + word
    

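    Instead, make my_output a list, append word to the list, and then, at the very end, do a single join: my_output = ' '.join(my_output). (See the text_to_message code below for an example.) Using join is the right way to build strings. Delaying the creation of the string is useful because working with a list of substrings is more pleasant than splitting and rejoining strings and having to add spaces and carriage returns here and there.

  2. Study generators. They are easy to understand and can help you a lot when processing text like this.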


    import textwrap
    
    KEYWORD_LIST  = ['STUFF:', 'THINGS:', 'JUNK:']
    KEYWORD_LIST1 = ['CRAP:']
    
    def text_to_message(text):
        result=[]
        for word in text.split():
            if word in KEYWORD_LIST or word in KEYWORD_LIST1:
                if result:
                    yield ' '.join(result)
                    result=[]
                yield word
            else:
                result.append(word)
        if result:
            yield ' '.join(result)
    
    def format_messages(messages):
        title='TEXT1'
        num=1
        for message in messages:
            if message in KEYWORD_LIST:
                title='TEXT1'
            elif message in KEYWORD_LIST1:
                title='NEWTEXT'
            for line in textwrap.wrap(message,width=65):
                yield 'DATA|{n}|{t}|{l}'.format(n=num,t=title,l=line)
                num+=1
    
    TEXT='''STUFF: some random text THINGS: some random text and some more random text and stuff JUNK: crazy randomness CRAP: such random stuff I cant believe how random'''
    
    for line in format_messages(text_to_message(TEXT)):
        print(line)
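
The generator above leaves off the trailing `|||||` that the question's target format shows. Below is a minimal sketch of one way to add it, assuming the trailer is always five empty pipe-delimited fields. `with_trailing_fields` is a hypothetical helper, and the two generators are repeated only so the block runs on its own.

```python
import textwrap

KEYWORD_LIST  = ['STUFF:', 'THINGS:', 'JUNK:']
KEYWORD_LIST1 = ['CRAP:']

def text_to_message(text):
    """Yield each keyword and each run of non-keyword words as separate messages."""
    result = []
    for word in text.split():
        if word in KEYWORD_LIST or word in KEYWORD_LIST1:
            if result:
                yield ' '.join(result)
                result = []
            yield word
        else:
            result.append(word)
    if result:
        yield ' '.join(result)

def format_messages(messages):
    """Number the wrapped lines and switch the title when a KEYWORD_LIST1 word appears."""
    title = 'TEXT1'
    num = 1
    for message in messages:
        if message in KEYWORD_LIST:
            title = 'TEXT1'
        elif message in KEYWORD_LIST1:
            title = 'NEWTEXT'
        for line in textwrap.wrap(message, width=65):
            yield 'DATA|{n}|{t}|{l}'.format(n=num, t=title, l=line)
            num += 1

def with_trailing_fields(lines, trailer='|||||'):
    """Hypothetical helper: append the empty trailing fields to every line."""
    for line in lines:
        yield line + trailer

TEXT = ('STUFF: some random text THINGS: some random text and some more '
        'random text and stuff JUNK: crazy randomness CRAP: such random '
        'stuff I cant believe how random')

lines = list(with_trailing_fields(format_messages(text_to_message(TEXT))))
for line in lines:
    print(line)
```

Because the text is rewrapped at 65 columns, "some random text and some more random text and stuff" fits on one line here, so this produces eight lines rather than the nine in the question's target. Keeping the original line breaks would mean splitting on them before the text is flattened.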