Remove a word from the beginning of my text object?

Time: 2015-11-10 18:27:33

Tags: python text nlp nltk

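I have a function that scrapes speeches from millercenter.org and returns the processed speech. However, every one of my speeches has the word "transcript" at the beginning (that's just how it's coded into the HTML). So, all of my text files look like this: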

\n <--- there's really just a new line, here, not literally '\n'
transcript

fourscore and seven years ago, blah blah blah

I have these saved on my U:/ drive. How can I iterate through these files and remove the 'transcript'? The files look like this, basically:

link

Edit:

speech_dict = {}
for filename in glob.glob("U:/FALL 2015/ENGL 305/NLP Project/Speeches/*.txt"):
    with open(filename, 'r') as inputFile:
        filecontent = inputFile.read();
        filecontent.replace('transcript','',1)
        speech_dict[filename] = filecontent # put the speeches into a dictionary to run through the algorithm

This fails to change my speeches. The 'transcript' is still there.

I also tried putting it into my text-processing function, but that doesn't work either:

def processURL(l):
        open_url = urllib2.urlopen(l).read()
        item_soup = BeautifulSoup(open_url)
        item_div = item_soup.find('div',{'id':'transcript'},{'class':'displaytext'})
        item_str = item_div.text.lower()
        item_str_processed = punctuation.sub(' ',item_str)
        item_str_processed_final = item_str_processed.replace('—',' ').replace('transcript','',1)

        splitlink = l.split("/")
        president = splitlink[4]
        speech_num = splitlink[-1]
        filename = "{0}_{1}".format(president, speech_num)

        return filename, item_str_processed_final # giving back filename and the text itself

Here is a sample URL that I run through processURL: http://millercenter.org/president/harding/speeches/speech-3805

2 Answers:

Answer 0: (score: 3)

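You can use Python's excellent replace() for this:

data = data.replace('transcript', '', 1)

This line replaces 'transcript' with '' (the empty string). The last argument is the number of replacements to make: 1 replaces only the first instance of 'transcript'; leave it out to replace all instances. Note that Python strings are immutable, so replace() returns a new string and the result has to be assigned back. Calling filecontent.replace(...) on a line by itself, as in the question, throws the result away.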
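Applied to the loop from the question, this might look like the sketch below. The function names remove_first_word and load_speeches are made up for illustration, and the lstrip() call (which drops the blank lines left behind at the top of the file) is an added assumption, not something the question asked for.

```python
import glob


def remove_first_word(text, word='transcript'):
    """Return `text` with the first occurrence of `word` removed."""
    # replace() returns a new string; the original is left untouched.
    cleaned = text.replace(word, '', 1)
    # Assumption: leading blank lines are not wanted either.
    return cleaned.lstrip()


def load_speeches(pattern):
    """Read every file matching `pattern` into a {filename: cleaned text} dict."""
    speech_dict = {}
    for filename in glob.glob(pattern):
        with open(filename, 'r') as input_file:
            # Keep the value returned by the helper instead of discarding it.
            speech_dict[filename] = remove_first_word(input_file.read())
    return speech_dict
```

With the directory from the question, the call would be load_speeches("U:/FALL 2015/ENGL 305/NLP Project/Speeches/*.txt").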

Answer 1: (score: 2)

If you know that the data you want always starts at line x, then do this:

with open('filename.txt', 'r') as fin:
    for _ in range(x): # This loop will skip x no. of lines.
        next(fin)
    for line in fin:
        # do something with the line.
        print(line)
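An equivalent way to skip a fixed number of lines is itertools.islice from the standard library. This is only a sketch: the function name lines_from and the sample text are invented for illustration, and an in-memory io.StringIO stands in for the real file.

```python
import io
from itertools import islice


def lines_from(lines, x):
    """Return an iterator over `lines` that starts after the first x items."""
    return islice(lines, x, None)


# Stand-in for an open file: blank line, 'transcript', blank line, then the speech.
sample = io.StringIO("\ntranscript\n\nfourscore and seven years ago\n")
for line in lines_from(sample, 3):
    print(line, end='')
```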

Or, suppose you want to drop every line up to and including the transcript line:

with open('filename.txt', 'r') as fin:
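    for line in fin: # This loop skips lines until it reads the *transcript* line.
        if line.strip() == 'transcript': # strip() drops the trailing newline before comparing
            break
    # if you want to skip the empty line after *transcript*
    next(fin, None) # skips the next line (None avoids StopIteration at end of file).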
    for line in fin:
        # do something with the line.
        print(line)
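The skip-until-marker idea can also be wrapped in a small generator so it works on any iterable of lines. This is a sketch under assumptions: the name lines_after_marker and the skip_blank option are hypothetical, the marker is taken to be a line whose stripped content is exactly 'transcript', and nothing is yielded if the marker never appears.

```python
import io


def lines_after_marker(lines, marker='transcript', skip_blank=True):
    """Yield the lines that follow the first line equal to `marker`."""
    iterator = iter(lines)
    for line in iterator:
        if line.strip() == marker:
            break
    else:
        return  # marker never found: yield nothing
    for line in iterator:
        if skip_blank and not line.strip():
            continue  # drop blank lines directly after the marker
        skip_blank = False  # keep any blank lines inside the speech itself
        yield line


# Stand-in for an open file with the layout described in the question.
sample = io.StringIO("\ntranscript\n\nfourscore and seven years ago\nblah blah blah\n")
remaining = list(lines_after_marker(sample))
print(''.join(remaining), end='')
```

Because the function only needs an iterable, an open file object can be passed in directly in place of the StringIO sample.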