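Split paragraphs into sentences in Python and link them back to IDs

Time: 2016-02-29 21:09:13

Tags: python split sentence

I have two lists: one with IDs, and another with the corresponding comments for each ID.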

list_responseid = ['id1', 'id2', 'id3', 'id4'] 

list_paragraph = [['I like working and helping them reach their goals.'],
 ['The communication is broken.',
  'Information that should have come to me is found out later.'],
 ['Try to promote from within.'],
 ['I would relax the required hours to be available outside.',
  'We work a late night each week.']]

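Response ID 'id1' is associated with the paragraph ('I like working and helping them reach their goals.'), and so on.

I can flatten the paragraphs into a single list of sentences with the following: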

import itertools

list_sentence = list(itertools.chain(*list_paragraph))

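What is the syntax to get the final result: a DataFrame (or list) with one entry per sentence, where each sentence carries the ID it belongs to (right now the ID is linked to the whole paragraph)? The final result would look like this (I will convert the list to a pandas DataFrame at the end).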

id1 'I like working with students and helping them reach their goals.'
id2 'The communication from top to bottom is broken.'
id2 'Information that should have come to me is found out later and in some cases students know more about what is going on than we do!'
id3 'Try to promote from within.'
id4 'I would relax the required 10 hours to be available outside of 8 to 5 back to 9 to 5 like it used to be.'
id4 'We work a late night each week and rarely do students take advantage of those extended hours.'

Thanks.

2 answers:

Answer 0: (score: 1)

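If you do this often, it would be cleaner, and possibly more efficient depending on the size of the lists, to write a dedicated function with two regular nested loops. But if you need a quick one-liner, this does the job: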

id_sentence_tuples = [(list_responseid[id_list_idx], sentence) for id_list_idx in range(len(list_responseid)) for sentence in list_paragraph[id_list_idx]]

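id_sentence_tuples will be a list of tuples in which each element is a pair (paragraph_id, sentence), just like your expected result. I also suggest checking that both lists have the same length before doing this, so that you get a meaningful error if they don't.

if len(list_responseid) != len(list_paragraph):
    raise IndexError('Lists must have same cardinality')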
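
The dedicated function with two nested loops mentioned above could look roughly like the following. This is only a minimal sketch: the function name link_sentences_to_ids is made up for illustration, and raising ValueError instead of IndexError for the length check is my own choice.

```python
def link_sentences_to_ids(ids, paragraphs):
    """Return a list of (id, sentence) pairs, one pair per sentence.

    Hypothetical helper: the name and the ValueError are illustrative choices.
    """
    if len(ids) != len(paragraphs):
        raise ValueError('Lists must have same cardinality')
    pairs = []
    # Outer loop: one ID together with its list of sentences
    for response_id, sentences in zip(ids, paragraphs):
        # Inner loop: one output row per sentence, tagged with that ID
        for sentence in sentences:
            pairs.append((response_id, sentence))
    return pairs


list_responseid = ['id1', 'id2', 'id3', 'id4']
list_paragraph = [['I like working and helping them reach their goals.'],
                  ['The communication is broken.',
                   'Information that should have come to me is found out later.'],
                  ['Try to promote from within.'],
                  ['I would relax the required hours to be available outside.',
                   'We work a late night each week.']]

id_sentence_tuples = link_sentences_to_ids(list_responseid, list_paragraph)
for response_id, sentence in id_sentence_tuples:
    print(response_id, repr(sentence))
```

If you want a DataFrame at the end, a list of tuples like this can be passed to pd.DataFrame together with a columns argument such as ['ID', 'Sentence'].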

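Answer 1: (score: 0)

I have a DataFrame with IDs and reviews (col = ['ID', 'Review']). If you can combine your lists into a DataFrame, you can use my approach. I use nltk to split the reviews into sentences and then link them back to the ID in a loop. Here is code you can use.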

## Breaking feedback into sentences
import nltk
import pandas as pd

# sent_tokenize needs NLTK's Punkt tokenizer data to be installed (see nltk.download)
rows = []
count = 0
for index, row in df.iterrows():
    feedback = row['Review']
    sent_text = nltk.sent_tokenize(feedback)  # this gives us a list of sentences
    for sentence in sent_text:
        # Collect rows in a list; DataFrame.append was removed in pandas 2.0
        rows.append({'ID': row['ID'], 'Count': count, 'Sentence': sentence})
        count = count + 1
# Build the DataFrame once, which is also faster than appending row by row
df_sentences = pd.DataFrame(rows, columns=['ID', 'Count', 'Sentence'])
print(df_sentences)
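
To get from the two lists in the question to a DataFrame with 'ID' and 'Review' columns, something like the following might work. It is a rough sketch and assumes each ID's sentences should simply be joined with a space into one review string; the question's lists are already split into sentences, so the join only reconstructs the paragraph text for the tokenizer.

```python
import pandas as pd

list_responseid = ['id1', 'id2', 'id3', 'id4']
list_paragraph = [['I like working and helping them reach their goals.'],
                  ['The communication is broken.',
                   'Information that should have come to me is found out later.'],
                  ['Try to promote from within.'],
                  ['I would relax the required hours to be available outside.',
                   'We work a late night each week.']]

# Assumption: joining with a single space is an acceptable way to rebuild each review
df = pd.DataFrame({
    'ID': list_responseid,
    'Review': [' '.join(sentences) for sentences in list_paragraph],
})
print(df)
```

The resulting df has one row per ID and can be fed to a loop like the one above.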