Joining sentences from a list in Python 3

Time: 2018-04-26 17:03:20

Tags: python regex list join gensim

I am trying to join a list of appended sentences into one large string so that I can use it as input for the Gensim summarize module. However, when I try this, it says the input has fewer than two sentences. When I run a split on the text I can see multiple sentences, but each sentence is counted on its own rather than all of them together. The variable r is a string. I would like to concatenate the sentences into one large string that the Gensim summarize module can read.

Sample Code:

import re
from gensim.summarization import summarize
ruling_corpora  = re.findall("\.?([^\.].\*?I find[^\.]*\. |[^\.]*$In sum[^\.]*\. |[^\.]*$agree[^\.]*\.)", tokenized, re.I |re.DOTALL |re.M)[1:-1]

for r in ruling_corpora:                                   
    print(type(r))
    rc= ''.join(r)
    print(summarize(rc))

SAMPLE OUTPUT:

raise ValueError("input must have more than one sentence")
ValueError: input must have more than one sentence

Here is an example of my input I want to summarize with the Gensim summarizer. The numbers underneath each string represent the count of sentences ending in periods:

####Beginning of File### LUMB65.BL23607963.xml
Background Content: ANDERSON INITIAL DECISIONOn January 13, 2015, the appellant filed this appeal arguing that the agency's decision not to renew his term limited appointment which expired on January 28, 2015, is in error.  

 For the reasons discussed below, this appeal is DISMISSED for lack of jurisdiction without a hearing.
1
There is nothing in the agreement that curtails the agency's ability not to extend the term appointment. 
 IdIn reviewing the appellant's arguments, the appellant fails to establish that the Board has jurisdiction to review the agency's decision not to renew his time-limited appointment at issue in this appeal.
 Following a review of the record evidence, I find that the appellant has failed to non-frivolously allege Board jurisdiction over this appeal on any basis.
 Accordingly, this appeal must be dismissed for lack of jurisdiction.
1
####End of File### LUMB65.BL23607963.xml

1 Answer:

Answer 0 (score: 0)

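According to the documentation (emphasis mine):

  The input should be a string, and must be longer than INPUT_MIN_LENGTH sentences for the summary to make sense. The text will be split into sentences using the split_sentences method in the gensim.summarization.texcleaner module. Note that newlines divide sentences.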

Try using rc = '\n'.join(r). You can also check the result by calling gensim.summarization.textcleaner.split_sentences.
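
A minimal sketch of the newline join, assuming ruling_corpora is a list of sentence strings as in the question. Because each r is already a string, the sketch joins the whole list rather than each item. join_for_summarizer is a hypothetical helper name, and the gensim call is left as a comment because gensim.summarization only exists in gensim versions before 4.0.

```python
def join_for_summarizer(sentences):
    """Join sentence strings with newlines, dropping empty entries."""
    cleaned = [s.strip() for s in sentences if s.strip()]
    return "\n".join(cleaned)


# Illustrative data taken from the sample input in the question.
ruling_corpora = [
    " There is nothing in the agreement that curtails the agency's ability not to extend the term appointment. ",
    " Accordingly, this appeal must be dismissed for lack of jurisdiction.",
]

text = join_for_summarizer(ruling_corpora)
print(text)

# With gensim < 4.0 installed, the joined text could then be passed on:
# from gensim.summarization import summarize
# print(summarize(text))
```

Stripping each item first keeps stray leading spaces and blank entries from ending up as lines of their own.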

Also, your regex does not match the input you gave, and even if it did, you would be discarding the only two results with [1:-1]. Try this:

>>> [m[0] for m in re.findall(r'([^.]*?(I find|In sum|agree)[^.]*\.)', tokenized, re.I | re.DOTALL | re.M)]
["\n1\nThere is nothing in the agreement that curtails the agency's ability not to extend the term appointment.", '\n Following a review of the record evidence, I find that the appellant has failed to non-frivolously allege Board jurisdiction over this appeal on any basis.']

You may want to deal with the standalone numbers first, since they show up in the matches.
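
One way to do that, sketched under the assumption that the stray numbers always sit alone on their own line: remove those lines with re.sub before running the pattern. strip_standalone_numbers and find_rulings are hypothetical helper names, not part of any library.

```python
import re

# A line that holds nothing but digits (plus optional spaces or tabs).
NUMBER_LINE = re.compile(r"^[ \t]*\d+[ \t]*$\n?", re.M)

# Same pattern as in the answer above, compiled once.
RULING = re.compile(r"([^.]*?(I find|In sum|agree)[^.]*\.)", re.I | re.DOTALL)


def strip_standalone_numbers(text):
    """Remove lines that contain only a number."""
    return NUMBER_LINE.sub("", text)


def find_rulings(text):
    """Return the sentences that contain one of the trigger phrases."""
    cleaned = strip_standalone_numbers(text)
    # findall returns (whole_sentence, keyword) tuples; keep the sentence.
    return [m[0].strip() for m in RULING.findall(cleaned)]


sample = (
    "without a hearing.\n"
    "1\n"
    "There is nothing in the agreement.\n"
    " I find that X.\n"
    "1\n"
)
print(find_rulings(sample))
```

Note that no [1:-1] slice is applied to the result, so the first and last matches are kept.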