Reformat unstructured text into a single line after removing punctuation

Time: 2015-02-25 15:07:09

Tags: python string

I have some unstructured text that I want to convert to a single line, with all punctuation removed.

For the punctuation, I used the solution from "Best way to strip punctuation from a string in Python".
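
For reference, here is a minimal sketch of that punctuation-stripping step. It assumes Python 3 and handles ASCII punctuation only (`string.punctuation`). The helper name `strip_punctuation` is hypothetical and is not taken from the linked answer.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def strip_punctuation(text):
    """Return text with ASCII punctuation removed; whitespace is left untouched."""
    return text.translate(_PUNCT_TABLE)


print(strip_punctuation("Agency (CIA), June 14, 2002."))
# prints: Agency CIA June 14 2002
```

Line breaks survive this step, which is why the text still spans several lines afterwards.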


How can I reformat unstructured text into a single line using Python?


Example 1:

  

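The Bourne Identity is a 2002 spy film loosely based on Robert
Ludlum's novel of the same name. It stars Matt Damon as Jason Bourne,
an amnesiac attempting to discover his true identity amidst a
clandestine conspiracy within the Central Intelligence Agency (CIA)
to track him down and arrest or kill him for inexplicably failing
to carry out an officially unsanctioned assassination and then failing
to report back in afterwards. Along the way he teams up with Marie,
played by Franka Potente, who assists him on the initial part of his
journey to learn about his past and regain his memories. The film also
stars Chris Cooper as Alexander Conklin, Clive Owen as The Professor,
Brian Cox as Ward Abbott, and Julia Stiles as Nicky Parsons.

The film was directed by Doug Liman and adapted for the screen by Tony
Gilroy and William Blake Herron from the novel of the same name
written by Robert Ludlum, who also produced the film alongside Frank
Marshall. Universal Studios released the film to theaters in the
United States on June 14, 2002, and it received a positive critical and
public reaction. The film was followed by a 2004 sequel, The Bourne
Supremacy, and a third part released in 2007 entitled The Bourne
Ultimatum.

Plot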


Example 2:

12 (0) 0 4 (0)  38 (3) 0 3 (0) 0 1 (0)

Example 3:

Franklin Township is one of the eighteen townships of Monroe County, Ohio,
United States. The 2000 census found 453 people in the township, 367 of whom
lived in the unincorporated portions of the township.

 Geography

Located in the western part of the county, it borders the following townships:

The village of Stafford lies in southwestern Franklin Township.

 Name and history

It is one of twenty-one Franklin Townships statewide.

 Government

The township is governed by a three-member board of trustees, who are elected in
November of odd-numbered years to a four-year term beginning on the following
January 1. Two are elected in the year after the presidential election and one
is elected in the year before it. There is also an elected township clerk, who
serves a four-year term beginning on April 1 of the year after the election,
which is held in November of the year before the presidential election.
Vacancies in the clerkship or on the board of trustees are filled by the
remaining trustees.

As you can see from the examples above, the text comes in different formats. How can I convert each text to a single line?

1 Answer:

Answer 0 (score: 2)

This is simple: basically, in addition to the punctuation, you now also want to eliminate the line endings.

So you can do this:

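import string

# Characters to drop: ASCII punctuation plus newline, tab and carriage return.
exclude = set(string.punctuation + "\n\t\r")
# input_string is the text to clean (defined in the example that follows).
print(''.join(ch for ch in input_string if ch not in exclude))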
  

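input_string = """The Bourne Identity is a 2002 spy film loosely based on Robert Ludlum's novel of the same name. It stars Matt Damon as Jason Bourne, an amnesiac attempting to discover his true identity amidst a clandestine conspiracy within the Central Intelligence Agency (CIA) to track him down and arrest or kill him for inexplicably failing to carry out an officially unsanctioned assassination and then failing to report back in afterwards. Along the way he teams up with Marie, played by Franka Potente, who assists him on the initial part of his journey to learn about his past and regain his memories. The film also stars Chris Cooper as Alexander Conklin, Clive Owen as The Professor, Brian Cox as Ward Abbott, and Julia Stiles as Nicky Parsons.

The film was directed by Doug Liman and adapted for the screen by Tony Gilroy and William Blake Herron from the novel of the same name written by Robert Ludlum, who also produced the film alongside Frank Marshall. Universal Studios released the film to theaters in the United States on June 14, 2002, and it received a positive critical and public reaction. The film was followed by a 2004 sequel, The Bourne Supremacy, and a third part released in 2007 entitled The Bourne Ultimatum."""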

>>> print(''.join(ch for ch in input_string if ch not in exclude))
The Bourne Identity is a 2002 spy film loosely based on Robert Ludlums novel of the same name It stars Matt Damon as Jason Bourne an amnesiac attempting to discover his true identity amidst a clandestine conspiracy within the Central Intelligence Agency CIA to track him down and arrest or kill him for inexplicably failing to carry out an officially unsanctioned assassination and then failing to report back in afterwards Along the way he teams up with Marie played by Franka Potente who assists him on the initial part of his journey to learn about his past and regain his memories The film also stars Chris Cooper as Alexander Conklin Clive Owen as The Professor Brian Cox as Ward Abbott and Julia Stiles as Nicky ParsonsThe film was directed by Doug Liman and adapted for the screen by Tony Gilroy and William Blake Herron from the novel of the same name written by Robert Ludlum who also produced the film alongside Frank Marshall Universal Studios released the film to theaters in the United States on June 14 2002 and it received a positive critical and public reaction The film was followed by a 2004 sequel The Bourne Supremacy and a third part released in 2007 entitled The Bourne Ultimatum
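
Note that deleting the newline characters outright joins the last word of one line to the first word of the next (`ParsonsThe` in the output above). One possible variation removes the punctuation first and then collapses every run of whitespace into a single space. This is a sketch that assumes Python 3 and ASCII punctuation only, and the helper name `to_single_line` is hypothetical.

```python
import string

# Deletes ASCII punctuation only; non-ASCII punctuation (e.g. curly quotes) is kept.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def to_single_line(text):
    """Strip ASCII punctuation, then collapse all whitespace runs into single spaces."""
    no_punct = text.translate(_PUNCT_TABLE)
    # str.split() with no arguments splits on any whitespace and drops empty pieces,
    # so newlines, tabs and repeated spaces all become one space after the join.
    return " ".join(no_punct.split())


sample = "Julia Stiles as Nicky Parsons.\n\n     The film was directed by\nDoug Liman."
print(to_single_line(sample))
# prints: Julia Stiles as Nicky Parsons The film was directed by Doug Liman
```

Applied to Example 2 from the question, this returns `12 0 0 4 0 38 3 0 3 0 0 1 0`. The trade-off is that tabs and repeated spaces are normalized too, which may not be wanted if the column spacing carries meaning.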