How do I remove 2 consecutive newlines from a csv file in Python?

Time: 2017-04-14 03:35:48

Tags: python

I tried this code:

import re
re.sub('\r\n\r\n','','Summary_csv.csv')

It did nothing. It did not even touch the file (the file's date and time were not modified after running this code). Can someone explain why?

Then I tried this:

import re
output = open("Summary.csv","w", encoding="utf8")
input = open("Summary_csv.csv", encoding="utf8")

for line in input:
    output.write(re.sub('\r\n\r\n','', line))

input.close()
output.close()

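This one did do something to the file, because the file's modification date and time changed after I ran the code, but it did not remove the consecutive newlines, and the output is identical to the original file.

Edit: here is a small sample from the original csv file: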

"The UK’s Civil Aviation Authority (CAA) has announced new passenger charge caps for Heathrow and Gatwick while deregulating Stansted.  Under the Civil Aviation Act 2012 for the economic regulation of UK airport operators, the CAA conducts market power assessments (MPA) to judge their power within the aviation market and whether they need to be regulated. (....) As expected, the CAA’s price review published on January 10 requires Heathrow and Gatwick to continue their regulated status, though Stansted has been de-regulated, giving operator MAG the power to determine what levies are necessary.

Although the CAA had previously said Heathrow would be allowed to increase its charges in line with inflation, Heathrow and Gatwick’s price rises will be limited to 1.5% below the rate of inflation from April 1.  These rules will run until December 31, 2018, for Heathrow and until March 31, 2021 for Gatwick.  (....) CAA's Chair, Dame Deidre Hutton commented: “[Passengers] will see prices fall, whilst still being able to look forward to high service standards, thanks to a robust licensing regime.” Heathrow has stated the CAA’s price caps will result in its per passenger airline charges falling in real terms from £20.71 in 2013/14 to £19.10 in 2018/19. (....)


"


"The CAPA Airport Construction and Capex database presently has over USD385 billion of projects indicated globally, led by Asia with just over USD115 billion of projects either in progress or planned for and with a good chance of completion. China, with 69 regional airports to be constructed by 2015, is the most active, adding to the existing 193. But some Asian countries, notably India and Indonesia, each with extended near-or more than double digit growth, are lagging badly in introducing new infrastructure.

The Middle East is also undertaking major investment, notably in the Gulf airports, as the world-changing operations of its main airlines continue to expand rapidly. But Saudi Arabia and Oman are also embarked on major expansions.

Istanbul's new airport starts to take shape in 2014, with completion of the world's biggest facility due to be completed by 2019. Meanwhile, in Brazil, the race is on to have sufficient capacity in place for the football world cup, due to commence in Jun-2014. (....)

"

I would like the output to look like this:

"The UK’s Civil Aviation Authority (CAA) has announced new passenger charge caps for Heathrow and Gatwick while deregulating Stansted.  Under the Civil Aviation Act 2012 for the economic regulation of UK airport operators, the CAA conducts market power assessments (MPA) to judge their power within the aviation market and whether they need to be regulated. (....) As expected, the CAA’s price review published on January 10 requires Heathrow and Gatwick to continue their regulated status, though Stansted has been de-regulated, giving operator MAG the power to determine what levies are necessary. Although the CAA had previously said Heathrow would be allowed to increase its charges in line with inflation, Heathrow and Gatwick’s price rises will be limited to 1.5% below the rate of inflation from April 1.  These rules will run until December 31, 2018, for Heathrow and until March 31, 2021 for Gatwick.  (....) CAA's Chair, Dame Deidre Hutton commented: “[Passengers] will see prices fall, whilst still being able to look forward to high service standards, thanks to a robust licensing regime.” Heathrow has stated the CAA’s price caps will result in its per passenger airline charges falling in real terms from £20.71 in 2013/14 to £19.10 in 2018/19. (....)"


"The CAPA Airport Construction and Capex database presently has over USD385 billion of projects indicated globally, led by Asia with just over USD115 billion of projects either in progress or planned for and with a good chance of completion. China, with 69 regional airports to be constructed by 2015, is the most active, adding to the existing 193. But some Asian countries, notably India and Indonesia, each with extended near-or more than double digit growth, are lagging badly in introducing new infrastructure.The Middle East is also undertaking major investment, notably in the Gulf airports, as the world-changing operations of its main airlines continue to expand rapidly. But Saudi Arabia and Oman are also embarked on major expansions.Istanbul's new airport starts to take shape in 2014, with completion of the world's biggest facility due to be completed by 2019. Meanwhile, in Brazil, the race is on to have sufficient capacity in place for the football world cup, due to commence in Jun-2014. (....)"

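1 answer:

Answer 0: (score: 2)

The answer to your question is that re.sub is being applied to the string 'Summary_csv.csv', not to the file. It expects a string as its third argument and performs the substitution on that string.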
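As a minimal sketch of that point: re.sub returns a new string and never opens a file, so the file's contents have to be read into a string before any substitution can affect them. The sample text below is made up for illustration and stands in for what reading the file would return.

```python
import re

# re.sub treats its third argument as the text to search, not as a path.
# The file name contains no "\r\n\r\n", so it comes back unchanged.
result = re.sub('\r\n\r\n', '', 'Summary_csv.csv')
print(result)  # Summary_csv.csv

# To change file contents, the contents themselves must be the third argument.
# (Illustrative text standing in for what file.read() would return.)
contents = 'first paragraph\r\n\r\nsecond paragraph'
cleaned = re.sub('\r\n\r\n', ' ', contents)
print(cleaned)  # first paragraph second paragraph
```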

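In the second piece of code, you open the file and read it one line at a time. That means no line will ever contain two newlines. Two newlines in a row result in two consecutive lines being returned from the input file, the second of which is blank.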
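This can be checked without touching a real file. The sketch below uses io.StringIO as a stand-in for an open text file, with made-up content:

```python
import io

# Illustrative stand-in for a text file with a blank line between paragraphs.
sample = io.StringIO('first paragraph\n\nsecond paragraph\n')

lines = list(sample)
print(lines)  # ['first paragraph\n', '\n', 'second paragraph\n']

# No single line holds two newlines, so a '\n\n' pattern can never match.
print(any('\n\n' in line for line in lines))  # False
```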

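To remove the extra newlines, simply test for a blank line and do not write it to output. Calling line.strip() on a blank line (one containing only whitespace characters) returns an empty string, which evaluates to False in an if statement. If line.strip() is not empty, write the line to the output file.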

output = open("Summary.csv","w", encoding="utf8")
infile = open("Summary_csv.csv", encoding="utf8")

for line in infile:
    if line.strip():
        output.write(line)

infile.close()
output.close()

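Note: Python handles text files in a platform-independent way and converts line endings to '\n' by default, so testing for '\r\n' would not work even if there were no other problems. If you really want the line endings to be '\r\n', you must specify newline='\r\n' when calling open() for the input file. See the documentation at https://docs.python.org/3/library/functions.html#open for a full explanation.

Part two

Judging by the sample input and output posted by the OP, the problem appears to be more complicated than stripping extra newlines. The following code reads the input file, finds the text between pairs of " characters, and combines all of its lines onto a single line in the output file. Apart from the initial collapsing of double newlines, extra newlines that are not inside a " pair are sent to the output file unchanged.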
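A small sketch of the difference, assuming a throwaway temporary file with Windows-style line endings (the file and its content are made up for illustration):

```python
import os
import tempfile

# Write a small file with Windows-style line endings (illustrative content).
fd, path = tempfile.mkstemp(suffix='.csv')
with os.fdopen(fd, 'wb') as f:
    f.write(b'a\r\n\r\nb\r\n')

# Default text mode: '\r\n' is translated to '\n' on reading.
with open(path, encoding='utf8') as f:
    default_text = f.read()

# newline='\r\n': lines end only at '\r\n' and are returned untranslated.
with open(path, encoding='utf8', newline='\r\n') as f:
    raw_text = f.read()

os.remove(path)

print(repr(default_text))  # 'a\n\nb\n'
print(repr(raw_text))      # 'a\r\n\r\nb\r\n'
```

In the default mode a pattern such as '\r\n\r\n' has nothing to match, because the carriage returns are already gone by the time the text reaches your code.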

import re
outfile = open("Summary.csv", "w", encoding="utf8")
infile = open("Summary_csv.csv", encoding="utf8")

text = infile.read()
text = re.sub('\n\n', '\n', text)  # collapse double newlines
for p in re.split('(".+?")', text, flags=re.DOTALL):
    if p:  # skip empty matches
        if p.strip():  # a quoted paragraph; it should become a single line
            p = p[1:-1]  # everything between the quotes
            p = p.strip()  # remove leading and trailing whitespace
            p = re.sub('\n+', '  ', p)  # replace any remaining \n with two spaces
            p = '"' + p + '"\n'  # restore the quotes around the paragraph and add a newline
        outfile.write(p)

infile.close()
outfile.close()
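As a possible alternative, and not the approach used above, the standard csv module can parse quoted fields that span several lines. The sketch below assumes each record is a single quoted field, uses made-up sample text held in io.StringIO instead of the real files, and joins the lines with a single space. Unlike the code above, it drops the blank lines between records.

```python
import csv
import io

# Illustrative input: two quoted one-field records that span several lines.
raw = '"first line\n\nsecond line\n\n"\n\n"third line\n\nfourth line\n"\n'

reader = csv.reader(io.StringIO(raw, newline=''))
out = io.StringIO(newline='')
writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator='\n')

for row in reader:
    if not row:  # blank lines between records come back as empty rows
        continue
    # split() with no arguments splits on any whitespace run, newlines included
    writer.writerow([' '.join(field.split()) for field in row])

print(out.getvalue())
```

With real files, the csv documentation recommends opening them with newline='' so the module can handle line endings itself.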