Using read_csv with a regex delimiter

Date: 2018-06-17 08:33:19

Tags: python pandas

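I have been trying to read a text file, university_towns.txt, with read_csv. As the screenshot shows, when the file is parsed with a regular-expression delimiter (see the code below), I get this error: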

  

ParserError: Expected 2 fields in line 89, saw 3. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.

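Is there a way to work around this? It looks as if the double quotes are wrong in only one place. Please also explain why this happens. I also tried the quotechar parameter, but I do not understand how to use it.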

My code for reading the file is:

university_towns = pd.read_csv('university_towns.txt', sep=r"\s\(", engine='python', header=None)

university_towns.txt (excerpt):

Annville (Lebanon Valley College)[2]
Bethlehem (Lehigh University, Moravian College)
Bloomsburg (Bloomsburg University of Pennsylvania)[2]
Bradford (University of Pittsburgh at Bradford)
California (California University of Pennsylvania)[2]
Carlisle (Dickinson College)
Cecil B. Moore, Philadelphia, also known as "Templetown" (Temple University)
Clarion (Clarion University of Pennsylvania)[2]
Collegeville (Ursinus College)
Cresson (Mount Aloysius College)[2]
East Stroudsburg (East Stroudsburg University of Pennsylvania)[2]
Edinboro (Edinboro University of Pennsylvania)[2]
Erie (Gannon University, Mercyhurst College, Penn State Erie)
Gettysburg (Gettysburg College)[2]
Greensburg (Seton Hill University, University of Pittsburgh at Greensburg)
Grove City (Grove City College)[2]
Huntingdon (Juniata College)[2]
Indiana (Indiana University of Pennsylvania)[2]
Johnstown (University of Pittsburgh at Johnstown)
Kutztown (Kutztown University of Pennsylvania)[2]
Lancaster (Franklin & Marshall)
Carrollton (University of West Georgia)[2]*Dahlonega (North Georgia College & State University)[2]

Above I have pasted some lines from the text file. The last line shown is line 89.

2 answers:

Answer 0 (score: 0)

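Show us line 83; something else may be visible there. I think that line contains two "(" characters, that is, two matches of \s\(. At least, that is what the error message means. Another possibility is that there are odd characters in the line and the parser gets lost. I do not think the university names themselves are the cause. In any case, look at that line, and if the problem is not obvious, share it with us.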

Answer 1 (score: 0)

Your text sample does not reproduce the error. You can get a hint about which lines cause it by adding a few parameters to the call:

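pd.read_csv('university_towns.txt', sep=r"\s\(",
            engine='python', header=None,
            # Skip malformed lines and print a warning for each one.
            # In pandas 1.3 and later, use on_bad_lines='warn' instead.
            error_bad_lines=False, warn_bad_lines=True)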

See pandas.read_csv().

Running the following:

# https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

import pandas as pd

with open( 'university_towns.txt', "w") as f:
    f.write("""Annville (Lebanon Valley College)[2]
Bethlehem (Lehigh University, Moravian College)
Bloomsburg (Bloomsburg University of Pennsylvania)[2]
Carrollton (University of West Georgia)[2]*Dahlonega (North Georgia College & State University)[2]""")

university_towns = pd.read_csv('university_towns.txt', sep=r"\s\(", engine='python', header=None)

print(university_towns)

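will reproduce your error. The reason is that the last line (which looks as if it should have been two separate lines) contains two matches of the regex, so it is split into 3 columns, whereas every other line has only 2 columns. Hence the error.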
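To locate such lines without pandas, one option is to apply the same regex to each line and count the resulting fields. This is only a minimal sketch: the helper name find_bad_lines and the sample list are made up for illustration, and it uses only the standard re module.

```python
import re

# Same delimiter as in the question: one whitespace character followed by "(".
DELIMITER = re.compile(r"\s\(")


def find_bad_lines(lines, expected_fields=2):
    """Return (line_number, field_count) for every line that does not
    split into exactly expected_fields fields. Hypothetical helper."""
    bad = []
    for number, line in enumerate(lines, start=1):
        fields = DELIMITER.split(line.rstrip("\n"))
        if len(fields) != expected_fields:
            bad.append((number, len(fields)))
    return bad


sample = [
    "Annville (Lebanon Valley College)[2]",
    "Bethlehem (Lehigh University, Moravian College)",
    "Bloomsburg (Bloomsburg University of Pennsylvania)[2]",
    "Carrollton (University of West Georgia)[2]*Dahlonega (North Georgia College & State University)[2]",
]

print(find_bad_lines(sample))  # [(4, 3)]
```

For the real file, passing the open file object instead of the sample list should report the same line number that appears in the ParserError message.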

To fix this, split the offending line in two:

Annville (Lebanon Valley College)[2]
Bethlehem (Lehigh University, Moravian College)
Bloomsburg (Bloomsburg University of Pennsylvania)[2]
Carrollton (University of West Georgia)[2]
Dahlonega (North Georgia College & State University)[2]

and it will work.
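If editing the file by hand is not practical, the same repair could be done in code before parsing. The sketch below rests on an assumption taken only from the sample: a "*" directly after a closing "]" marks the spot where a line break was lost. The function name split_joined_entries is hypothetical.

```python
import re


def split_joined_entries(text):
    """Insert the missing newline between two entries joined by "*".

    Assumption (from the sample only): "]*" never occurs inside a
    legitimate entry, so it can safely be replaced by "]" + newline.
    """
    return re.sub(r"\]\*", "]\n", text)


raw = (
    "Annville (Lebanon Valley College)[2]\n"
    "Carrollton (University of West Georgia)[2]*"
    "Dahlonega (North Georgia College & State University)[2]"
)

repaired = split_joined_entries(raw)

# Split each repaired line at the first delimiter match only.
rows = [re.split(r"\s\(", line, maxsplit=1) for line in repaired.splitlines()]

for row in rows:
    print(row)
```

The repaired text can also be wrapped in io.StringIO and passed to pd.read_csv with the original sep and engine arguments, since read_csv accepts file-like objects.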