How to filter strings by substring in Python

Date: 2013-07-09 07:57:36

Tags: python csv substring

I'm a Python newbie coming from the Java world.

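  1. I'm trying to write a simple Python function that prints only the data rows of a CSV or "arff" file. Non-data rows start with one of these 3 patterns: @, [@, [%. Such rows should not be printed.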

  2. Excerpt of a sample data file:

    % 1. Title: Iris Plants Database
    % 
    % 2. Sources:
    
    %      (a) Creator: R.A. Fisher
    %      (b) Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    %      (c) Date: July, 1988
    
    @RELATION iris
    
    @ATTRIBUTE sepallength  REAL
    @ATTRIBUTE sepalwidth   REAL
    @ATTRIBUTE petallength  REAL
    @ATTRIBUTE petalwidth   REAL
    @ATTRIBUTE class    {Iris-setosa,Iris-versicolor,Iris-virginica}
    
    @DATA
    5.1,3.5,1.4,0.2,Iris-setosa
    4.9,3.0,1.4,0.2,Iris-setosa
    4.7,3.2,1.3,0.2,Iris-setosa
    4.6,3.1,1.5,0.2,Iris-setosa
    5.0,3.6,1.4,0.2,Iris-setosa
    5.4,3.9,1.7,0.4,Iris-setosa
    
  3. Python script:

    import csv
    def loadCSVfile (path):
        csvData = open(path, 'rb') 
        spamreader = csv.reader(csvData, delimiter=',', quotechar='|')
        for row in spamreader:
            if row.__len__ > 0:
                #search the string from index 0 to 2 and if these substrings(@ ,'[\'%' , '[\'@') are not found, than print the row
                if (str(row).find('@',0,1) & str(row).find('[\'%',0,2) & str(row).find('[\'@',0,2) != 1):
                    print str(row)
    
    loadCSVfile('C:/Users/anaim/Desktop/Data Mining/OneR/iris.arff')
    

    Actual output:

    ['% 1. Title: Iris Plants Database']
    ['% ']
    ['% 2. Sources:']
    ['%      (a) Creator: R.A. Fisher']
    ['%      (b) Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)']
    ['%      (c) Date: July', ' 1988']
    ['% ']
    []
    ['@RELATION iris']
    []
    ['@ATTRIBUTE sepallength\tREAL']
    ['@ATTRIBUTE sepalwidth \tREAL']
    ['@ATTRIBUTE petallength \tREAL']
    ['@ATTRIBUTE petalwidth\tREAL']
    ['@ATTRIBUTE class \t{Iris-setosa', 'Iris-versicolor', 'Iris-virginica}']
    []
    ['@DATA']
    ['5.1', '3.5', '1.4', '0.2', 'Iris-setosa']
    ['4.9', '3.0', '1.4', '0.2', 'Iris-setosa']
    ['4.7', '3.2', '1.3', '0.2', 'Iris-setosa']
    ['4.6', '3.1', '1.5', '0.2', 'Iris-setosa']
    ['5.0', '3.6', '1.4', '0.2', 'Iris-setosa']
    ['5.4', '3.9', '1.7', '0.4', 'Iris-setosa']
    ['4.6', '3.4', '1.4', '0.3', 'Iris-setosa']
    ['5.0', '3.4', '1.5', '0.2', 'Iris-setosa']
    

    Expected output:

    ['5.1', '3.5', '1.4', '0.2', 'Iris-setosa']
    ['4.9', '3.0', '1.4', '0.2', 'Iris-setosa']
    ['4.7', '3.2', '1.3', '0.2', 'Iris-setosa']
    ['4.6', '3.1', '1.5', '0.2', 'Iris-setosa']
    ['5.0', '3.6', '1.4', '0.2', 'Iris-setosa']
    ['5.4', '3.9', '1.7', '0.4', 'Iris-setosa']
    ['4.6', '3.4', '1.4', '0.3', 'Iris-setosa']
    ['5.0', '3.4', '1.5', '0.2', 'Iris-setosa']
    

2 answers:

Answer 0 (score: 2)

To test whether a row is empty, just use it in a boolean context; an empty list is falsy.

To test whether a string starts with certain characters, use str.startswith(), which accepts either a single string or a tuple of strings:

import csv
def loadCSVfile (path):
    with open(path, 'rb') as csvData:
        spamreader = csv.reader(csvData, delimiter=',', quotechar='|')
        for row in spamreader:
            if row and not row[0].startswith(('%', '@')):
                print row
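
To try the startswith() filter without the original file, here is a minimal, self-contained sketch. It assumes Python 3, where csv.reader expects text rather than a file opened with 'rb'. The names SAMPLE and data_rows are hypothetical, and the sample is a shortened copy of the excerpt from the question.

```python
import csv
import io

# Shortened copy of the ARFF excerpt from the question (assumption: Python 3).
SAMPLE = """\
% 1. Title: Iris Plants Database
% (c) Date: July, 1988

@RELATION iris
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}

@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
"""


def data_rows(lines):
    """Yield only rows that are non-empty and do not start with '%' or '@'."""
    for row in csv.reader(lines, delimiter=',', quotechar='|'):
        # An empty list (blank line) is falsy, so `row` alone skips it.
        if row and not row[0].startswith(('%', '@')):
            yield row


rows = list(data_rows(io.StringIO(SAMPLE)))
for row in rows:
    print(row)
```

Only the two numeric rows should remain. The comment line containing a comma is split into two fields by the reader, but its first field still starts with '%', so it is dropped as well.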

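Because you are really testing against fixed-width strings, you could also just slice the first column and test it against a sequence with in; a set is the most efficient choice: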

def loadCSVfile (path):
    ignore = {'@', '%'}
    with open(path, 'rb') as csvData:
        spamreader = csv.reader(csvData, delimiter=',', quotechar='|')
        for row in spamreader:
            if row and row[0][:1] not in ignore:
                print row

Here the [:1] slice notation returns the first character of the row[0] column (or an empty string if the first column is empty).
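
The difference between slicing and indexing matters for empty fields. A small sketch of that behaviour (the helper name first_char is hypothetical):

```python
def first_char(field):
    """Return the first character of field, or '' if field is empty."""
    return field[:1]


print(repr(first_char('@DATA')))  # '@'
print(repr(first_char('')))       # ''

# Indexing, by contrast, fails on an empty string.
try:
    ''[0]
except IndexError as exc:
    print('IndexError:', exc)
```

Since '' is not in the set of ignored characters, a row with an empty first column is kept rather than crashing the loop.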

I used the open file object as a context manager (with ... as ...) so that Python automatically closes the file when the block finishes (or raises an exception).
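
The closing behaviour can be checked with a short sketch. It writes to a temporary file so that it is self-contained; the simulated ValueError and the variable names are assumptions made for illustration only.

```python
import os
import tempfile

# Create a throwaway file so the example does not depend on iris.arff.
fd, path = tempfile.mkstemp(suffix='.arff')
os.close(fd)

try:
    with open(path, 'w') as handle:
        handle.write('@DATA\n')
        # Simulate a failure in the middle of the block.
        raise ValueError('simulated failure inside the block')
except ValueError:
    pass

# The name `handle` is still bound, but the file object has been closed.
closed_after_error = handle.closed
print(closed_after_error)  # True

os.remove(path)
```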

You shouldn't call double-underscore methods ("dunder" or special methods) directly; the proper API call is len(row).
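
To illustrate the point about len(): in the question's script, row.__len__ is written without parentheses, so it is a bound method object rather than a number and the method is never called. A minimal sketch, assuming Python 3, where comparing a method with an int raises TypeError (Python 2 instead returns a result silently, which would explain why the empty lists in the actual output were not filtered):

```python
row = []

# Without parentheses this is a bound method object, not a length.
method = row.__len__
print(callable(method))  # True

# Python 3 refuses to order a method against an int.
try:
    method > 0
    comparison_failed = False
except TypeError:
    comparison_failed = True
print(comparison_failed)  # True

# The supported spellings: len(row), or simply the truthiness of the list.
print(len(row) > 0)  # False
print(bool(row))     # False
```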

Answer 1 (score: 0)

I would make use of the in operator and a Python list comprehension.

Here is what I mean:

import csv

def loadCSVfile (path):
    exclusions = ['@', '%', '\n', '[@' , '[%']
    # Open the file in a with block so it is closed once the rows are read.
    with open(path, 'r') as csvData:
        spamreader = csv.reader(csvData, delimiter=',', quotechar='|')
        lines = [line for line in spamreader if ( line and line[0][0:1] not in exclusions and line[0][0:2] not in exclusions )]

    for line in lines:
        print(line)


loadCSVfile('C:/Users/anaim/Desktop/Data Mining/OneR/iris.arff')
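
As a variation on the same idea, here is a sketch that applies the list comprehension to any iterable of lines, so it can be tried without the file. It assumes Python 3; the names filter_rows, EXCLUDED and sample are hypothetical. Because line[0] is the first field itself rather than str(line), only the single characters '@' and '%' should need to be excluded for input like the sample above.

```python
import csv

# First characters that mark a non-data line (assumption: ARFF-style input).
EXCLUDED = {'@', '%'}


def filter_rows(lines):
    """Return the parsed rows whose first field does not start with '@' or '%'."""
    reader = csv.reader(lines, delimiter=',', quotechar='|')
    # `row` is falsy for blank lines; row[0][:1] is '' for an empty first field.
    return [row for row in reader if row and row[0][:1] not in EXCLUDED]


sample = ['% comment\n', '\n', '@DATA\n', '5.1,3.5,1.4,0.2,Iris-setosa\n']
print(filter_rows(sample))
```

Passing an open file object instead of the list works the same way, since csv.reader accepts any iterable that yields lines of text.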