Parsing a large txt file with Pandas

Time: 2015-10-23 22:45:04

Tags: python csv pandas

I have a very large .txt file, close to 2 GB. First I tried pd.read_csv("large_text_file.txt", header=0, delim_whitespace=True), which keeps throwing this error: Error tokenizing data. C error: Expected 32 fields in line 3, saw 36

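Then I tried pd.read_csv("wspace.csv", header=0, sep=r"\s+"), but because the file is technically not a CSV and some values, such as names, contain spaces, the output was very poor. Using pandas' error_bad_lines=False is not ideal either, because it simply skips all the rows of that particular table. Is there a way to format a large txt file as CSV? I know the extension can be changed from .txt to .csv, but without commas the output is still bad. Here is some sample data to illustrate my pain.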

Instead of displaying something like this:

╔══════════════════════════════════╦═════════╦════════════════════════╗
║ Col1                             ║ Col2    ║ Col3                   ║
╠══════════════════════════════════╬═════════╬════════════════════════╣
║ Value 1                          ║ Value 2 ║ 123                    ║
║ Separate                         ║ cols    ║ with a tab or 4 spaces ║
║ This is a row with only one cell ║         ║                        ║
╚══════════════════════════════════╩═════════╩════════════════════════╝

it displays this, which leads to the tokenizing error:

╔════════════════╦════════════════════╦════════════════════════╦═══╦══╦═════╗
║ Col1 Col2 Col3 ║                    ║                        ║   ║  ║     ║
╠════════════════╬════════════════════╬════════════════════════╬═══╬══╬═════╣
║ Value          ║ 1                  ║ Value                  ║ 2 ║  ║ 123 ║
║ Separate       ║ cols               ║ with a tab or 4 spaces ║   ║  ║     ║
║ This is a row  ║ with only one cell ║                        ║   ║  ║     ║
╚════════════════╩════════════════════╩════════════════════════╩═══╩══╩═════╝
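The symptom can be reproduced on a tiny sample. The sketch below assumes, based only on the sample table's hint, that columns are separated by a tab or by four or more spaces; the `lines` list is made up for illustration and is not taken from the real file. It compares the number of fields per line when splitting on any whitespace against splitting on the wider separator.

```python
import re

# Hypothetical sample lines; the real file's layout is not shown in the question.
lines = [
    "Col1\tCol2\tCol3",
    "Value 1\tValue 2\t123",
    "Separate    cols    with a tab or 4 spaces",
]

# Splitting on any run of whitespace breaks values that contain spaces,
# so each line ends up with a different number of fields.
any_ws = [len(line.split()) for line in lines]

# Splitting only on a tab or on 4+ consecutive spaces keeps such values whole.
wide = [len(re.split(r"\t| {4,}", line)) for line in lines]

print(any_ws)  # field counts differ from line to line
print(wide)    # every line has the same number of fields
```

Uneven field counts like those in the first list are what make the tokenizer report "Expected N fields, saw M".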

1 answer:

Answer 0: (score: 0)

Try using pd.read_fwf
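A minimal sketch of how pd.read_fwf might be applied, assuming the file really is laid out in fixed-width columns (the question does not confirm this). The `rows` and `widths` values are invented for illustration, and the input and output are in-memory buffers standing in for the real files. Reading with `chunksize` and appending each chunk to a CSV keeps memory use bounded for a file of around 2 GB.

```python
import io
import pandas as pd

# Hypothetical fixed-width sample; the widths are assumptions, not taken from the real file.
rows = [
    ("Col1", "Col2", "Col3"),
    ("Value 1", "Value 2", "123"),
    ("Separate", "cols", "with a tab or 4 spaces"),
    ("This is a row with only one cell", "", ""),
]
widths = [34, 9, 22]
text = "\n".join(
    "".join(cell.ljust(w) for cell, w in zip(row, widths)) for row in rows
) + "\n"

# Read everything at once: values containing spaces stay in a single column.
df = pd.read_fwf(io.StringIO(text), widths=widths, header=0)
print(df)

# Read in chunks and append each chunk to one CSV output,
# writing the header only for the first chunk.
out = io.StringIO()
reader = pd.read_fwf(io.StringIO(text), widths=widths, header=0, chunksize=2)
for i, chunk in enumerate(reader):
    chunk.to_csv(out, index=False, header=(i == 0))
csv_text = out.getvalue()
print(csv_text)
```

If the column widths are not known in advance, read_fwf can try to infer them from the first rows of the file (its default `colspecs="infer"` behaviour); whether that guess is right depends on how consistently the file is padded.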
