Question

I have a very large .txt file, close to 2 GB. First I tried pd.read_csv("large_text_file.txt", header=0, delim_whitespace=True), which keeps raising the error "Error tokenizing data. C error: Expected 32 fields in line 3, saw 36".

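Then I tried pd.read_csv("wspace.csv", header=0, sep=r"\s+"), but because the file is technically not a CSV and some values, such as those in the name column, contain spaces, the output is a mess. Using pandas error_bad_lines=False is not ideal either, because it simply skips all of the rows for this particular table. Is there a way to format a large txt file as CSV? I know you can change the extension from .txt to .csv, but without commas the output is still poor. Here is some sample data to illustrate my pain.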

Instead of displaying the following:

╔══════════════════════════════════╦═════════╦════════════════════════╗
║ Col1                             ║ Col2    ║ Col3                   ║
╠══════════════════════════════════╬═════════╬════════════════════════╣
║ Value 1                          ║ Value 2 ║ 123                    ║
║ Separate                         ║ cols    ║ with a tab or 4 spaces ║
║ This is a row with only one cell ║         ║                        ║
╚══════════════════════════════════╩═════════╩════════════════════════╝

it displays this, which leads to the tokenizing error:

╔════════════════╦════════════════════╦════════════════════════╦═══╦══╦═════╗
║ Col1 Col2 Col3 ║                    ║                        ║   ║  ║     ║
╠════════════════╬════════════════════╬════════════════════════╬═══╬══╬═════╣
║ Value          ║ 1                  ║ Value                  ║ 2 ║  ║ 123 ║
║ Separate       ║ cols               ║ with a tab or 4 spaces ║   ║  ║     ║
║ This is a row  ║ with only one cell ║                        ║   ║  ║     ║
╚════════════════╩════════════════════╩════════════════════════╩═══╩══╩═════╝
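To make the failure concrete, here is a small sketch of why splitting on any run of whitespace produces a different number of fields per row. The three sample lines are hypothetical, modeled on the tables above with four spaces between columns; they are not taken from the real file.

```python
import re

# Hypothetical lines modeled on the sample table: columns are separated by
# four spaces, but the values themselves contain single spaces.
lines = [
    "Col1    Col2    Col3",
    "Value 1    Value 2    123",
    "Separate    cols    with a tab or 4 spaces",
]

# Splitting on any whitespace run (what sep=r"\s+" does) also breaks values apart.
any_whitespace = [re.split(r"\s+", line.strip()) for line in lines]
counts_any = [len(fields) for fields in any_whitespace]

# Splitting only on a tab or on two or more spaces keeps single spaces inside values.
wide_gap = [re.split(r"\t| {2,}", line.strip()) for line in lines]
counts_wide = [len(fields) for fields in wide_gap]

print(counts_any)   # field count per line differs, hence the tokenizing error
print(counts_wide)  # field count per line is the same
```

If the real file does separate columns with a tab or several spaces, a stricter separator pattern like the second one may be worth trying; whether it fits depends on the actual layout of the 2 GB file, which is not shown here.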

Answer 1

Try using pd.read_fwf.
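A minimal sketch of how pd.read_fwf might be applied, assuming the columns really do start at fixed character offsets. The widths (34, 9, 24) are borrowed from the desired table in the question and the sample text is made up for illustration; the real file will need its own widths. The helper name fwf_to_csv is hypothetical, and reading in chunks is one possible way to keep memory use bounded for a file of about 2 GB.

```python
import io
import pandas as pd

# Assumed column widths, taken from the desired table in the question.
WIDTHS = [34, 9, 24]

# Hypothetical sample: each value is padded to its column's width.
rows = [
    ("Col1", "Col2", "Col3"),
    ("Value 1", "Value 2", "123"),
    ("Separate", "cols", "with a tab or 4 spaces"),
    ("This is a row with only one cell", "", ""),
]
sample = "".join(f"{a:<34}{b:<9}{c}\n" for a, b, c in rows)

# Parse by character position, so spaces inside a value do not split it.
df = pd.read_fwf(io.StringIO(sample), widths=WIDTHS, header=0)
print(df)


def fwf_to_csv(src, out, widths, chunksize=100_000):
    """Read a fixed-width source in chunks and write each chunk as CSV to `out`."""
    first = True
    for chunk in pd.read_fwf(src, widths=widths, header=0, chunksize=chunksize):
        # Write the header only once, with the first chunk.
        chunk.to_csv(out, header=first, index=False)
        first = False


# Demo on the in-memory sample; for a real file, pass a path as `src` and a
# handle from open("out.csv", "w", newline="") as `out`.
csv_buffer = io.StringIO()
fwf_to_csv(io.StringIO(sample), csv_buffer, WIDTHS, chunksize=2)
print(csv_buffer.getvalue())
```

If the widths are not known in advance, read_fwf can also try to infer the column boundaries from the first rows of the file, though that is less reliable when many cells are empty.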


Parsing a large txt file with Pandas
