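I have a very large .txt file, close to 2 GB. First I tried this:
pd.read_csv("large_text_file.txt", header=0, delim_whitespace=True)
It keeps raising this error:
Error tokenizing data. C error: Expected 32 fields in line 3, saw 36
Then I tried this:
pd.read_csv("wspace.csv", header=0, sep=r"\s+")
But because the file is not technically a CSV, and some fields such as "name" contain spaces, the output is a mess. Using pandas' error_bad_lines=False is not ideal either, because it simply skips all of the rows for this particular table. Is there a way to format a large txt file as CSV? I know you can change the extension from .txt to .csv, but without commas the output is still poor. Here is some sample data to illustrate my pain.
Instead of displaying something like this: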
╔══════════════════════════════════╦═════════╦════════════════════════╗
║ Col1 ║ Col2 ║ Col3 ║
╠══════════════════════════════════╬═════════╬════════════════════════╣
║ Value 1 ║ Value 2 ║ 123 ║
║ Separate ║ cols ║ with a tab or 4 spaces ║
║ This is a row with only one cell ║ ║ ║
╚══════════════════════════════════╩═════════╩════════════════════════╝
it displays this, which causes the tokenizing error:
╔════════════════╦════════════════════╦════════════════════════╦═══╦══╦═════╗
║ Col1 Col2 Col3 ║ ║ ║ ║ ║ ║
╠════════════════╬════════════════════╬════════════════════════╬═══╬══╬═════╣
║ Value ║ 1 ║ Value ║ 2 ║ ║ 123 ║
║ Separate ║ cols ║ with a tab or 4 spaces ║ ║ ║ ║
║ This is a row ║ with only one cell ║ ║ ║ ║ ║
╚════════════════╩════════════════════╩════════════════════════╩═══╩══╩═════╝
Answer 0: (score: 0)
Try using pd.read_fwf.
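pd.read_fwf splits each line by character position rather than by whitespace, so a value such as "Value 1" stays in one cell. What follows is only a minimal sketch, assuming the real file pads its columns to fixed widths. The WIDTHS values, the in-memory sample rows, and the fwf_to_csv helper are hypothetical and are not taken from the question; the actual widths have to be measured from the file itself.

```python
import io

import pandas as pd

# Hypothetical column widths (in characters); measure the real ones from the file.
WIDTHS = [34, 9, 24]

# Small in-memory stand-in for the 2 GB file, padded to the widths above.
rows = [
    ("Col1", "Col2", "Col3"),
    ("Value 1", "Value 2", "123"),
    ("Separate", "cols", "with a tab or 4 spaces"),
    ("This is a row with only one cell", "", ""),
]
sample = "\n".join(
    "".join(cell.ljust(w) for cell, w in zip(row, WIDTHS)).rstrip()
    for row in rows
) + "\n"


def fwf_to_csv(src, dst, widths, chunksize=100_000):
    """Stream a fixed-width source into CSV one chunk at a time (hypothetical helper)."""
    reader = pd.read_fwf(src, widths=widths, header=0, chunksize=chunksize)
    for i, chunk in enumerate(reader):
        # Write the header only once, with the first chunk.
        chunk.to_csv(dst, header=(i == 0), index=False)


# Read everything at once (fine for the small sample).
df = pd.read_fwf(io.StringIO(sample), widths=WIDTHS, header=0)
print(df)

# Convert in chunks, which keeps memory bounded for a large file.
out = io.StringIO()
fwf_to_csv(io.StringIO(sample), out, WIDTHS, chunksize=2)
csv_text = out.getvalue()
print(csv_text)
```

For the real file, a path such as "large_text_file.txt" can be passed as src and an open file handle as dst. If the columns turn out not to be aligned to fixed positions, read_fwf will not help and a different separator rule would be needed.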