I always use univocity-parsers in my Java programs to compare CSV files. It works well and is fast.
The problem is that this time I am trying to parse two different large CSV files with complex values and print the differences to a new CSV file.
Following one of the author's examples, I tried using processFile after reading file1 into a list and then converting it to a map, but I still get errors while parsing.
Below are my sample input files and the expected output file.
INPUT - file1
"h1","h2","h3","h4","h5"
"00000","US","9503.00.0089","USA","9503.0089"
"","EU","9503.00.7000","EUROPEAN UNION","9503.00.7000"
"#1200","US","5601.22.0010","USA","5601.22.0010"
"0180691","US","9503.00.0073","USA","9503.00.0073"
"DRTY01","CA","9603.01.0088","CAN","9603.01.0088"
INPUT - file2
"h1","h2","h3","h6","h7","h8","h9","h10",h11
"018890","US","","2015","101","1","1","All",””
"00000","US","9503.00.0090","1986","101","1","1","All","9503.00.0090"
"0180691","US","9503.00.0073","2019","101","1","1","All","9503.00.0073”
“DRTY01”,”CA”,”9603.01.0087”,”2002”,”102”,”1”,”2”,”CA”, “9603.01.0087”
Match rows between file1 and file2 on their common h1 and h2 values, then compare file1's h3 with file2's h3. Whenever the two h3 values are not equal, I want to print "h1","h4","h10","h5","h11","h6","h7","h8","h9" to file3.
OUTPUT - file3
"h1","h4","h10","h5","h11","h6","h7","h8","h9"
"00000","USA","All","9503.00.0089","9503.00.0090","1986","101","1","1"
"DRTY01","CAN","CA","9603.01.0088","9603.01.0087","2002","102","1","2"
Answer 0 (score: 2)
I have a solution, but please regression-test it. My assumption is that the h1 and h2 combination is a unique key. I create a HashMap with the created class (HeaderList) as the key and the entire CSV row as the value. We override the hashCode and equals methods of that class as follows:
The equals logic is: if h1 and h2 are the same in both map1 and map2 while h3 differs, give me the rows from map1 and map2. This approach uses extra space for the maps, but the overall computation drops to O(N). The code below gives you the required rows. I have not done the IO and exception handling properly; please handle those accordingly.
Test class:
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.Reader;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import com.univocity.parsers.common.processor.RowListProcessor;
import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;

public class UnivocityTest
{
    public static void main(String[] args) throws FileNotFoundException
    {
        // Get data from csv file1
        List<String[]> f1 = getData("example.csv");
        // Get data from csv file2
        List<String[]> f2 = getData("example1.csv");
        // Convert data to a Map keyed by HeaderList, with the entire row as the value.
        Map<HeaderList, String[]> map1 = convertAndReturn(f1);
        Map<HeaderList, String[]> map2 = convertAndReturn(f2);
        // Currently prints the required rows.
        compareData(map1, map2);
    }

    // Parse a csv file into a List<String[]>, one String[] per row.
    private static List<String[]> getData(String file) throws FileNotFoundException
    {
        CsvParserSettings parserSettings = new CsvParserSettings();
        parserSettings.setLineSeparatorDetectionEnabled(true);
        RowListProcessor rowProcessor = new RowListProcessor();
        parserSettings.setProcessor(rowProcessor);
        parserSettings.setHeaderExtractionEnabled(true);
        CsvParser parser = new CsvParser(parserSettings);
        parser.parse(getReader(file));
        // String[] headers = rowProcessor.getHeaders();
        return rowProcessor.getRows();
    }
    // Get a Reader for the given file path.
    private static Reader getReader(String string) throws FileNotFoundException
    {
        // TODO Add proper file handling and exception handling
        return new FileReader(new File(string));
    }
    // Build a HashMap keyed by HeaderList, with the whole row as the value.
    private static Map<HeaderList, String[]> convertAndReturn(List<String[]> f1)
    {
        Map<HeaderList, String[]> map = new HashMap<>();
        for (String[] each : f1)
        {
            // For each row in the csv, create a HeaderList key from h1, h2 and h3,
            // and store the full row as the value.
            HeaderList header = new HeaderList(each[0], each[1], each[2]);
            map.put(header, each);
        }
        return map;
    }
    private static void compareData(Map<HeaderList, String[]> map1, Map<HeaderList, String[]> map2)
    {
        // Iterate over the map1 keys one by one. For each key we check if there is a
        // matching key in map2. The matching condition is: h1 and h2 are the same while
        // h3 is different. Once such a key is found, I currently print both rows; here
        // you can pull whichever columns you want from the maps and return them.
        for (HeaderList each : map1.keySet())
        {
            if (map2.containsKey(each))
            {
                // TODO Assume you want columns h3, h4 from file1 and h6, h7 from file2.
                // map1 represents file1, with columns h3 and h4 at positions 2 and 3 of the String[].
                // map2 represents file2, with columns h6 and h7 at positions 3 and 4 of the String[].
                String h3FromFile1 = map1.get(each)[2];
                String h4FromFile1 = map1.get(each)[3];
                String h6FromFile2 = map2.get(each)[3];
                String h7FromFile2 = map2.get(each)[4];
                System.out.println("Required Columns: ");
                System.out.println("h3 file1: " + h3FromFile1);
                System.out.println("h4 file1: " + h4FromFile1);
                System.out.println("h6 file2: " + h6FromFile2);
                System.out.println("h7 file2: " + h7FromFile2);
                System.out.println(Arrays.toString(map1.get(each)));
                System.out.println(Arrays.toString(map2.get(each)));
                System.out.println("-------------------------------");
            }
        }
    }
}
The bean class with the three columns h1, h2, h3:
class HeaderList
{
    private String h1;
    private String h2;
    private String h3;

    public HeaderList(String h1, String h2, String h3)
    {
        super();
        this.h1 = h1;
        this.h2 = h2;
        this.h3 = h3;
    }

    /**
     * Generates the same hash for rows with equal h1 and h2; h3 is deliberately
     * excluded so rows differing only in h3 land in the same hash bucket.
     *
     * @inheritDoc
     */
    @Override
    public int hashCode()
    {
        final int prime = 31;
        int result = 1;
        result = prime * result + ((h1 == null) ? 0 : h1.hashCode());
        result = prime * result + ((h2 == null) ? 0 : h2.hashCode());
        return result;
    }

    /**
     * The equals method assumes each csv row is uniquely identified by h1 and h2
     * combined. Note that if h1 and h2 do not uniquely identify a row, this may
     * lead to data loss. The h3 comparison is deliberately inverted: two keys are
     * "equal" only when their h3 values differ.
     *
     * @inheritDoc
     */
    @Override
    public boolean equals(Object obj)
    {
        if (this == obj)
            return true;
        if (obj == null)
            return false;
        if (getClass() != obj.getClass())
            return false;
        HeaderList other = (HeaderList) obj;
        if (h1 == null)
        {
            if (other.h1 != null)
                return false;
        }
        else if (!h1.equals(other.h1))
            return false;
        if (h2 == null)
        {
            if (other.h2 != null)
                return false;
        }
        else if (!h2.equals(other.h2))
            return false;
        // Inverted check: identical h3 values mean "no difference", so no match.
        if (h3 == null)
        {
            if (other.h3 == null)
                return false;
        }
        else if (h3.equals(other.h3))
            return false;
        return true;
    }

    /**
     * @inheritDoc
     */
    @Override
    public String toString()
    {
        return "HeaderList [h1=" + h1 + ", h2=" + h2 + ", h3=" + h3 + "]";
    }
}
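One caution on this design choice: HeaderList deliberately breaks the usual equals contract (for example, a.equals(a) short-circuits to true through the reference check even though two keys with identical h3 are defined as unequal, and transitivity does not hold). It works here because hashCode ignores h3, so candidate keys still collide in the same bucket, but the class should only be used for this specific lookup trick, never as a general-purpose map key.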
Output for the given input csv files:
Required Columns:
h3 file1: 9603.01.0088
h4 file1: CAN
h6 file2: 2002
h7 file2: 102
[DRTY01, CA, 9603.01.0088, CAN, 9603.01.0088]
[DRTY01, CA, 9603.01.0087, 2002, 102, 1, 2, CA, 9603.01.0087]
-------------------------------
Required Columns:
h3 file1: 9503.00.0089
h4 file1: USA
h6 file2: 1986
h7 file2: 101
[00000, US, 9503.00.0089, USA, 9503.0089]
[00000, US, 9503.00.0090, 1986, 101, 1, 1, All, 9503.00.0090]
-------------------------------
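The code above only prints to the console. To actually produce the file3 from the question, the matched rows can be fed to univocity's CsvWriter in the requested column order instead of being printed. Below is a minimal sketch; the file name file3.csv and the hard-coded sample rows are placeholders of mine, and the index mapping assumes the column positions shown in the sample files (file1: h1..h5 at indices 0..4; file2: h1,h2,h3,h6..h11 at indices 0..8).

import java.io.File;

import com.univocity.parsers.csv.CsvWriter;
import com.univocity.parsers.csv.CsvWriterSettings;

public class DiffWriter
{
    public static void main(String[] args)
    {
        CsvWriterSettings settings = new CsvWriterSettings();
        settings.setQuoteAllFields(true); // the sample files quote every value

        CsvWriter writer = new CsvWriter(new File("file3.csv"), settings);
        writer.writeHeaders("h1", "h4", "h10", "h5", "h11", "h6", "h7", "h8", "h9");

        // For each matching key found in compareData, pass both rows here instead
        // of printing them. Shown with hard-coded sample rows for illustration:
        String[] row1 = {"00000", "US", "9503.00.0089", "USA", "9503.0089"};
        String[] row2 = {"00000", "US", "9503.00.0090", "1986", "101", "1", "1", "All", "9503.00.0090"};
        writer.writeRow(row1[0],  // h1
                        row1[3],  // h4  from file1
                        row2[7],  // h10 from file2
                        row1[4],  // h5  from file1
                        row2[8],  // h11 from file2
                        row2[3],  // h6  from file2
                        row2[4],  // h7  from file2
                        row2[5],  // h8  from file2
                        row2[6]); // h9  from file2

        writer.close(); // flushes and closes file3.csv
    }
}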