I have two CSV files. One is a master CSV file with about 500,000 records. The other is a daily CSV file with 50,000 records.
The daily CSV file is missing several columns that have to be pulled from the master CSV file.
For example:
Daily CSV file
id,name,city,zip,occupation
1,Jhon,Florida,50069,Accountant
Master CSV file
id,name,city,zip,occupation,company,exp,salary
1, Jhon, Florida, 50069, Accountant, AuditFirm, 3, $5000
What I want to do is read both files, match the records by ID, and, if the ID exists in the master file, fetch company, exp and salary and write them to a new CSV file.
How can I achieve this??
What I have done so far:
while (true) {
    line = bstream.readLine();
    lineMaster = bstreamMaster.readLine();

    if (line == null || lineMaster == null) {
        break;
    } else {
        while (lineMaster != null)
            readlineSplit = line.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)", -1);

        String splitId = readlineSplit[4];
        String[] readLineSplitMaster = lineMaster.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)", -1);
        String SplitIDMaster = readLineSplitMaster[13];
        System.out.println(splitId + "|" + SplitIDMaster);
        //System.out.println(splitId.equalsIgnoreCase(SplitIDMaster));
        if (splitId.equalsIgnoreCase(SplitIDMaster)) {
            String writeLine = readlineSplit[0] + "," + readlineSplit[1] + "," + readlineSplit[2] + "," + readlineSplit[3] + "," + readlineSplit[4] + "," + readlineSplit[5] + "," + readLineSplitMaster[15] + "," + readLineSplitMaster[16] + "," + readLineSplitMaster[17];
            System.out.println(writeLine);
            pstream.print(writeLine + "\r\n");
        }
    }
}
pstream.close();
fout.flush();
bstream.close();
bstreamMaster.close();
Answer 0 (score: 1)
First of all, your current parsing approach will be extremely slow. Use a dedicated CSV parsing library to speed things up. With uniVocity-parsers you can process the 500K records in less than a second. Here is how you can use it to solve your problem:
//opens the file for reading (using UTF-8 encoding)
private static Reader newReader(String pathToFile) {
    try {
        return new InputStreamReader(new FileInputStream(new File(pathToFile)), "UTF-8");
    } catch (Exception e) {
        throw new IllegalArgumentException("Unable to open file for reading at " + pathToFile, e);
    }
}

//creates a file for writing (using UTF-8 encoding)
private static Writer newWriter(String pathToFile) {
    try {
        return new OutputStreamWriter(new FileOutputStream(new File(pathToFile)), "UTF-8");
    } catch (Exception e) {
        throw new IllegalArgumentException("Unable to open file for writing at " + pathToFile, e);
    }
}
public static void main(String... args) {
    //First we parse the daily update file.
    CsvParserSettings settings = new CsvParserSettings();

    //here we tell the parser to read the CSV headers
    settings.setHeaderExtractionEnabled(true);

    //and to select ONLY the following columns.
    //This ensures rows with a fixed size will be returned in case some records come with less or more columns than anticipated.
    settings.selectFields("id", "name", "city", "zip", "occupation");

    CsvParser parser = new CsvParser(settings);

    //Here we parse all data into a list.
    List<String[]> dailyRecords = parser.parseAll(newReader("/path/to/daily.csv"));

    //And convert them to a map. ID's are the keys.
    Map<String, String[]> mapOfDailyRecords = toMap(dailyRecords);

    ... //we'll get back here in a second.
/* Converts a list of records to a map. Uses element at index 0 as the key */
private static Map<String, String[]> toMap(List<String[]> records) {
    HashMap<String, String[]> map = new HashMap<String, String[]>();
    for (String[] row : records) {
        //column 0 will always have an ID.
        map.put(row[0], row);
    }
    return map;
}
private static List<Object[]> processMasterFile(final Map<String, String[]> mapOfDailyRecords) {
    //we'll put the updated data here
    final List<Object[]> output = new ArrayList<Object[]>();

    //configures the parser to process only the columns you are interested in.
    CsvParserSettings settings = new CsvParserSettings();
    settings.setHeaderExtractionEnabled(true);
    settings.selectFields("id", "company", "exp", "salary");

    //All parsed rows will be submitted to the following RowProcessor. This way the bigger Master file won't
    //have all its rows stored in memory.
    settings.setRowProcessor(new AbstractRowProcessor() {
        @Override
        public void rowProcessed(String[] row, ParsingContext context) {
            // Incoming rows from MASTER will have the ID as index 0.
            // If the daily update map contains the ID, we'll get the daily row
            String[] dailyData = mapOfDailyRecords.get(row[0]);
            if (dailyData != null) {
                //We got a match. Let's join the data from the daily row with the master row.
                Object[] mergedRow = new Object[8];
                for (int i = 0; i < dailyData.length; i++) {
                    mergedRow[i] = dailyData[i];
                }
                for (int i = 1; i < row.length; i++) { //starts from 1 to skip the ID at index 0
                    mergedRow[i + dailyData.length - 1] = row[i];
                }
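                // mergedRow now holds: id, name, city, zip, occupation, company, exp, salary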
                output.add(mergedRow);
            }
        }
    });

    CsvParser parser = new CsvParser(settings);

    //the parse() method will submit all rows to the RowProcessor defined above.
    parser.parse(newReader("/path/to/master.csv"));

    return output;
}
    ... // getting back to the main method here

    //Now we process the master data and get a list of updates
    List<Object[]> updatedData = processMasterFile(mapOfDailyRecords);

    //And write the updated data to another file
    CsvWriterSettings writerSettings = new CsvWriterSettings();
    writerSettings.setHeaders("id", "name", "city", "zip", "occupation", "company", "exp", "salary");
    writerSettings.setHeaderWritingEnabled(true);
    CsvWriter writer = new CsvWriter(newWriter("/path/to/updates.csv"), writerSettings);

    //Here we write everything, and get the job done.
    writer.writeRowsAndClose(updatedData);
}
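With the sample rows from the question, updates.csv should then contain the header line followed by 1,Jhon,Florida,50069,Accountant,AuditFirm,3,$5000 (assuming the parser's default trimming of surrounding whitespace).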
This should work like a charm. Hope it helps.
Disclosure: I am the author of this library. It's open source and free (Apache 2.0 license).
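If you build with Maven, the library should be available from Maven Central; to my knowledge the coordinates are com.univocity:univocity-parsers, but check the project page to be sure.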
Answer 1 (score: 0)
I would approach the problem step by step.
First, I would parse/read the master CSV file and store its contents in a hashmap, where the key would be each record's unique 'id'. As for the values, you could store them in a hash as well, or simply create a Java class to hold the information.
Example of the hash:
{
    '1': { 'name': 'Jhon',
           'city': 'Florida',
           'zip': 50069,
           ....
         }
}
Next, read the daily CSV file. For each row, read the 'id' and check whether that key exists in the hashmap you created earlier.
If it exists, fetch the information you need from the hashmap and write it to the new CSV file.
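Here is a minimal sketch of that approach, assuming plain java.io, a naive split on commas (so it won't handle quoted fields that contain commas), and made-up file names:

import java.io.*;
import java.util.HashMap;
import java.util.Map;

public class CsvJoin {
    public static void main(String[] args) throws IOException {
        // Step 1: load the master file into a map keyed by id (column 0).
        Map<String, String[]> master = new HashMap<String, String[]>();
        BufferedReader masterIn = new BufferedReader(new FileReader("master.csv"));
        String line = masterIn.readLine(); // skip the header
        while ((line = masterIn.readLine()) != null) {
            String[] cols = line.split(",", -1); // id,name,city,zip,occupation,company,exp,salary
            master.put(cols[0].trim(), cols);
        }
        masterIn.close();

        // Step 2: stream the daily file, look up each id, and write the merged row.
        BufferedReader dailyIn = new BufferedReader(new FileReader("daily.csv"));
        PrintWriter out = new PrintWriter(new FileWriter("merged.csv"));
        out.println("id,name,city,zip,occupation,company,exp,salary");
        line = dailyIn.readLine(); // skip the header
        while ((line = dailyIn.readLine()) != null) {
            String[] daily = line.split(",", -1);
            String[] m = master.get(daily[0].trim());
            if (m != null) {
                // columns 5..7 of the master row are company, exp and salary
                out.println(line + "," + m[5].trim() + "," + m[6].trim() + "," + m[7].trim());
            }
        }
        dailyIn.close();
        out.close();
    }
}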
Also, you may want to consider using a third-party CSV parser to simplify this task.
If you have Maven, perhaps you can follow this example I found online. Otherwise you can just google for apache 'csv parser' examples on the internet.
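For instance, here is a rough sketch of the same join using Apache Commons CSV (assuming the org.apache.commons:commons-csv artifact; the file names are placeholders, the column names are from the question):

import java.io.FileReader;
import java.io.FileWriter;
import java.util.HashMap;
import java.util.Map;

import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVPrinter;
import org.apache.commons.csv.CSVRecord;

public class CsvJoinCommons {
    public static void main(String[] args) throws Exception {
        // Load the master file into a map keyed by the "id" column.
        Map<String, CSVRecord> master = new HashMap<String, CSVRecord>();
        CSVParser masterIn = CSVFormat.DEFAULT.withFirstRecordAsHeader()
                .parse(new FileReader("master.csv"));
        for (CSVRecord r : masterIn) {
            master.put(r.get("id"), r);
        }
        masterIn.close();

        // Stream the daily file and print a merged row for every id match.
        CSVParser dailyIn = CSVFormat.DEFAULT.withFirstRecordAsHeader()
                .parse(new FileReader("daily.csv"));
        CSVPrinter out = new CSVPrinter(new FileWriter("merged.csv"),
                CSVFormat.DEFAULT.withHeader("id", "name", "city", "zip",
                        "occupation", "company", "exp", "salary"));
        for (CSVRecord d : dailyIn) {
            CSVRecord m = master.get(d.get("id"));
            if (m != null) {
                out.printRecord(d.get("id"), d.get("name"), d.get("city"), d.get("zip"),
                        d.get("occupation"), m.get("company"), m.get("exp"), m.get("salary"));
            }
        }
        dailyIn.close();
        out.close();
    }
}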