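I have a Java server application that downloads a CSV file and parses it. Parsing can take anywhere from 5 to 45 minutes, and it happens every hour. This method is the application's bottleneck, so this is not premature optimization. The code so far: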
client.executeMethod(method);
InputStream in = method.getResponseBodyAsStream(); // this is http stream
String line;
String[] record;
reader = new BufferedReader(new InputStreamReader(in), 65536);
try {
// read the header line
line = reader.readLine();
// some code
while ((line = reader.readLine()) != null) {
// more code
line = line.replaceAll("\"\"", "\"NULL\"");
// Now remove all of the quotes
line = line.replaceAll("\"", "");
if (!line.startsWith("ERROR")) {
//bla bla
continue;
}
record = line.split(",");
//more error handling
// build the object and put it in HashMap
}
//exceptions handling, closing connection and reader
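Is there any existing library that would help me speed things up? Can I improve the existing code?
Answer 0 (score: 18)
Have you seen Apache Commons CSV?
A caveat on using split
Bear in mind that split only returns a view of the data, meaning that the original line object is not eligible for garbage collection while there is a reference to any of its views. Perhaps making a defensive copy would help? (Java bug report) Note that this applies to older JDKs, where substrings shared the backing array of the original string; since Java 7 update 6, substring makes a copy.
It is also not reliable for grouping escaped CSV columns that contain commas.
Answer 1 (score: 13)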
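Answer 2 (score: 5)
In addition to the suggestions made above, I think you can try to improve your code by using threading and concurrency.
Here is a brief analysis and a suggested solution.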
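As one possible illustration of the threading idea (an illustrative sketch, not the analysis referred to above), the per-line work can be spread across cores with a parallel stream while a single BufferedReader feeds the lines. The class name ParallelCsvSketch, the method parseAll, keying records by the first column, and skipping lines that start with ERROR are all assumptions made for the example. The split call is only a placeholder for whatever per-line parsing you end up using.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class ParallelCsvSketch {

    // Hypothetical helper: reads every line after the header and parses the
    // lines on the common fork-join pool. Records are keyed by their first
    // column, which is an assumption made for this example.
    public static Map<String, String[]> parseAll(Reader source) {
        BufferedReader reader = new BufferedReader(source, 65536);
        try {
            reader.readLine(); // consume the header line sequentially
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Map<String, String[]> records = new ConcurrentHashMap<>();
        reader.lines()
              .parallel()
              // Assumption: lines starting with ERROR carry no record.
              .filter(line -> !line.isEmpty() && !line.startsWith("ERROR"))
              // Placeholder parse step; swap in a real CSV parser here.
              .map(line -> line.split(",", -1))
              .forEach(fields -> records.put(fields[0], fields));
        return records;
    }

    public static void main(String[] args) {
        String csv = "id,name\n1,alice\nERROR,bad\n2,bob\n";
        Map<String, String[]> records = parseAll(new StringReader(csv));
        System.out.println(records.size()); // prints 2
    }
}
```

Whether this helps depends on where the time actually goes: if the download itself is the slow part, parallel parsing will not change much, so it is worth measuring first.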
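Answer 3 (score: 5)
The problem with your code is that it uses replaceAll and split, which are very costly operations. You should definitely consider using a CSV parser/reader that does the parsing in one pass.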
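To make the "one pass" point concrete, here is a minimal sketch of a hand-written single-pass line parser. It does in one scan what the question's two replaceAll calls plus split do in three: it strips quotes, turns an empty quoted field into NULL, and splits on commas, without using regular expressions. The names SinglePassLineParser, parseLine and finish are made up for the example, and treating a doubled quote inside a quoted field as a literal quote is an assumption about the input format. A real library handles many more edge cases.

```java
import java.util.ArrayList;
import java.util.List;

class SinglePassLineParser {

    // Splits one CSV line in a single scan. An empty quoted field ("")
    // becomes the string NULL, mirroring the question's replaceAll logic.
    public static List<String> parseLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;
        boolean wasQuoted = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c != '"') {
                    field.append(c); // commas inside quotes are plain data
                } else if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                    field.append('"'); // assumption: "" inside quotes is a literal quote
                    i++;
                } else {
                    inQuotes = false; // closing quote
                }
            } else if (c == '"') {
                inQuotes = true;
                wasQuoted = true;
            } else if (c == ',') {
                fields.add(finish(field, wasQuoted));
                wasQuoted = false;
            } else {
                field.append(c);
            }
        }
        fields.add(finish(field, wasQuoted));
        return fields;
    }

    // Emits the current field and resets the buffer for the next one.
    private static String finish(StringBuilder field, boolean wasQuoted) {
        String value = (wasQuoted && field.length() == 0) ? "NULL" : field.toString();
        field.setLength(0);
        return value;
    }

    public static void main(String[] args) {
        String line = "\"a,b\",\"\",c";
        System.out.println(parseLine(line).size());  // prints 3
        System.out.println(line.split(",").length);  // prints 4: split breaks the quoted comma
    }
}
```

The last two lines of main also show why a plain split is unreliable on quoted fields that contain commas.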
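There is a benchmark on GitHub.
Unfortunately, it was run under Java 6. The numbers are slightly different under Java 7 and 8. I am trying to get more detailed data for different file sizes, but it is a work in progress.
Answer 4 (score: 2)
You should have a look at OpenCSV. I would expect that they have performance optimizations.
Answer 5 (score: 2)
A new kid on the block. It uses Java annotations and is built on apache-csv, which is one of the faster libraries out there for CSV parsing.
The library is also thread-safe, so you can, and should, reuse the CSVProcessor.
Example: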
Pojo
@CSVReadComponent(type = CSVType.NAMED)
@CSVWriteComponent(type = CSVType.ORDER)
public class Pojo {
@CSVWriteBinding(order = 0)
private String name;
@CSVWriteBinding(order = 1)
@CSVReadBinding(header = "age")
private Integer age;
@CSVWriteBinding(order = 2)
@CSVReadBinding(header = "money")
private Double money;
@CSVReadBinding(header = "name")
public void setA(String name) {
this.name = name;
}
@Override
public String toString() {
return "Name: " + name + System.lineSeparator() + "\tAge: " + age + System.lineSeparator() + "\tMoney: "
+ money;
}
}
Main
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.*;
public class SimpleMain {
public static void main(String[] args) {
String csv = "name,age,money" + System.lineSeparator() + "Michael Williams,34,39332.15";
CSVProcessor processor = new CSVProcessor(Pojo.class);
List<Pojo> list = new ArrayList<>();
try {
list.addAll(processor.parse(new StringReader(csv)));
list.forEach(System.out::println);
System.out.println();
StringWriter sw = new StringWriter();
processor.write(list, sw);
System.out.println(sw.toString());
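} catch (IOException e) {
e.printStackTrace(); // report the failure instead of silently swallowing it
}
}
}
Since it is built on top of apache-csv, you can use the powerful CSVFormat tool. Say the delimiter of the CSV is a pipe (|) instead of a comma (,); you could then do, for example: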
CSVFormat csvFormat = CSVFormat.DEFAULT.withDelimiter('|');
List<Pojo> list = processor.parse(new StringReader(csv), csvFormat);
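Another benefit is that inheritance is also taken into account.
Other examples cover reading and writing non-primitive data.
Answer 6 (score: 1)
A little late here, but there are now a few benchmarking projects for CSV parsers. Your selection will depend on the exact use case (i.e. raw data vs. data binding, etc.).
Answer 7 (score: 0)
For speed you do not want to use replaceAll, and you do not want to use regular expressions either. What you basically always want to do in a hot path like this is write a character-by-character state machine parser. I have done that here, wrapping the whole thing in a function that returns an Iterable. It also takes in the stream and parses it without saving it out or caching it, so if you can abort early, that should work fine as well. It should also be short enough and coded well enough to make it obvious how it works.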
public static Iterable<String[]> parseCSV(final InputStream stream) throws IOException {
return new Iterable<String[]>() {
@Override
public Iterator<String[]> iterator() {
return new Iterator<String[]>() {
static final int UNCALCULATED = 0;
static final int READY = 1;
static final int FINISHED = 2;
int state = UNCALCULATED;
ArrayList<String> value_list = new ArrayList<>();
StringBuilder sb = new StringBuilder();
String[] return_value;
public void end() {
end_part();
return_value = new String[value_list.size()];
value_list.toArray(return_value);
value_list.clear();
}
public void end_part() {
value_list.add(sb.toString());
sb.setLength(0);
}
public void append(int ch) {
sb.append((char) ch);
}
public void calculate() throws IOException {
boolean inquote = false;
while (true) {
int ch = stream.read();
switch (ch) {
default: //regular character.
append(ch);
break;
case -1: //read has reached the end.
if ((sb.length() == 0) && (value_list.isEmpty())) {
state = FINISHED;
} else {
end();
state = READY;
}
return;
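case '\r': //carriage return: skipped outside quotes so "\r\n" ends one record, not two.
if (inquote) {
append(ch);
}
break;
case '\n': //end of line.
if (inquote) {
append(ch);
} else {
end();
state = READY;
return;
}
break;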
case ',': //comma
if (inquote) {
append(ch);
} else {
end_part();
break;
}
break;
case '"': //quote.
inquote = !inquote;
break;
}
}
}
@Override
public boolean hasNext() {
if (state == UNCALCULATED) {
try {
calculate();
} catch (IOException ex) {
throw new RuntimeException(ex); // surface read failures instead of silently ending the iteration
}
}
return state == READY;
}
@Override
public String[] next() {
if (state == UNCALCULATED) {
try {
calculate();
} catch (IOException ex) {
throw new RuntimeException(ex); // surface read failures instead of returning a stale record
}
}
state = UNCALCULATED;
return return_value;
}
};
}
};
}
You would typically consume it like this:
for (String[] csv : parseCSV(stream)) {
//<deal with parsed csv data>
}
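The convenience of that API is probably worth how cryptic the function looks.
Answer 8 (score: 0)
Is there any existing library that would help me speed things up?
Yes, the Apache Commons CSV project works very well in my experience.
Here is an example app that uses the Apache Commons CSV library to write and read rows of 24 columns: an integer sequential number, an Instant, and the rest are random UUID objects.
For 10,000 rows, the writing and the reading each take about half a second. The reading includes reconstituting the Integer, Instant, and UUID objects.
My example code lets you toggle the reconstituting of objects on or off. I ran both with a million rows. This creates a file of 850 megabytes. I am using Java 12 on a MacBook Pro (Retina, 15-inch, Late 2013), 2.3 GHz Intel Core i7, 16 GB 1600 MHz DDR3, with the Apple built-in SSD.
For a million rows, ten seconds for reading plus two seconds for parsing.
The source code is a single .java file. It has a write method and a read method. Both methods are called from a main method.
I opened a BufferedReader by calling Files.newBufferedReader.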
package work.basil.example;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVPrinter;
import org.apache.commons.csv.CSVRecord;
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.Duration;
import java.time.Instant;
import java.util.UUID;
public class CsvReadingWritingDemo
{
public static void main ( String[] args )
{
CsvReadingWritingDemo app = new CsvReadingWritingDemo();
app.write();
app.read();
}
private void write ()
{
Instant start = Instant.now();
int limit = 1_000_000; // 10_000 100_000 1_000_000
Path path = Paths.get( "/Users/basilbourque/IdeaProjects/Demo/csv.txt" );
try (
Writer writer = Files.newBufferedWriter( path, StandardCharsets.UTF_8 );
CSVPrinter printer = new CSVPrinter( writer , CSVFormat.RFC4180 );
)
{
printer.printRecord( "id" , "instant" , "uuid_01" , "uuid_02" , "uuid_03" , "uuid_04" , "uuid_05" , "uuid_06" , "uuid_07" , "uuid_08" , "uuid_09" , "uuid_10" , "uuid_11" , "uuid_12" , "uuid_13" , "uuid_14" , "uuid_15" , "uuid_16" , "uuid_17" , "uuid_18" , "uuid_19" , "uuid_20" , "uuid_21" , "uuid_22" );
for ( int i = 1 ; i <= limit ; i++ )
{
printer.printRecord( i , Instant.now() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() );
}
} catch ( IOException ex )
{
ex.printStackTrace();
}
Instant stop = Instant.now();
Duration d = Duration.between( start , stop );
System.out.println( "Wrote CSV for limit: " + limit );
System.out.println( "Elapsed: " + d );
}
private void read ()
{
Instant start = Instant.now();
int count = 0;
Path path = Paths.get( "/Users/basilbourque/IdeaProjects/Demo/csv.txt" );
try (
Reader reader = Files.newBufferedReader( path , StandardCharsets.UTF_8) ;
)
{
CSVFormat format = CSVFormat.RFC4180.withFirstRecordAsHeader();
CSVParser parser = CSVParser.parse( reader , format );
for ( CSVRecord csvRecord : parser )
{
if ( true ) // Toggle parsing of the string data into objects. Turn off (`false`) to see strictly the time taken by Apache Commons CSV to read & parse the lines. Turn on (`true`) to get a feel for real-world load.
{
Integer id = Integer.valueOf( csvRecord.get( 0 ) ); // Annoying zero-based index counting.
Instant instant = Instant.parse( csvRecord.get( 1 ) );
for ( int i = 3 - 1 ; i <= 24 - 1 ; i++ ) // Columns 3 through 24 hold the UUIDs. Subtract one for annoying zero-based index counting.
{
UUID uuid = UUID.fromString( csvRecord.get( i ) );
}
}
count++;
if ( count % 1_000 == 0 ) // Every so often, report progress.
{
//System.out.println( "# " + count );
}
}
} catch ( IOException e )
{
e.printStackTrace();
}
Instant stop = Instant.now();
Duration d = Duration.between( start , stop );
System.out.println( "Read CSV for count: " + count );
System.out.println( "Elapsed: " + d );
}
}