Question

I have more than 100,000 CSV files in the following format:

1,1,5,1,1,1,0,0,6,6,1,1,1,0,1,0,13,4,7,8,18,20,,,,,,,,,,,,,,,,,,,,,,
1,1,5,1,1,1,0,1,6,5,1,1,1,0,1,0,4,7,8,18,20,,,,,,,,,,,,,,,,,,,,,,,
1,1,5,1,1,1,0,2,6,5,1,1,1,0,1,0,4,7,8,18,20,,,,,,,,,,,,,,,,,,,,,,,
1,1,5,1,1,1,0,3,6,5,1,1,1,0,1,0,13,4,7,8,20,,,,,,,,,,,,,,,,,,,,,,,
1,1,5,1,1,1,0,4,6,5,1,1,1,0,1,0,13,4,7,8,20,,,,,,,,,,,,,,,,,,,,,,,
1,1,5,1,1,1,0,5,6,4,1,0,1,0,1,0,4,8,18,20,,,,,,,,,,,,,,,,,,,,,,,,
1,1,5,1,1,1,0,6,6,5,1,1,1,0,1,0,4,7,8,18,20,,,,,,,,,,,,,,,,,,,,,,,
1,1,5,1,1,1,0,7,6,5,1,1,1,0,1,0,13,4,7,8,20,,,,,,,,,,,,,,,,,,,,,,,
1,1,5,1,1,1,0,8,6,5,1,1,1,0,1,0,13,4,7,8,20,,,,,,,,,,,,,,,,,,,,,,,
1,1,5,1,1,2,0,0,12,12,1,2,4,1,1,0,13,4,7,8,18,20,21,25,27,29,31,32,,,,,,,,,,,,,,,,

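What I need is field 10 and everything from field 17 onward. Field 10 is a counter that says how many integers are stored starting at field 17, i.e. what I need is: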

6,13,4,7,8,18,20
5,4,7,8,18,20
5,4,7,8,18,20
5,13,4,7,8,20
5,13,4,7,8,20
4,4,8,18,20
5,4,7,8,18,20
5,13,4,7,8,20
5,13,4,7,8,20
12,13,4,7,8,18,20,21,25,27,29,31,32

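The maximum number of integers to read is 28. I can easily do this with getline in C++. However, from previous experience, since I need to process more than 100,000 such files and each file may have 300,000~400,000 such lines, using getline to read the data and build a vector<vector<int>> may cause serious performance problems for me. I tried to use fscanf instead: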

while (!feof(fstream)){
 fscanf(fstream,"%*d,%*d,%*d,%*d,%*d,%*d,%*d,%*d,%*d,%d",&MyCounter);
 fscanf(fstream,"%*d,%*d,%*d,%*d,%*d,%*d"); // skip to column 17
 for (int i=0;i<MyCounter;i++){
  fscanf(fstream,"%d",&MyIntArr[i]);
 }
 fscanf(fstream,"%*s"); // to finish the line
}

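However, this calls fscanf many times and may also cause a performance problem. Is there a way to read a variable number of integers with a single fscanf call? Or do I need to read the line into a string and then strsep/stoi it? Compared with fscanf, which is better from a performance point of view?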

Answer 1

So, there are at most 43 numbers per line. Even at 64 bits, each number is limited to 21 digits, so 1024 bytes is plenty for the at most 946 bytes a line can occupy (as long as there is no whitespace).

char line[1024];

while (fgets(line, sizeof(line), stdin) != NULL) {
    //...
}

A helper function to skip to the desired column:

/* Returns a pointer to the nth comma in s, or to the terminating NUL
   if s contains fewer than n commas. */
const char *find_nth_comma(const char *s, int n) {
    const char *p = s;
    if (p && n) while (*p) {
        if (*p == ',') {
            if (--n == 0) break;
        }
        ++p;
    }
    return p;
}

So, in your loop, skip to column 10 to find the first number of interest, and then skip to column 17 to start reading the rest of the numbers. The completed loop looks like this:

/* requires <assert.h>, <ctype.h>, <stdio.h> and <stdlib.h> */
while (fgets(line, sizeof(line), stdin) != NULL) {
    const char *p = find_nth_comma(line, 9);   /* comma before field 10 */
    char *end;
    assert(p && *p);
    MyCounter = strtol(p+1, &end, 10);
    assert(*end == ',');
    p = find_nth_comma(end+1, 6);              /* comma before field 17 */
    assert(p && *p);
    for (int i = 0; i < MyCounter; ++i, p = end) {
        MyIntArray[i] = strtol(p+1, &end, 10);
        assert((*end == ',') ||
               ((i == MyCounter-1) && (*end == '\0' || isspace(*end & 0xFF))));
    }
}

This method will also work with an mmap solution. The fgets call would be replaced with a function that points to the next line to be processed in the file. find_nth_comma would need to be modified to detect end of line or end of file instead of relying on a NUL-terminated string. strtol would be replaced with a custom function that again detects end of line or end of file. (The purpose of these changes is to remove any code that requires copying the data, which would be the motivation for an mmap approach.)
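The modifications described for the mmap variant might look roughly like the sketch below. The names find_nth_comma_bounded, parse_long_bounded and parse_record are hypothetical and not part of the answer's code; the sketch assumes plain decimal integers, no whitespace inside fields, and does no overflow checking.

```c
#include <stddef.h>

/* Hypothetical bounded variant of find_nth_comma (n must be >= 1): stops at
   `end` or at a line terminator instead of relying on a NUL terminator.
   Returns the nth comma, or NULL if the line or buffer ends first. */
static const char *find_nth_comma_bounded(const char *p, const char *end, int n)
{
    for (; p < end && *p != '\n' && *p != '\r'; ++p) {
        if (*p == ',' && --n == 0)
            return p;
    }
    return NULL;
}

/* Hypothetical stand-in for strtol: parses an optionally signed decimal
   integer in [p, end) and stores the first unparsed position in *next. */
static long parse_long_bounded(const char *p, const char *end, const char **next)
{
    long value = 0;
    int negative = 0;
    if (p < end && (*p == '-' || *p == '+')) {
        negative = (*p == '-');
        ++p;
    }
    while (p < end && *p >= '0' && *p <= '9') {
        value = value * 10 + (*p - '0');
        ++p;
    }
    *next = p;
    return negative ? -value : value;
}

/* Reads the counter in field 10, then that many values starting at field 17.
   Returns the number of values stored in out, or -1 on a malformed line. */
static int parse_record(const char *line, const char *end, long *out, int cap)
{
    const char *next;
    const char *p = find_nth_comma_bounded(line, end, 9);   /* before field 10 */
    if (!p)
        return -1;
    long count = parse_long_bounded(p + 1, end, &next);
    if (count < 0 || count > cap)
        return -1;
    p = find_nth_comma_bounded(next, end, 7);               /* before field 17 */
    if (!p)
        return -1;
    for (long i = 0; i < count; ++i) {
        if (p >= end || *p != ',')
            return -1;                  /* line ended before all values were read */
        out[i] = parse_long_bounded(p + 1, end, &next);
        if (next == p + 1)
            return -1;                  /* empty field where a value was expected */
        p = next;
    }
    return (int)count;
}
```

Nothing is copied out of the mapped buffer here: the caller passes a pointer to the start of a line plus the end of the mapping, and advances to the next line itself.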

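With parallel processing, multiple parts of a file could be parsed at the same time. However, it is probably sufficient to have different threads process different files, and then collate the results once all files have been processed.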

Answer 2

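In the end I used a memory-mapped file to solve my problem (this solution is a by-product of my previous question about performance problems when reading large CSV files): read in large CSV file performance issue in C++

Since I work on MS Windows, I use Stephan Brumme's "Portable Memory Mapping C++ Class": http://create.stephan-brumme.com/portable-memory-mapping/ Because I don't need to handle files > 2 GB, my implementation is simpler. For files over 2 GB, visit the website to see how to handle them.

Please find my code below: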

// may try RandomAccess or SequentialScan
MemoryMapped MemFile(FilterBase.BaseFileName, MemoryMapped::WholeFile, MemoryMapped::RandomAccess);

// pointer to the start of the mapped file
char* start = (char*)MemFile.getData();
// end pointer required by strtol/strtof; a dummy in my case
char* tmpBuffer = start;

// loop counter
uint64_t i = 0;

// pre-allocate the result vector
MyVector.resize(300000);

// line counter
int LnCnt = 0;

// number of fields per line
const int NumOfField = 43;
// delimiter count: number of fields + 1, since the leading and trailing delimiters are virtual
const int DelimCnt = NumOfField + 1;
// Delimiter positions of the current line. May be allocated with new at run time,
// or stored in a vector of integers.
// The positions are relative to the start of the file, so if the file is extremely
// large, change int to unsigned, long or even unsigned long long.
// One spare slot keeps the end-of-line write in bounds for lines with extra commas.
static int DelimPos[DelimCnt + 1];

// Max number of fields to read. Usually equal to NumOfField, but can be smaller:
// e.g. in my case I only need 4 fields from the first 15, so 15 could be assigned here.
int MaxFieldNeed = NumOfField;
// keeps track of how many delimiters have been read on the current line
int DelimCounter = 0;
// field and line separators
char FieldDelim = ',';
char LineSep = '\n';

// 1st field: position of the "virtual delimiter"
DelimPos[DelimCounter] = -1;
DelimCounter++;

// loop through the whole mapped file once and only once
for (i = 0; i < MemFile.size(); i++)
{
  // record the position of every delimiter in the line
  if ((MemFile[i] == FieldDelim) && (DelimCounter <= MaxFieldNeed)) {
    DelimPos[DelimCounter] = (int)i;
    DelimCounter++;
  }

  // extract all values when the end of the line is hit
  if (MemFile[i] == LineSep) {
    // no need to test if (DelimCounter==NumOfField); just assign anyway. This wastes
    // a little memory in the integer array but gains performance.
    DelimPos[DelimCounter] = (int)i;
    // I know exactly what the format is and which field(s) I want.
    // A more general approach (as a CSV reader) may put all fields
    // into a vector of vector of string.
    // With *EFFORT* one may modify this piece of code so that it can parse
    // different formats at run time, e.g. similar to:
    // fscanf(fstream,"%d,%f....
    // Also, this piece of code cannot handle complex CSV, e.g.
    // Peter,28,157CM
    // John,26,167CM
    // "Mary,Brown",25,150CM
    // The end pointer of the string range is exclusive, so it points at the delimiter.
    MyVector[LnCnt].StrField = string(start + DelimPos[0] + 1, start + DelimPos[1]);
    MyVector[LnCnt].IntField = strtol(start + DelimPos[3] + 1, &tmpBuffer, 10);
    MyVector[LnCnt].IntField2 = strtol(start + DelimPos[8] + 1, &tmpBuffer, 10);
    MyVector[LnCnt].FloatField = strtof(start + DelimPos[14] + 1, &tmpBuffer);
    // reset the delimiter counter for each line
    DelimCounter = 0;
    // the previous line separator is treated as the first delimiter of the next line
    DelimPos[DelimCounter] = (int)i;
    DelimCounter++;
    LnCnt++;
  }
}
MyVector.resize(LnCnt);
MyVector.shrink_to_fit();
MemFile.close();

I can write whatever I want inside:

  if (MemFile[i] == LineSep) {
  }

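for example to handle empty fields, perform calculations, and so on. With this code I process 2,100 files (6.3 GB) in 57 seconds! (The CSV format is hard-coded in it, and it only grabs 4 values, as in my previous case.) I will change this code later to handle this question. Thanks to everyone who helped me with this issue.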
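Adapting the delimiter-position idea to the format in this question could look roughly like the following sketch. It is plain C over an in-memory buffer rather than the MemoryMapped class, and the names parse_buffer, field_to_long, MAX_DELIMS and MAX_VALUES are hypothetical. It assumes non-negative integers and a buffer in which every line to be parsed ends in '\n'.

```c
#include <stddef.h>

enum { MAX_DELIMS = 64, MAX_VALUES = 28 };

/* Converts the digits in buf[from, to) to a long; an empty field yields 0. */
static long field_to_long(const char *buf, long from, long to)
{
    long v = 0;
    for (long k = from; k < to && buf[k] >= '0' && buf[k] <= '9'; ++k)
        v = v * 10 + (buf[k] - '0');
    return v;
}

/* Single pass over the buffer. For every line, rows[r][0] receives the counter
   from field 10 and rows[r][1..] the values starting at field 17.
   Returns the number of rows stored. */
static size_t parse_buffer(const char *buf, size_t len,
                           long rows[][MAX_VALUES + 1], size_t max_rows)
{
    long delim[MAX_DELIMS];
    int ndelim = 1;
    size_t nrows = 0;
    delim[0] = -1;                          /* virtual delimiter before field 1 */

    for (size_t i = 0; i < len; ++i) {
        if (buf[i] == ',' && ndelim < MAX_DELIMS - 1) {
            delim[ndelim++] = (long)i;
        } else if (buf[i] == '\n') {
            delim[ndelim] = (long)i;        /* the newline closes the last field */
            /* field f (1-based) spans delim[f-1]+1 .. delim[f]-1 */
            if (nrows < max_rows && ndelim >= 10) {
                long count = field_to_long(buf, delim[9] + 1, delim[10]);
                long avail = ndelim - 16;   /* fields 17..ndelim exist on this line */
                if (count > MAX_VALUES)
                    count = MAX_VALUES;
                if (count > avail)
                    count = avail > 0 ? avail : 0;
                rows[nrows][0] = count;
                for (long k = 0; k < count; ++k)
                    rows[nrows][k + 1] =
                        field_to_long(buf, delim[16 + k] + 1, delim[17 + k]);
                ++nrows;
            }
            delim[0] = (long)i;             /* newline is the next line's first delimiter */
            ndelim = 1;
        }
    }
    return nrows;
}
```

Because each field is bounded by two recorded delimiter positions, the conversion never reads past the field, so empty trailing fields are harmless.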

Answer 3

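For maximum performance, you should map the file in memory with mmap or an equivalent, and parse the file with ad hoc code, typically scanning one character at a time with a pointer, checking for '\n' and/or '\r' for the end of a record and converting the numbers on the fly for storage in your arrays. The tricky parts are:

How do you allocate or otherwise handle the destination arrays?
Are the fields all numeric? Integral?
Is the last record terminated by a newline? You can easily check for this condition after the mmap call. The advantage is that you only need to check for the end of the file when you encounter a newline.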
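The newline check and the character-by-character scan could be sketched as below. The function names are hypothetical, the buffer stands in for whatever the mapping call returns, and the scan assumes non-negative integers with no overflow handling.

```c
#include <stddef.h>

/* Returns nonzero when the last record in the buffer ends with '\n'. */
static int last_record_terminated(const char *buf, size_t len)
{
    return len > 0 && buf[len - 1] == '\n';
}

/* Scans the buffer one character at a time and converts every non-empty
   numeric field on the fly. Precondition: last_record_terminated(buf, len).
   Since a '\n' is guaranteed ahead of every position inside a record, the
   inner loops test only the character; the end of the buffer is checked once
   per record. Returns the number of values stored (at most cap). */
static size_t scan_integers(const char *buf, size_t len, long *out, size_t cap)
{
    const char *p = buf;
    const char *end = buf + len;
    size_t n = 0;

    while (p < end) {                        /* end-of-buffer check, once per record */
        while (*p != '\n') {
            if (*p >= '0' && *p <= '9') {
                long v = 0;
                do {
                    v = v * 10 + (*p - '0');
                    ++p;
                } while (*p >= '0' && *p <= '9');
                if (n < cap)
                    out[n++] = v;
            } else {
                ++p;                         /* ',', '\r' or any other non-digit */
            }
        }
        ++p;                                 /* step over '\n' */
    }
    return n;
}
```

If the check fails, one option is to handle the final unterminated record separately with bounds-checked code and run the fast loop on everything before it.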

Answer 4

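The easiest way to read a number of integers determined at run time may be to point into the right part of a longer format string. In other words, we can have a format string with 28 %d, specifiers, but point to the nth one before the end of the string, and pass that pointer as the format string for scanf().

As a simple example, consider accepting 3 integers from a maximum of 6:

"%d,%d,%d,%d,%d,%d,"
          ^

The arrow shows the string pointer to use as the format argument.

A full worked example of this approach ran in about 8 seconds for 1 million iterations (10 million lines) when built with gcc -O3. It is slightly complicated by the mechanism that updates the input string pointer, which is obviously unnecessary when reading from a file stream. I skipped the check that nfields <= 28, but that is easy to add.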
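The pointer arithmetic can be sketched for the 6-specifier case as follows. This is not the answer's original example; read_n_ints and MAX_FIELDS are hypothetical names, and the sketch relies on the C rule that surplus arguments to the scanf family are ignored once the format string is exhausted.

```c
#include <stdio.h>

enum { MAX_FIELDS = 6 };

/* Reads up to n (0..MAX_FIELDS) comma-separated integers from s into out
   with a single sscanf call. Returns the number of integers converted,
   or -1 if n is out of range. */
static int read_n_ints(const char *s, int n, int out[MAX_FIELDS])
{
    static const char format[] = "%d,%d,%d,%d,%d,%d,";

    if (n < 0 || n > MAX_FIELDS)
        return -1;
    if (n == 0)
        return 0;
    /* Each "%d," is 3 characters long, so skipping the first
       (MAX_FIELDS - n) of them leaves exactly n specifiers. */
    const char *fmt = format + 3 * (MAX_FIELDS - n);
    /* All MAX_FIELDS pointers are passed; only the first n are used. */
    return sscanf(s, fmt, &out[0], &out[1], &out[2],
                  &out[3], &out[4], &out[5]);
}
```

A return value smaller than n indicates that the line held fewer integers than requested, for example because an empty field was reached.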

Reading a variable number of integers with fscanf
