Parsing a huge, complex CSV file with C++

Time: 2013-08-16 13:32:54

Tags: c++ parsing csv

I have a large CSV file that looks like this:

23456,The End is Near,A silly description that makes no sense,http://www.example.com,45332,5th July 1998 Sunday,45.332

That's just one line of the CSV file, and there are about 500k of them.

I want to parse this file in C++. The code I started with is:

#include <iostream>
#include <fstream>
#include <string>
#include <sstream>

using namespace std;

int main()
{
    // open the input csv file containing training data
    ifstream inputFile("my.csv");

    string line;

    while (getline(inputFile, line, ','))
    {
        istringstream ss(line);

        // declaring appropriate variables present in csv file
        long unsigned id;
        string url, title, description, datetaken;
        float val1, val2;

        ss >> id >> url >> title >> datetaken >> description >> val1 >> val2;

        cout << url << endl;
    }
    inputFile.close();
}

The problem is that it doesn't print out the correct values.

I suspect it can't handle the white space inside the fields. So what do you suggest I do?

Thanks

5 Answers:

Answer 0 (score: 4)

In this example we have to use getline twice to parse the string. The first, getline(cin, line), reads one line of CSV text using the default newline delimiter. The second, getline(ss, field, ','), splits that line on commas.

#include <iostream>
#include <sstream>
#include <string>
#include <vector>

float get_float(const std::string& s) { 
    std::stringstream ss(s);
    float ret;
    ss >> ret;
    return ret;
}


int get_int(const std::string& s) { 
    std::stringstream ss(s);
    int ret;
    ss >> ret;
    return ret;
}

int main() {
    std::string line;
    while (std::getline(std::cin, line)) {
        std::stringstream ss(line);
        std::vector<std::string> v;
        std::string field;
        // split the line on commas
        while (std::getline(ss, field, ',')) {
            v.push_back(field);
        }
        if (v.size() < 7) continue; // skip malformed lines
        int id = get_int(v[0]);
        float f = get_float(v[6]);
        std::cout << id << " " << v[3] << " " << f << std::endl;
    }
}

Answer 1 (score: 1)

Reading std::strings with std::istream's overloaded extraction operator doesn't work well here: the whole line is one string, so by default nothing tells the stream where one field ends and the next begins. A quick fix is to split the line on the commas and assign the values to the corresponding fields (instead of using std::istringstream alone).

Note: this is a supplement to jrok's point about std::getline.

Answer 2 (score: 1)

Within the stated constraints, I think I'd do something like this:

#include <locale>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>
#include <iterator>
#include <algorithm>

// A ctype that classifies only comma and new-line as "white space":
struct field_reader : std::ctype<char> {

    field_reader() : std::ctype<char>(get_table()) {}

    static std::ctype_base::mask const* get_table() {
        static std::vector<std::ctype_base::mask>
            rc(table_size, std::ctype_base::mask());

        rc[','] = std::ctype_base::space;
        rc['\n'] = std::ctype_base::space;
        return &rc[0];
    }
};

// A struct to hold one record from the file:
struct record {
    std::string key, name, desc, url, zip, date, number;

    friend std::istream &operator>>(std::istream &is, record &r) {
        return is >> r.key >> r.name >> r.desc >> r.url >> r.zip >> r.date >> r.number;
    }

    friend std::ostream &operator<<(std::ostream &os, record const &r) {
        return os << "key: " << r.key
            << "\nname: " << r.name
            << "\ndesc: " << r.desc
            << "\nurl: " << r.url
            << "\nzip: " << r.zip
            << "\ndate: " << r.date
            << "\nnumber: " << r.number;
    }
};

int main() {
    std::stringstream input("23456, The End is Near, A silly description that makes no sense, http://www.example.com, 45332, 5th July 1998 Sunday, 45.332");

    // use our ctype facet with the stream:
    input.imbue(std::locale(std::locale(), new field_reader()));

    // read in all our records:
    std::istream_iterator<record> in(input), end;
    std::vector<record> records{ in, end };

    // show what we read:
    std::copy(records.begin(), records.end(),
              std::ostream_iterator<record>(std::cout, "\n"));

}

No doubt this is longer than most of the others, but it's all broken down into small, mostly reusable pieces. Once you have the other pieces in place, the code to read the data is trivial:

    std::vector<record> records{ in, end };

Another point I find compelling: the first time this code compiled, it also ran correctly (and I find that happens quite regularly with this style of programming).

Answer 3 (score: 0)

I just solved this for myself and am willing to share! It may be overkill, but it shows Boost Tokenizer & vectors handling a big problem.

/*
 * Alfred Haines Copyleft 2013
 * convert csv to sql file
 * csv2sql requires that each line is a unique record
 *
 * This is an example of file reading and the Boost tokenizer
 *
 * In the spirit of COBOL I do not output until the end,
 * when all the print lines are output at once
 * Special thanks to SBHacker for the code to handle linefeeds
*/
#include <iostream>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>
#include <boost/tokenizer.hpp>
#include <boost/iostreams/device/file.hpp>
#include <boost/iostreams/stream.hpp>
#include <boost/algorithm/string/replace.hpp>

namespace io = boost::iostreams;
using boost::tokenizer;
using boost::escaped_list_separator;
typedef tokenizer<escaped_list_separator<char> > so_tokenizer;

using namespace std;
using namespace boost;

vector<string> parser( string );


int main()
{
vector<string> stuff ; // this is the data in a vector
string filename; // this is the input file
string c = ""; // this holds the print line
string sr ;

cout << "Enter filename: " ;
cin >> filename;
//filename = "drwho.csv";
int lastindex = filename.find_last_of("."); // find where the extension begins
string rawname = filename.substr(0, lastindex); // extract the raw name

stuff = parser( filename ); // this gets the data from the file

/** I ask if the user wants a new_index to be created */
cout << "\n\nMySql requires a unique ID field as a Primary Key \n" ;
cout << "If the first field is not unique (no duplicate entries) \nthen you should create a " ;
cout << "New index field for this data.\n" ;
cout << "Not sure? Try no first to maintain data integrity.\n" ;
string ni ;bool invalid_data = true;bool new_index = false ;
    do {
        cout<<"Should I create a New Index now? (y/n)"<<endl;
        cin>>ni;
    if ( ni  == "y" || ni  == "n" ) { invalid_data =false ;  }
        } while (invalid_data);
    cout << "\n" ;
if (ni  == "y" )
{
  new_index = true ;
  sr = rawname.c_str() ; sr.append("_id" ); // new_index field
}

// now make the sql code from the vector stuff
// Create table section
c.append("DROP TABLE IF EXISTS `");
c.append(rawname.c_str() );
c.append("`;");
c.append("\nCREATE TABLE IF NOT EXISTS `");
c.append(rawname.c_str() );
c.append( "` (");
c.append("\n");
if (new_index)
{
c.append( "`");
c.append(sr );
c.append( "`  int(10) unsigned NOT NULL,");
c.append("\n");
}

string s = stuff[0];// it is assumed that line zero has fieldnames

int x =0 ; // used to determine if new index is printed

// boost tokenizer code from the Boost website -- tok holds the token
so_tokenizer tok(s, escaped_list_separator<char>('\\', ',', '\"'));
for(so_tokenizer::iterator beg=tok.begin(); beg!=tok.end(); ++beg)
  {
    x++; // keeps number of fields for later use to eliminate the comma on the last entry
    if (x == 1 && new_index == false ) sr = static_cast<string> (*beg) ;
    c.append( "`" );
    c.append(*beg);
    if (x == 1 && new_index == false )
    {
      c.append( "`  int(10) unsigned NOT NULL,");
    }
    else
    {
    c.append("`  text ,");
    }
    c.append("\n");
    }
c.append("PRIMARY KEY (`");
c.append(sr );
c.append("`)" );
c.append("\n");
c.append( ") ENGINE=InnoDB DEFAULT CHARSET=latin1;");
c.append("\n");
c.append("\n");
// The Create table section is done

// Now make the Insert lines one per line is safer in case you need to split the sql file
for (int w=1; w < stuff.size(); ++w)
  {
    c.append("INSERT INTO `");
    c.append(rawname.c_str() );
    c.append("` VALUES (  ");
if (new_index)
{
    string String = static_cast<ostringstream*>( &(ostringstream() << w) )->str();
    c.append(String);
    c.append(" , ");
}
    int p = 1 ; // used to eliminate the comma on the last entry
    // tokenizer code needs unique name -- stok holds this token
    so_tokenizer stok(stuff[w], escaped_list_separator<char>('\\', ',', '\"'));
    for(so_tokenizer::iterator beg=stok.begin(); beg!=stok.end(); ++beg)
    {
      c.append(" '");
      string str = static_cast<string> (*beg) ;
      boost::replace_all(str, "'", "\\'");
//      boost::replace_all(str, "\n", " -- ");
      c.append( str);
      c.append("' ");
      if ( p < x ) c.append(",")  ;// we dont want a comma on the last entry
      p++ ;
    }
    c.append( ");\n");
  }

// now print the whole thing to an output file
string out_file = rawname.c_str() ;
out_file.append(".sql");
io::stream_buffer<io::file_sink> buf(out_file);
std::ostream out(&buf);
out << c ;

// let the user know that they are done
cout<< "Well if you got here then the data should be in the file " << out_file << "\n" ;

return 0;}

vector<string> parser( string filename )
{
    typedef tokenizer< escaped_list_separator<char> > Tokenizer;
    escaped_list_separator<char> sep('\\', ',', '\"');
    vector<string> stuff ;
    string data(filename);
    ifstream in(filename.c_str());
    string li;
    string buffer;
    bool inside_quotes(false);
    size_t last_quote(0);
    while (getline(in,buffer))
    {
        // --- deal with line breaks in quoted strings
        last_quote = buffer.find_first_of('"');
        while (last_quote != string::npos)
        {
            inside_quotes = !inside_quotes;
            last_quote = buffer.find_first_of('"',last_quote+1);
        }
        li.append(buffer);
        if (inside_quotes)
        {
            li.append("\n");
            continue;
        }
        // ---
        stuff.push_back(li);
        li.clear(); // clear here, next check could fail
    }
    in.close();
    //cout << stuff.size() << endl ;
    return stuff ;

}

Answer 4 (score: 0)

You're probably right that your code misbehaves because of the white space in the field values.

If you really do have "simple" CSV where no field can contain a comma in its value, then I would step away from the stream operators, and from C++ altogether. The sample program in the question merely reorders fields; there's no need to actually interpret or convert the values into their proper types (unless validation is also a goal). Reordering alone is super easy with awk. For example, the following command reverses the 3 fields found in a simple CSV file.

awk -F, '{ print $3","$2","$1 }' infile > outfile

If the goal really is to use this snippet as a launch pad for bigger and better ideas... then I would tokenize the line by searching for the commas. The std::string class has a built-in method for finding the offset of a particular character. You can make this approach as elegant or inelegant as you like; the most elegant versions end up looking a lot like the Boost tokenizer code.

The quick and dirty approach is to simply know that your program has N fields and look up the positions of the corresponding N-1 commas. Once you have those positions, it's straightforward to call std::string::substr to extract the fields of interest.