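I have a large CSV file that looks like this:
23456,The End is Near,A silly description that makes no sense,http://www.example.com,45332,5th July 1998 Sunday,45.332
That is just one line of the CSV file. There are about 500k of these.
I want to parse this file in C++. The code I started with is: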
#include <iostream>
#include <fstream>
#include <string>
#include <sstream>
using namespace std;
int main()
{
// open the input csv file containing training data
ifstream inputFile("my.csv");
string line;
while (getline(inputFile, line, ','))
{
istringstream ss(line);
// declaring appropriate variables present in csv file
long unsigned id;
string url, title, description, datetaken;
float val1, val2;
ss >> id >> url >> title >> datetaken >> description >> val1 >> val2;
cout << url << endl;
}
inputFile.close();
}
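The problem is that it does not print the correct values.
I suspect it cannot handle the whitespace inside the fields. So what would you suggest I do?
Thanks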
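Answer 0 (score: 4)
In this example we have to use two getline calls to parse the input. The first, getline(cin, line), uses the default newline delimiter to fetch one line of CSV text. The second, getline(ss, field, ','), splits that line on commas.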
#include <iostream>
#include <sstream>
#include <string>
#include <vector>
float get_float(const std::string& s) {
std::stringstream ss(s);
float ret;
ss >> ret;
return ret;
}
int get_int(const std::string& s) {
std::stringstream ss(s);
int ret;
ss >> ret;
return ret;
}
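int main() {
std::string line;
// Outer loop: read one CSV record (one line) at a time from standard input.
while (std::getline(std::cin, line)) {
std::stringstream ss(line);
std::vector<std::string> v;
std::string field;
// Inner loop: split the line into fields on commas.
while (std::getline(ss, field, ',')) {
std::cout << " " << field;
v.push_back(field);
}
// Skip short lines instead of indexing past the end of v.
if (v.size() < 7) {
std::cout << std::endl;
continue;
}
int id = get_int(v[0]); // converted values, not used further in this demo
float f = get_float(v[6]);
std::cout << v[3] << std::endl;
}
}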
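Answer 1 (score: 1)
Using std::istream with the overloaded extraction operator (>>) to read std::strings is not going to work well. The entire line is one string, so by default the stream will not notice where one field ends and the next begins. A quick fix is to split line on commas and assign the values to the appropriate fields (instead of using std::istringstream).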
Note: this is in addition to jrok's point about std::getline.
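As a rough illustration of "split on commas and assign the values", here is a minimal sketch. The Row struct and the functions split_on_commas and parse_row are hypothetical names made up for this example, the field order follows the sample line in the question, and it assumes C++11 and "simple" CSV with no quoted fields or embedded commas.

```cpp
#include <exception>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record type for illustration; the field order follows the
// sample line in the question.
struct Row {
    unsigned long id = 0;
    std::string title, description, url, datetaken;
    float val1 = 0.0f, val2 = 0.0f;
};

// Split one line on commas. Quoted fields and embedded commas are NOT handled.
std::vector<std::string> split_on_commas(const std::string& line) {
    std::vector<std::string> fields;
    std::istringstream ss(line);
    std::string field;
    while (std::getline(ss, field, ',')) {
        fields.push_back(field);
    }
    return fields;
}

// Fill row from one line. Returns false if the line does not have exactly
// seven fields or if a numeric field cannot be converted.
bool parse_row(const std::string& line, Row& row) {
    const std::vector<std::string> f = split_on_commas(line);
    if (f.size() != 7) return false;
    try {
        row.id = std::stoul(f[0]);
        row.title = f[1];
        row.description = f[2];
        row.url = f[3];
        row.val1 = std::stof(f[4]);
        row.datetaken = f[5];
        row.val2 = std::stof(f[6]);
    } catch (const std::exception&) {
        return false;  // std::stoul / std::stof rejected the text
    }
    return true;
}
```

A caller would read each line with std::getline(inputFile, line) and pass it to parse_row, skipping or reporting any line for which it returns false.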
Answer 2 (score: 1)
Within the stated constraints, I think I would do something like this:
#include <algorithm>
#include <locale>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>
#include <iterator>
// A ctype that classifies only comma and new-line as "white space":
struct field_reader : std::ctype<char> {
field_reader() : std::ctype<char>(get_table()) {}
static std::ctype_base::mask const* get_table() {
static std::vector<std::ctype_base::mask>
rc(table_size, std::ctype_base::mask());
rc[','] = std::ctype_base::space;
rc['\n'] = std::ctype_base::space;
return &rc[0];
}
};
// A struct to hold one record from the file:
struct record {
std::string key, name, desc, url, zip, date, number;
friend std::istream &operator>>(std::istream &is, record &r) {
return is >> r.key >> r.name >> r.desc >> r.url >> r.zip >> r.date >> r.number;
}
friend std::ostream &operator<<(std::ostream &os, record const &r) {
return os << "key: " << r.key
<< "\nname: " << r.name
<< "\ndesc: " << r.desc
<< "\nurl: " << r.url
<< "\nzip: " << r.zip
<< "\ndate: " << r.date
<< "\nnumber: " << r.number;
}
};
int main() {
std::stringstream input("23456, The End is Near, A silly description that makes no sense, http://www.example.com, 45332, 5th July 1998 Sunday, 45.332");
// use our ctype facet with the stream:
input.imbue(std::locale(std::locale(), new field_reader()));
// read in all our records:
std::istream_iterator<record> in(input), end;
std::vector<record> records{ in, end };
// show what we read:
std::copy(records.begin(), records.end(),
std::ostream_iterator<record>(std::cout, "\n"));
}
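Granted, this is longer than most of the others, but it is all broken into small, mostly reusable pieces. Once the other pieces are in place, the code to read the data is trivial: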
std::vector<record> records{ in, end };
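Another point I find compelling: the first time the code compiled, it also ran correctly (and I find that quite routine for this style of programming).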
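One thing to watch for with this approach: because the facet classifies only comma and newline as whitespace, the blank that follows each comma in the sample input stays part of the field, so name is read as " The End is Near" with a leading space. If that matters, a small helper along these lines could be applied to each field after reading. The name trim is hypothetical and not part of the code above; this is only a sketch.

```cpp
#include <string>

// Hypothetical helper (not part of the answer's code): strip leading and
// trailing blanks and tabs from a single field.
std::string trim(const std::string& s) {
    const char* const blanks = " \t";
    const std::string::size_type first = s.find_first_not_of(blanks);
    if (first == std::string::npos) return std::string();  // empty or all blanks
    const std::string::size_type last = s.find_last_not_of(blanks);
    return s.substr(first, last - first + 1);
}
```

It could be called inside operator>> for record, for example r.name = trim(r.name) after the extraction.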
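Answer 3 (score: 0)
I just solved this problem for myself and am willing to share! It may be a little overkill, but it shows a working example of Boost Tokenizer & vectors handling a big problem.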
/*
* ALfred Haines Copyleft 2013
* convert csv to sql file
* csv2sql requires that each line is a unique record
*
* This example of file read and the Boost tokenizer
*
* In the spirit of COBOL I do not output until the end
* when all the print lines are output at once
* Special thanks to SBHacker for the code to handle linefeeds
*/
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <boost/tokenizer.hpp>
#include <boost/iostreams/device/file.hpp>
#include <boost/iostreams/stream.hpp>
#include <boost/algorithm/string/replace.hpp>
#include <vector>
namespace io = boost::iostreams;
using boost::tokenizer;
using boost::escaped_list_separator;
typedef tokenizer<escaped_list_separator<char> > so_tokenizer;
using namespace std;
using namespace boost;
vector<string> parser( string );
int main()
{
vector<string> stuff ; // this is the data in a vector
string filename; // this is the input file
string c = ""; // this holds the print line
string sr ;
cout << "Enter filename: " ;
cin >> filename;
//filename = "drwho.csv";
int lastindex = filename.find_last_of("."); // find where the extension begins
string rawname = filename.substr(0, lastindex); // extract the raw name
stuff = parser( filename ); // this gets the data from the file
/** I ask if the user wants a new_index to be created */
cout << "\n\nMySql requires a unique ID field as a Primary Key \n" ;
cout << "If the first field is not unique (no duplicate entries) \nthen you should create a " ;
cout << "New index field for this data.\n" ;
cout << "Not Sure! try no first to maintain data integrity.\n" ;
string ni ;
bool invalid_data = true;
bool new_index = false ;
do {
cout<<"Should I create a New Index now? (y/n)"<<endl;
cin>>ni;
if ( ni == "y" || ni == "n" ) { invalid_data =false ; }
} while (invalid_data);
cout << "\n" ;
if (ni == "y" )
{
new_index = true ;
sr = rawname.c_str() ; sr.append("_id" ); // new_index field
}
// now make the sql code from the vector stuff
// Create table section
c.append("DROP TABLE IF EXISTS `");
c.append(rawname.c_str() );
c.append("`;");
c.append("\nCREATE TABLE IF NOT EXISTS `");
c.append(rawname.c_str() );
c.append( "` (");
c.append("\n");
if (new_index)
{
c.append( "`");
c.append(sr );
c.append( "` int(10) unsigned NOT NULL,");
c.append("\n");
}
string s = stuff[0];// it is assumed that line zero has fieldnames
int x =0 ; // used to determine if new index is printed
// boost tokenizer code from the Boost website -- tok holds the token
so_tokenizer tok(s, escaped_list_separator<char>('\\', ',', '\"'));
for(so_tokenizer::iterator beg=tok.begin(); beg!=tok.end(); ++beg)
{
x++; // keeps number of fields for later use to eliminate the comma on the last entry
if (x == 1 && new_index == false ) sr = static_cast<string> (*beg) ;
c.append( "`" );
c.append(*beg);
if (x == 1 && new_index == false )
{
c.append( "` int(10) unsigned NOT NULL,");
}
else
{
c.append("` text ,");
}
c.append("\n");
}
c.append("PRIMARY KEY (`");
c.append(sr );
c.append("`)" );
c.append("\n");
c.append( ") ENGINE=InnoDB DEFAULT CHARSET=latin1;");
c.append("\n");
c.append("\n");
// The Create table section is done
// Now make the Insert lines one per line is safer in case you need to split the sql file
for (size_t w=1; w < stuff.size(); ++w) // size_t avoids a signed/unsigned comparison
{
c.append("INSERT INTO `");
c.append(rawname.c_str() );
c.append("` VALUES ( ");
if (new_index)
{
string String = static_cast<ostringstream*>( &(ostringstream() << w) )->str();
c.append(String);
c.append(" , ");
}
int p = 1 ; // used to eliminate the comma on the last entry
// tokenizer code needs unique name -- stok holds this token
so_tokenizer stok(stuff[w], escaped_list_separator<char>('\\', ',', '\"'));
for(so_tokenizer::iterator beg=stok.begin(); beg!=stok.end(); ++beg)
{
c.append(" '");
string str = static_cast<string> (*beg) ;
boost::replace_all(str, "'", "\\'");
// boost::replace_all(str, "\n", " -- ");
c.append( str);
c.append("' ");
if ( p < x ) c.append(",") ;// we dont want a comma on the last entry
p++ ;
}
c.append( ");\n");
}
// now print the whole thing to an output file
string out_file = rawname.c_str() ;
out_file.append(".sql");
io::stream_buffer<io::file_sink> buf(out_file);
std::ostream out(&buf);
out << c ;
// let the user know that they are done
cout<< "Well if you got here then the data should be in the file " << out_file << "\n" ;
return 0;
}
vector<string> parser( string filename )
{
typedef tokenizer< escaped_list_separator<char> > Tokenizer;
escaped_list_separator<char> sep('\\', ',', '\"');
vector<string> stuff ;
string data(filename);
ifstream in(filename.c_str());
string li;
string buffer;
bool inside_quotes(false);
size_t last_quote(0);
while (getline(in,buffer))
{
// --- deal with line breaks in quoted strings
last_quote = buffer.find_first_of('"');
while (last_quote != string::npos)
{
inside_quotes = !inside_quotes;
last_quote = buffer.find_first_of('"',last_quote+1);
}
li.append(buffer);
if (inside_quotes)
{
li.append("\n");
continue;
}
// ---
stuff.push_back(li);
li.clear(); // clear here, next check could fail
}
in.close();
//cout << stuff.size() << endl ;
return stuff ;
}
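Answer 4 (score: 0)
You are right to suspect that your code misbehaves because of the whitespace within the field values.
If you really do have "simple" CSV, where no field can contain a comma inside its value, then I would step away from the stream operators and perhaps from C++ altogether. The example program in the question merely reorders fields. There is no need to actually interpret the values or convert them to their appropriate types (unless validation is also a goal). Reordering alone is super easy to do with awk. For example, the following command reverses the 3 fields found in a simple CSV file.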
cat infile | awk -F, '{ print $3","$2","$1 }' > outfile
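If the goal really is to use this snippet as a launchpad for bigger and better ideas, then I would tokenize the line by searching for commas. The std::string class has a built-in method for finding the offset of a specific character. You can make this approach as elegant or as inelegant as you like. The most elegant versions end up looking a lot like the Boost tokenizer code.
The quick-and-dirty approach is to simply know that your program has N fields and look for the positions of the corresponding N-1 commas. Once you have those positions, it is straightforward to call std::string::substr to extract the fields of interest.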
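The quick-and-dirty idea might be sketched as follows, using std::string::find to locate the commas and std::string::substr to cut out one field. The function name nth_field is made up for this illustration, and the code assumes "simple" CSV with no quoting and no commas inside field values.

```cpp
#include <cstddef>
#include <string>

// Return the n-th (0-based) comma-separated field of line, or an empty
// string if the line has fewer than n+1 fields.
// Assumes "simple" CSV: no quoting, no commas inside field values.
std::string nth_field(const std::string& line, std::size_t n) {
    std::string::size_type start = 0;
    // Walk past the first n commas to find where field n begins.
    for (std::size_t i = 0; i < n; ++i) {
        const std::string::size_type comma = line.find(',', start);
        if (comma == std::string::npos) return std::string();
        start = comma + 1;
    }
    // The field ends at the next comma, or at the end of the line.
    const std::string::size_type end = line.find(',', start);
    return line.substr(start,
                       end == std::string::npos ? std::string::npos : end - start);
}
```

For the sample line in the question, nth_field(line, 3) would pick out the URL field. Each call rescans the line from the start, which is fine for pulling out one or two fields but wasteful if every field is needed.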