Question

I am trying to import a large dataset (this one: https://www.kaggle.com/secareanualin/football-events/data) into Cassandra, but I am stuck. I created the table with the following command:

create table test.football_event(id_odsp text, id_event text, sort_order text, time text, text text, event_type text, event_type2 text, side text, event_team text, opponent text, player text, player2 text, player_in text, player_out text, shot_place text, shot_outcome text, is_goal text, location text, bodypart text, assist_method text, situation text, fast_break text, primary key(id_odsp));

This table matches the CSV that contains the data. When I try to import it with this command

copy test.football_event(id_odsp, id_event, sort_order, time, text, event_type, event_type2, side, event_team, opponent, player, player2, player_in, player_out, shot_place, shot_outcome, is_goal, location, bodypart, assist_method, situation, fast_break) from '/path/to/events_import.csv' with delimiter = ',';

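I get the following error: Failed to import XX rows: ParseError - Invalid row length 24 should be 23, given up without retries, or the same error with row length 23 should be 22. I assumed the data in the CSV is not perfect and contains some errors, so I increased the number of columns in the table to 24, but that did not solve the problem.

I was wondering whether there is an option to control the level of "strictness" during the import, but I could not find anything about it. I would like an option that lets me fill the whole table row when the row length is 24, or add one or two nulls in the last fields when the row length is 23 or 22.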

In case it matters, I am running Cassandra on Linux Mint 18.1.

Thanks in advance

Answer 1

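Cassandra / Scylla are schema-enforced systems, so the schema should contain every required column. The COPY command expects each row to have the same number of elements as the number of columns listed in the command. In Cassandra / Scylla, the COPY command should create an error file on the loader node, and that error file should contain the rows that caused the problem. You can look at the failed rows, decide whether they matter to you, and remove or fix them.

This does not mean that the other rows were not loaded correctly. See the example below. The CSV file looks like this: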

cat myfile.csv
id,col1,col2,col3,col4
1,bob,alice,charlie,david
2,bob,charlie,david,bob
3,alice,bob,david
4,david,bob,alice

cqlsh> create KEYSPACE myks WITH replication = {'class':'SimpleStrategy', 'replication_factor': 1};

cqlsh> USE myks ;

cqlsh:myks> create TABLE mytable (id int PRIMARY KEY,col1 text,col2 text,col3 text ,col4 text);

cqlsh> COPY myks.mytable (id, col1, col2, col3 , col4 ) FROM 'myfile.csv' WITH HEADER= true  ;

Using 1 child processes

Starting copy of myks.mytable with columns [id, col1, col2, col3, col4].
Failed to import 2 rows: ParseError - Invalid row length 4 should be 5,  given up without retries

Failed to process 2 rows; 

failed rows written to import_myks_mytable.err
Processed: 4 rows; Rate:       7 rows/s; Avg. rate:      10 rows/s
4 rows imported from 1 files in 0.386 seconds (0 skipped).

cqlsh> SELECT * FROM myks.mytable ;

 id | col1 | col2    | col3    | col4
----+------+---------+---------+-------
  1 |  bob |   alice | charlie | david
  2 |  bob | charlie |   david |   bob

The error file shows which rows were problematic:

cat import_myks_mytable.err
3,alice,bob,david
4,david,bob,alice
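Since COPY itself rejects rows of the wrong length, one way to get the padding behaviour asked for in the question is to fix the file before importing it. The following is only a minimal sketch under stated assumptions, not part of cqlsh: the function name pad_rows and the constant EXPECTED_COLUMNS are hypothetical, and the expected count must be set to the number of columns listed in your COPY command. It pads short rows with empty trailing fields and sets aside rows that are too long, since those cannot be repaired automatically.

```python
import csv
import io

# Hypothetical value: the number of columns listed in the COPY command.
EXPECTED_COLUMNS = 5


def pad_rows(src, good, bad, expected=EXPECTED_COLUMNS):
    """Copy CSV rows from src to good, padding short rows with empty fields.

    Rows longer than `expected` are written to bad for manual inspection.
    Returns a tuple (rows_written, rows_padded, rows_rejected).
    """
    writer = csv.writer(good, lineterminator="\n")
    rejects = csv.writer(bad, lineterminator="\n")
    written = padded = rejected = 0
    for row in csv.reader(src):
        if not row:  # skip completely blank lines
            continue
        if len(row) > expected:
            rejects.writerow(row)
            rejected += 1
            continue
        if len(row) < expected:
            # Append empty strings until the row has the expected length.
            row = row + [""] * (expected - len(row))
            padded += 1
        writer.writerow(row)
        written += 1
    return written, padded, rejected


if __name__ == "__main__":
    # Small in-memory demo based on the sample file from the answer.
    sample = io.StringIO(
        "id,col1,col2,col3,col4\n"
        "1,bob,alice,charlie,david\n"
        "3,alice,bob,david\n"
    )
    good, bad = io.StringIO(), io.StringIO()
    print(pad_rows(sample, good, bad))
    print(good.getvalue(), end="")
```

For a real file, open the source CSV and the two output files with open(..., newline="") and pass them in place of the StringIO objects, then point COPY at the padded file. As far as I know, cqlsh treats empty fields as null by default, so padded columns should end up null, but check this against your version, and note that a primary key column can never be empty.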
