REGEX_EXTRACT error in Pig

Time: 2015-06-08 03:54:15

Tags: apache-pig

I have a CSV file with 3 columns: tweetid, tweet, and Userid. However, the tweet column contains commas inside its values.

For example, one row of data:

396124437168537600,"I really wish I didn't give up everything I did for you, I'm so mad at my self for even letting it get as far as it did.",savava143

I want to extract all 3 fields individually, but REGEX_EXTRACT gives an error in the following script:

a = LOAD tweets USING PigStorage(',') AS (f1,f2,f3);
b = FILTER a BY REGEX_EXTRACT(f1,'(.*)\\"(.*)',1);

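2 answers:

Answer 0 (score: 2)

In the use case you shared, reading the data with PigStorage(',') causes savava143 (the last field value) to be lost. PigStorage splits on every comma, including the one inside the quoted tweet, so the row yields four fields and the three-field schema keeps only the first three.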

A = LOAD '/Users/muralirao/learning/pig/a.csv' USING PigStorage(',') AS (f1,f2,f3);
DUMP A;

Output of A (note that the last field value is missing):

(396124437168537600,"I really wish I didn't give up everything I did for you, I'm so mad at my self for even letting it get as far as it did.")

For this use case, to extract all values from a CSV file whose field values contain ',', we can use CSVExcelStorage or CSVLoader.

Approach 1: Using CSVExcelStorage

Reference: http://pig.apache.org/docs/r0.12.0/api/org/apache/pig/piggybank/storage/CSVExcelStorage.html

Input: a.csv

396124437168537600,"I really wish I didn't give up everything I did for you, I'm so mad at my self for even letting it get as far as it did.",savava143

Pig script:

REGISTER piggybank.jar;
A = LOAD 'a.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage() AS (f1,f2,f3); 
DUMP A;

Output of A:

(396124437168537600,I really wish I didn't give up everything I did for you, I'm so mad at my self for even letting it get as far as it did.,savava143)

Approach 2: Using CSVLoader

Reference: http://pig.apache.org/docs/r0.9.1/api/org/apache/pig/piggybank/storage/CSVLoader.html

The script below uses CSVLoader(); DUMP A produces the same output as seen above.

A = LOAD 'a.csv' USING org.apache.pig.piggybank.storage.CSVLoader() AS (f1,f2,f3);

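Answer 1 (score: 0)

The mistake is that you do not want to FILTER based on a regular expression; you want to GENERATE a new field based on a regular expression. To filter, Pig needs to know whether each row should be kept or dropped, so FILTER requires a boolean condition, and REGEX_EXTRACT returns a string rather than a boolean.

So you have to use: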

b = FOREACH a GENERATE REGEX_EXTRACT(FIELD, REGEX, INDEX_OF_GROUP_TO_RETURN);
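
As a rough illustration of the difference, the sketch below assumes `a` was loaded with a CSV-aware loader, so that f2 holds the full tweet text. The regular expressions and the names `first_word` and `c` are made up for this example and are not from the question. The third argument of REGEX_EXTRACT is the index of the capture group to return, and MATCHES is the boolean operator to reach for when the goal really is filtering:

```pig
-- Hypothetical example: pull the first word of each tweet into a new field.
-- REGEX_EXTRACT returns the text of capture group 1, or null if the regex does not match.
b = FOREACH a GENERATE f1, REGEX_EXTRACT(f2, '^(\\w+).*', 1) AS first_word, f3;

-- FILTER needs a boolean condition; MATCHES tests the whole string against the regex.
c = FILTER a BY f2 MATCHES '.*mad.*';
```

If a row should be kept only when the extraction succeeds, a condition such as `REGEX_EXTRACT(...) IS NOT NULL` also gives FILTER the boolean it expects.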

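However, as @Murali Rao said, your values are not merely comma-separated, they are CSV (think about how you would handle a comma inside a tweet: it is not a field separator, just part of the content).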