I push a bunch of gzipped TSV files to S3, and Athena parses them. However, the string fields just don't behave as expected: equality comparisons and the LIKE operator return nothing at all.
Table:
CREATE EXTERNAL TABLE Archives.Events(
  Id string, -- intentionally string
  DateCreated string,
  EventType smallint,
  EventDescription string,
  UserId int,
  UserName string
)
PARTITIONED BY (
  `year` int,
  `month` int,
  `day` int)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  LINES TERMINATED BY '\n'
LOCATION 's3://mybucket/Archives/Events'
TBLPROPERTIES ("skip.header.line.count"="1");
Problem:
Athena parses everything fine. Now, suppose there is a user named 'foo'.
--nothing returned
Select *
From events
Where username = 'foo'
--nothing returned
Select *
From events
Where username LIKE '%foo%'
--records returned
Select *
From events
Where username LIKE '%f%'
--nothing returned
Select *
From events
Where username LIKE 'f%'
I build the files in C# and encode them with System.Text.Encoding.UTF8. I also compress them with GZipStream. Maybe I should try recreating the table with varchar, but string seems to be... the recommended type for string fields!
Answer 0: (score: 0)
I suspect your fields are quoted. LazySimpleSerDe, which is what you get when you say ROW FORMAT DELIMITED, does not strip quotes from fields. You either have to change the data to be unquoted, or use OpenCSVSerDe, which supports quoted fields.
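For reference, a minimal sketch of the same table declared with OpenCSVSerDe (the separatorChar/quoteChar values are assumptions here; adjust them to match your actual data):

```sql
CREATE EXTERNAL TABLE Archives.Events(
  Id string,
  DateCreated string,
  EventType smallint,
  EventDescription string,
  UserId int,
  UserName string
)
PARTITIONED BY (`year` int, `month` int, `day` int)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = '\t',
  'quoteChar' = '"'
)
LOCATION 's3://mybucket/Archives/Events'
TBLPROPERTIES ("skip.header.line.count"="1");
```

One caveat to check in the Athena docs: OpenCSVSerDe reads every column as STRING, so the smallint/int columns may need to be declared as string and cast in queries.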
Answer 1: (score: 0)
The crux of the issue was the encoding.
(BAD) Encoding.Unicode = string fields are returned in Athena, but are not searchable.
using (MemoryStream memoryStream = new MemoryStream())
using (StreamWriter streamWriter = new StreamWriter(memoryStream, Encoding.Unicode))
{
    writeToStream(rows, streamWriter);
    bytes = compress(memoryStream);
}
(GOOD) Encoding.UTF8 = string fields are returned and are searchable.
using (MemoryStream memoryStream = new MemoryStream())
using (StreamWriter streamWriter = new StreamWriter(memoryStream, Encoding.UTF8))
{
    writeToStream(rows, streamWriter);
    bytes = compress(memoryStream);
}
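The byte layout explains the symptoms above: .NET's Encoding.Unicode is UTF-16LE, which inserts a NUL byte after every ASCII character, while Athena reads the file as UTF-8 bytes. A quick Python sketch of the byte level (illustration only, not the original C# pipeline):

```python
# UTF-16LE ("Encoding.Unicode" in .NET) puts a 0x00 byte after each ASCII char.
utf16 = "foo".encode("utf-16-le")   # b'f\x00o\x00o\x00'
utf8 = "foo".encode("utf-8")        # b'foo'

# The contiguous bytes b'foo' never appear in the UTF-16LE data,
# so equality and LIKE '%foo%' match nothing:
print(b"foo" in utf16)  # False

# ...but the single byte b'f' is still present, so LIKE '%f%' matches:
print(b"f" in utf16)    # True

# The tab delimiter is encoded as b'\t\x00', so splitting on the single
# byte b'\t' leaves a leading NUL on every field after the first --
# which would also make a prefix match like LIKE 'f%' fail:
line = "42\tfoo".encode("utf-16-le")
print(line.split(b"\t")[1])  # b'\x00f\x00o\x00o\x00'
```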