Why doesn't AWS Athena like my string fields?

Asked: 2019-08-09 01:39:11

Tags: c# amazon-athena

I pushed a bunch of gzip-compressed TSV files to S3, and Athena is parsing them. However, the string fields simply don't behave as expected: no equality comparison or LIKE operator works at all.

CREATE EXTERNAL TABLE Archives.Events(
    Id string,     --intentionally string
    DateCreated string,
    EventType smallint,
    EventDescription string,
    UserId int,
    UserName string
)
PARTITIONED BY ( 
  `year` int, 
  `month` int, 
  `day` int)
  ROW FORMAT DELIMITED 
    fields terminated by '\t' 
    lines terminated by '\n' 
location 's3://mybucket/Archives/Events'
tblproperties ("skip.header.line.count"="1");
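Since the table is partitioned, Athena only returns data for partitions that have been registered. A minimal sketch of the standard options (the `year=2019/month=8/day=9` path layout is an assumption, not taken from the question):

```sql
-- Works only if the S3 keys use Hive-style key=value partition folders:
MSCK REPAIR TABLE Archives.Events;

-- Otherwise register each partition explicitly (hypothetical path):
ALTER TABLE Archives.Events ADD PARTITION (year=2019, month=8, day=9)
LOCATION 's3://mybucket/Archives/Events/2019/08/09/';
```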

The problem:

Athena has parsed everything. Now suppose there is a username 'foo'.

--nothing returned
Select *
From events
Where username = 'foo'


--nothing returned
Select *
From events
Where username LIKE '%foo%'


--records returned
Select *
From events
Where username LIKE '%f%'


--nothing returned
Select *
From events
Where username LIKE 'f%'

I build the files in C# and encode them with System.Text.Encoding.UTF8. I also compress them with GZipStream. Maybe I should try recreating the table with varchar, but string seems to be... the recommended type for string fields!
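For reference, the equivalent write pipeline sketched in Python (column names taken from the DDL above; the row values are made up for illustration). Athena's LazySimpleSerDe expects plain UTF-8 bytes with no BOM:

```python
import gzip
import io

# Hypothetical sample row matching the Archives.Events columns.
rows = [("1", "2019-08-08T12:00:00Z", "3", "login", "7", "foo")]

buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
    # Header line, skipped by Athena via skip.header.line.count=1.
    gz.write("Id\tDateCreated\tEventType\tEventDescription\tUserId\tUserName\n".encode("utf-8"))
    for row in rows:
        gz.write(("\t".join(row) + "\n").encode("utf-8"))

data = buf.getvalue()  # bytes ready to upload to s3://mybucket/Archives/Events/...

# Round-trip check: the username survives as plain UTF-8 'foo'.
assert gzip.decompress(data).decode("utf-8").splitlines()[1].split("\t")[5] == "foo"
```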

2 answers:

Answer 0 (score: 0)

I suspect your fields are quoted. LazySimpleSerDe, which is what you get when you say ROW FORMAT DELIMITED, does not strip quotes from fields.

You either have to change the data so it is unquoted, or use OpenCSVSerDe, which supports quoted fields.
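As a sketch, the table could be recreated with OpenCSVSerDe (the `separatorChar` and `quoteChar` values are assumptions; note that OpenCSVSerDe exposes every column as STRING, so the numeric columns change type):

```sql
CREATE EXTERNAL TABLE Archives.Events(
    Id string,
    DateCreated string,
    EventType string,        -- OpenCSVSerDe reads all columns as string
    EventDescription string,
    UserId string,
    UserName string
)
PARTITIONED BY (`year` int, `month` int, `day` int)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = '\t',
  'quoteChar' = '"'
)
LOCATION 's3://mybucket/Archives/Events'
TBLPROPERTIES ("skip.header.line.count"="1");
```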

Answer 1 (score: 0)

The crux of the issue was the encoding.

(BAD) Encoding.Unicode = string fields come back in Athena but are not searchable.

(GOOD) Encoding.UTF8 = string fields come back and are searchable:

    using (MemoryStream memoryStream = new MemoryStream())
    using (StreamWriter streamWriter = new StreamWriter(memoryStream, Encoding.UTF8))
    {
        writeToStream(rows, streamWriter);

        bytes = compress(memoryStream);
    }

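The difference is visible at the byte level. A minimal Python illustration (an assumption added for clarity, not from the original answer): .NET's Encoding.Unicode is UTF-16LE with a BOM, so every ASCII character is followed by a null byte, and Athena's byte-oriented LazySimpleSerDe never matches a UTF-8 literal like 'foo':

```python
row = "42\tfoo\n"

utf8_bytes = row.encode("utf-8")
# Python's "utf-16" mirrors .NET's Encoding.Unicode: little-endian with a BOM.
utf16_bytes = row.encode("utf-16")

print(utf8_bytes)   # b'42\tfoo\n'
print(utf16_bytes)  # b'\xff\xfe4\x002\x00\t\x00f\x00o\x00o\x00\n\x00'

# LazySimpleSerDe splits on the raw tab byte 0x09 and compares raw bytes,
# so the UTF-16 field b'f\x00o\x00o\x00' never equals the literal b'foo'.
assert b"foo" in utf8_bytes
assert b"foo" not in utf16_bytes
```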