Why doesn't AWS Athena like my string fields?

Asked: 2019-08-09 01:39:11

Tags: c# amazon-athena

I pushed a bunch of gzip-compressed TSV files to S3, and Athena is parsing them. However, the string fields simply don't behave as expected: no equality comparison or LIKE operator works at all.

CREATE EXTERNAL TABLE Archives.Events(
    Id string,     --intentionally string
    DateCreated string,
    EventType smallint,
    EventDescription string,
    UserId int,
    UserName string
)
PARTITIONED BY ( 
  `year` int, 
  `month` int, 
  `day` int)
  ROW FORMAT DELIMITED 
    fields terminated by '\t' 
    lines terminated by '\n' 
location 's3://mybucket/Archives/Events'
tblproperties ("skip.header.line.count"="1");
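Since the table is partitioned, Athena only returns data for partitions that have been registered. A minimal sketch of the standard options (the `year=2019/month=8/day=9` path layout is an assumption, not taken from the question):

```sql
-- Works only if the S3 keys use Hive-style key=value partition folders:
MSCK REPAIR TABLE Archives.Events;

-- Otherwise register each partition explicitly (hypothetical path):
ALTER TABLE Archives.Events ADD PARTITION (year=2019, month=8, day=9)
LOCATION 's3://mybucket/Archives/Events/2019/08/09/';
```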

The problem:

Athena has parsed everything. Now suppose there is a username 'foo'.

--nothing returned
Select *
From events
Where username = 'foo'


--nothing returned
Select *
From events
Where username LIKE '%foo%'


--records returned
Select *
From events
Where username LIKE '%f%'


--nothing returned
Select *
From events
Where username LIKE 'f%'

I build the files in C# and encode them with System.Text.Encoding.UTF8. I also compress them with GZipStream. Maybe I should try recreating the table with varchar, but string seems to be... the recommended type for string fields!
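For reference, the equivalent write pipeline sketched in Python (column names taken from the DDL above; the row values are made up for illustration). Athena's LazySimpleSerDe expects plain UTF-8 bytes with no BOM:

```python
import gzip
import io

# Hypothetical sample row matching the Archives.Events columns.
rows = [("1", "2019-08-08T12:00:00Z", "3", "login", "7", "foo")]

buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
    # Header line, skipped by Athena via skip.header.line.count=1.
    gz.write("Id\tDateCreated\tEventType\tEventDescription\tUserId\tUserName\n".encode("utf-8"))
    for row in rows:
        gz.write(("\t".join(row) + "\n").encode("utf-8"))

data = buf.getvalue()  # bytes ready to upload to s3://mybucket/Archives/Events/...

# Round-trip check: the username survives as plain UTF-8 'foo'.
assert gzip.decompress(data).decode("utf-8").splitlines()[1].split("\t")[5] == "foo"
```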

2 answers:

Answer 0 (score: 0)

I suspect your fields are quoted. LazySimpleSerDe, which is what you get when you say ROW FORMAT DELIMITED, does not strip quotes from fields.

You either have to change the data so it is unquoted, or use OpenCSVSerDe, which supports quoted fields.
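As a sketch, the table could be recreated with OpenCSVSerDe (the `separatorChar` and `quoteChar` values are assumptions; note that OpenCSVSerDe exposes every column as STRING, so the numeric columns change type):

```sql
CREATE EXTERNAL TABLE Archives.Events(
    Id string,
    DateCreated string,
    EventType string,        -- OpenCSVSerDe reads all columns as string
    EventDescription string,
    UserId string,
    UserName string
)
PARTITIONED BY (`year` int, `month` int, `day` int)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = '\t',
  'quoteChar' = '"'
)
LOCATION 's3://mybucket/Archives/Events'
TBLPROPERTIES ("skip.header.line.count"="1");
```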

Answer 1 (score: 0)

The crux of the issue was the encoding.

(BAD) Encoding.Unicode = string fields come back in Athena but are not searchable.

(GOOD) Encoding.UTF8 = string fields come back and are searchable:

    using (MemoryStream memoryStream = new MemoryStream())
    using (StreamWriter streamWriter = new StreamWriter(memoryStream, Encoding.UTF8))
    {
        writeToStream(rows, streamWriter);

        bytes = compress(memoryStream);
    }

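The difference is visible at the byte level. A minimal Python illustration (an assumption added for clarity, not from the original answer): .NET's Encoding.Unicode is UTF-16LE with a BOM, so every ASCII character is followed by a null byte, and Athena's byte-oriented LazySimpleSerDe never matches a UTF-8 literal like 'foo':

```python
row = "42\tfoo\n"

utf8_bytes = row.encode("utf-8")
# Python's "utf-16" mirrors .NET's Encoding.Unicode: little-endian with a BOM.
utf16_bytes = row.encode("utf-16")

print(utf8_bytes)   # b'42\tfoo\n'
print(utf16_bytes)  # b'\xff\xfe4\x002\x00\t\x00f\x00o\x00o\x00\n\x00'

# LazySimpleSerDe splits on the raw tab byte 0x09 and compares raw bytes,
# so the UTF-16 field b'f\x00o\x00o\x00' never equals the literal b'foo'.
assert b"foo" in utf8_bytes
assert b"foo" not in utf16_bytes
```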