Question

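My scenario: I am trying to use Python code in AWS Lambda to get the word count and the detected language of a specific text file stored in AWS S3. Below is the code I am trying. It gives the line count, but I don't know how to get the word count or detect the language. Please suggest some ways to get the file's word count and language.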

Here is what I tried for the line count:

import boto3

def lambda_handler(event, context):

    # create the s3 resource
    s3 = boto3.resource('s3')

    # get the file object
    obj = s3.Object('bucket name', 'sample.txt')

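    # read the file contents into memory and decode the bytes to text
    # (on Python 3, read() returns bytes, so count('\n') on it would raise TypeError)
    file_contents = obj.get()["Body"].read().decode('utf-8')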

    # count the newline characters to get the number of lines
    return {
        'Line Count': file_contents.count('\n')
    }

Current response:

{
    "Line Count": 48
}

Expected response:

{
    "Line Count": 48,
    "Word Count": ?,   // I want the word count here
    "Language": ?      // and the language name here
}

Answer 1

To get the word count, you can try any of the approaches listed under: How to count the number of words in a sentence, ignoring numbers, punctuation and whitespace?
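As a rough illustration of one such approach, here is a minimal sketch that uses only the standard library. The function name `count_words` is hypothetical, and the definition of a "word" (a run of letters, optionally joined by an internal apostrophe or hyphen, with digits and punctuation ignored) is an assumption you may want to change.

```python
import re

# Assumed definition of a word: one or more Unicode letters, optionally
# followed by apostrophe- or hyphen-joined letter runs ("it's", "well-known").
# Digits, underscores, punctuation and whitespace never count as words.
_WORD_RE = re.compile(r"[^\W\d_]+(?:['\-][^\W\d_]+)*")


def count_words(text):
    """Return the number of words in an already-decoded text string."""
    return len(_WORD_RE.findall(text))
```

Inside the handler from the question this could be called as `count_words(file_contents)`, provided `file_contents` has been decoded to a string first.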

To detect the language, you can try one of the approaches listed under: NLTK and language detection
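Since the code already runs on AWS, one alternative to NLTK (not the method from the linked thread) is Amazon Comprehend's `detect_dominant_language` call, which is available through boto3. The sketch below is an assumption-laden outline: the function name `detect_language`, the `max_bytes` cap, and the choice to return only the top language code are all hypothetical, and the Lambda role would need permission for `comprehend:DetectDominantLanguage`. Comprehend limits the size of the `Text` argument, so check the current documentation for the exact limit; only a leading sample of the file is sent here.

```python
def detect_language(text, comprehend=None, max_bytes=5000):
    """Return the most likely language code (e.g. 'en'), or None if unknown.

    `comprehend` is expected to be a boto3 Comprehend client; it is a
    parameter so a stub can be passed in tests. `max_bytes` is an assumed,
    conservative cap on how much text is sent, not the service limit.
    """
    # Truncate on a byte budget, dropping any partial trailing character.
    sample = text.encode("utf-8")[:max_bytes].decode("utf-8", errors="ignore")
    if not sample.strip():
        return None  # nothing to analyse

    if comprehend is None:
        import boto3  # imported lazily so the function can be tested offline
        comprehend = boto3.client("comprehend")

    response = comprehend.detect_dominant_language(Text=sample)
    languages = response.get("Languages", [])
    if not languages:
        return None

    # Each entry has a 'LanguageCode' and a confidence 'Score'.
    best = max(languages, key=lambda item: item["Score"])
    return best["LanguageCode"]
```

The result is a language code rather than a language name, so mapping it to a display name (for example `'en'` to `'English'`) would be a separate step.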

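Unfortunately, your question is rather broad. Moreover, detecting the language of a text is hard to get right. Getting a word count is easy, but the result depends heavily on how you define a word.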
