Question

I'm trying to process a large dataset, but the data is split across hundreds of directories.

data/:   0 1 2 3 4 5 6 7 8 9 a b c d e f g h i j k l m n o p q r s t u v x x z

data/0:   0 1 2 3 4 5 6 7 8 9 a b c d e f g h i j k l m n o p q r s s w w y y z

data/1:   0 1 2 3 4 5 6 7 8 9 a b c d e f g h i j k l m n o p q r s s w w y y z

data/2:   0 1 2 3 4 5 6 7 8 9 a b c d e f g h i j k l m n o p q r s s w w y y z

data/3:   0 1 2 3 4 5 6 7 8 9 a b c d e f g h i j k l m n o p q r s s w w y y z

data/4:   0 1 2 3 4 5 6 7 8 9 a b c d e f g h i j k l m n o p q r s s w w y y z

data/5:   0 1 2 3 4 5 6 7 8 9 a b c d e f g h i j k l m n o p q r s s w w y y z

data/6:   0 1 2 3 4 5 6 7 8 9 a b c d e f g h i j k l m n o p q r s s w w y y z

data/7:   0 1 2 3 4 5 6 7 8 9 a b c d e f g h i j k l m n o p q r s s w w y y z

data/8:   0 1 2 3 4 5 6 7 8 9 a b c d e f g h i j k l m n o p q r s s w w y y z

data/9:   0 1 2 3 4 5 6 7 8 9 a b c d e f g h i j k l m n o p q r s s w w y y z

data/a:   0 1 2 3 4 5 6 7 8 9 a b c d e f g h i j k l m n o p q r s s w w y y z

In addition, the file types are completely random.

0: UTF-8 Unicode text

1: UTF-8 Unicode text

2: UTF-8 Unicode text

3: UTF-8 Unicode text

4: UTF-8 Unicode text

5: Non-ISO extended-ASCII text, with LF, NEL line terminators

6: UTF-8 Unicode text

7: UTF-8 Unicode text

8: UTF-8 Unicode text

9: UTF-8 Unicode text

a: UTF-8 Unicode text

...

z: UTF-8 Unicode text

The files contain lines in email:password format.

How can I put everything into a JSON file or a CSV file?

I want to import the data into MongoDB.

Thanks.

Answer 1

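I'm sure someone else can help you better than I can, but I'll try to point you in the right direction.

Have you tried writing a Perl script? For example: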

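    opendir(my $dir, ".") or die "Cannot open directory: $!";
    # Adjust the pattern to match your own file names
    my @files = grep { /\.cnf$/ } readdir($dir);
    closedir($dir);

    foreach my $file (@files) {
        # shove the contents into a JSON file
    }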

Something like that?

Answer 2

The question is tagged python, so I'd suggest os.walk() (documentation) to read the files recursively. Something like:

import os

# path is the path to the data
for subdir, dirs, files in os.walk(path):
    for file in files:
        file_path = os.path.join(subdir, file)
        try:
            # read_file is a placeholder: read the file and push to mongo etc.
            read_file(file_path)
        except Exception:
            # skip files that cannot be read or parsed
            continue
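To make the loop above concrete, here is a minimal sketch of walking the tree and writing every email:password pair to a single CSV file. The names iter_credentials and export_csv are made up for illustration, and the sketch assumes one pair per line, split on the first colon. Undecodable bytes are replaced rather than raising.

```python
import csv
import os


def iter_credentials(root):
    """Yield (email, password) pairs from every file under root."""
    for subdir, _dirs, files in os.walk(root):
        for name in files:
            file_path = os.path.join(subdir, name)
            # errors="replace" keeps going on bytes that are not valid UTF-8
            with open(file_path, encoding="utf-8", errors="replace") as handle:
                for line in handle:
                    line = line.rstrip("\r\n")
                    # Split on the first colon only: passwords may contain ':'
                    email, sep, password = line.partition(":")
                    if sep and email:
                        yield email, password


def export_csv(root, out_path):
    """Write all pairs to a CSV file with a header row; return the row count."""
    count = 0
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["email", "password"])
        for email, password in iter_credentials(root):
            writer.writerow([email, password])
            count += 1
    return count
```

Keep out_path outside the data directory, otherwise the walk will pick up the output file as well. A CSV with a header row like this can then be loaded with mongoimport using --type csv --headerline.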

As for the second part, reading the non-ISO extended-ASCII text, there are some answers that may help here: File encoding from English text to UTF-8
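For the odd file that is not valid UTF-8, one possible approach is to try UTF-8 first and fall back to a single-byte codec. This is only a sketch: decode_bytes and read_lines are hypothetical helpers, and latin-1 as the fallback is an assumption, since the real encoding of those files is unknown.

```python
def decode_bytes(raw, fallback="latin-1"):
    """Decode as UTF-8 when valid, otherwise with the fallback codec."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # With latin-1 every byte maps to a code point, so decoding succeeds,
        # but the resulting characters are only a guess at the real encoding.
        return raw.decode(fallback)


def read_lines(file_path):
    """Return the non-empty lines of a file, decoded with decode_bytes."""
    with open(file_path, "rb") as handle:
        text = decode_bytes(handle.read())
    # Split on LF only: str.splitlines() would also break on NEL (U+0085)
    stripped = (line.rstrip("\r") for line in text.split("\n"))
    return [line for line in stripped if line]
```

Splitting on LF explicitly means a stray NEL byte inside a password does not cut the line in two. If you know the files came from a Windows system, passing fallback="cp1252" may give more sensible characters, though that codec can still fail on a few undefined bytes.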

How do I recursively extract data on Linux?