I have a JSON file (.json) in Amazon S3. I need to read it and add a new field called Hash_index to each JsonObject. The file is very large, so I am using the GSON library to avoid an OutOfMemoryError while reading it. Here is my code; note that I am using GSON's streaming JsonReader.
//Create the hashed JSON
public void createHash() throws IOException
{
    System.out.println("Hash Creation Started");
    strBuffer = new StringBuffer("");
    try
    {
        //List all the buckets
        List<Bucket> buckets = s3.listBuckets();
        for (int i = 0; i < buckets.size(); i++)
        {
            System.out.println("- " + buckets.get(i).getName());
        }

        //Download the object
        System.out.println("Downloading Object");
        S3Object s3Object = s3.getObject(new GetObjectRequest(inputBucket, inputFile));
        System.out.println("Content-Type: " + s3Object.getObjectMetadata().getContentType());

        //Read the JSON file
        /*BufferedReader reader = new BufferedReader(new InputStreamReader(s3Object.getObjectContent()));
        while (true) {
            String line = reader.readLine();
            if (line == null) break;
            // System.out.println(" " + line);
            strBuffer.append(line);
        }*/
        // JSONTokener jTokener = new JSONTokener(new BufferedReader(new InputStreamReader(s3Object.getObjectContent())));
        // jsonArray = new JSONArray(jTokener);
        JsonReader reader = new JsonReader(new BufferedReader(new InputStreamReader(s3Object.getObjectContent())));
        reader.beginArray();
        int gsonVal = 0;
        while (reader.hasNext()) {
            JsonParser _parser = new JsonParser();
            JsonElement jsonElement = _parser.parse(reader);
            JsonObject jsonObject1 = jsonElement.getAsJsonObject();
            //Do something
            StringBuffer hashIndex = new StringBuffer("");
            //Add title and body together to the list
            String titleAndBodyContainer = jsonObject1.get("title") + " " + jsonObject1.get("body");
            //Remove full stops and commas
            titleAndBodyContainer = titleAndBodyContainer.replaceAll("\\.(?=\\s|$)", " ");
            titleAndBodyContainer = titleAndBodyContainer.replaceAll(",", " ");
            titleAndBodyContainer = titleAndBodyContainer.toLowerCase();
            //Create a word list without duplicated words
            StringBuilder result = new StringBuilder();
            HashSet<String> set = new HashSet<String>();
            for (String s : titleAndBodyContainer.split(" ")) {
                if (!set.contains(s)) {
                    result.append(s);
                    result.append(" ");
                    set.add(s);
                }
            }
            //System.out.println(result.toString());
            //Re-arrange everything into alphabetical order
            String testString = "acarpous barnyard gleet diabolize acarus creosol eaten gleet absorbance";
            //String testHash = "057 1$k 983 5*1 058 52j 6!v 983 03z";
            String[] finalWordHolder = result.toString().split(" ");
            Arrays.sort(finalWordHolder);
            //Navigate through the text and create the hash
            for (int arrayCount = 0; arrayCount < finalWordHolder.length; arrayCount++)
            {
                if (wordMap.containsKey(finalWordHolder[arrayCount]))
                {
                    hashIndex.append((String) wordMap.get(finalWordHolder[arrayCount]));
                }
            }
            //System.out.println(hashIndex.toString().trim());
            jsonObject1.addProperty("hash_index", hashIndex.toString().trim());
            jsonObject1.addProperty("primary_key", gsonVal);
            jsonObjectHolder.add(jsonObject1); //Add the JSON object to the JSON collection
            jsonHashHolder.add(hashIndex.toString().trim());
            System.out.println("Primary Key: " + jsonObject1.get("primary_key"));
            //System.out.println(Arrays.toString(finalWordHolder));
            //System.out.println("- " + hashIndex.toString());
            //break;
            gsonVal++;
        }
        System.out.println("Hash Creation Completed");
    }
    catch (Exception e)
    {
        e.printStackTrace();
    }
}
When I execute this code, I get the following error:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2894)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:117)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:407)
at java.lang.StringBuilder.append(StringBuilder.java:136)
at HashCreator.createHash(HashCreator.java:252)
at HashCreator.<init>(HashCreator.java:66)
at Main.main(Main.java:9)
Line 252 is result.append(s); — it is inside the HashSet loop.
Before that, it generated the OutOfMemoryError at line 254.
Line 254 is set.add(s); — it is inside the same loop.
My JSON files are very large: gigabytes, even terabytes. I do not know how to avoid the problem above.
Answer (score: 1)
Use a streaming JSON library like Jackson. Read in a few JSON objects, add the hash, then write them out. Then read a few more, process them, and write them out too. Keep going until you have processed all the objects.
http://wiki.fasterxml.com/JacksonInFiveMinutes#Streaming_API_Example
(See also this StackOverflow post: Is there a streaming API for JSON?)
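A minimal sketch of that read-hash-write loop, using Jackson 2.x's streaming parser together with its tree model. It assumes the top level of the file is a single JSON array of objects; the class and method names are made up for illustration, and computeHashIndex is a hypothetical stand-in for the word-hashing logic in the question:

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonGenerator;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ObjectNode;

public class StreamingHasher {

    public void addHashes(InputStream in, OutputStream out) throws IOException {
        JsonFactory factory = new JsonFactory();
        ObjectMapper mapper = new ObjectMapper(factory);
        try (JsonParser parser = factory.createParser(in);
             JsonGenerator generator = factory.createGenerator(out)) {
            if (parser.nextToken() != JsonToken.START_ARRAY) {
                throw new IOException("Expected a top-level JSON array");
            }
            generator.writeStartArray();
            int primaryKey = 0;
            // Read one object at a time, annotate it, and write it straight
            // back out, so only a single object is ever held in memory.
            while (parser.nextToken() == JsonToken.START_OBJECT) {
                ObjectNode node = mapper.readTree(parser);
                node.put("hash_index", computeHashIndex(node)); // hypothetical helper
                node.put("primary_key", primaryKey++);
                mapper.writeTree(generator, node);
            }
            generator.writeEndArray();
        }
    }

    // Hypothetical placeholder for the title/body word-hashing in the question.
    private String computeHashIndex(ObjectNode node) {
        return "";
    }
}

Because each object is written out as soon as it is hashed, heap use stays roughly constant regardless of file size. Note that in the original code it is not only result and set that grow: jsonObjectHolder and jsonHashHolder also accumulate an entry per object, so even a streaming reader will eventually exhaust the heap if everything is collected in memory instead of being written out.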