I have two large files collected from Stack Overflow, named posts.xml and questions.txt, with the following structure:
posts.xml:
<posts>
<row Id="4" PostTypeId="1" AcceptedAnswerId="7" CreationDate="2008-07-31T21:42:52.667" Score="322" ViewCount="21888" Body="..."/>
<row Id="6" PostTypeId="1" AcceptedAnswerId="31" CreationDate="2008-07-31T22:08:08.620" Score="140" ViewCount="10912" Body="..." />
...
</posts>
A post can be either a question or an answer.
questions.txt:
Id,CreationDate,CreationDatesk,Score
123,2008-08-01 16:08:52,20080801,48
126,2008-08-01 16:10:30,20080801,33
...
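I want to make a single pass over the posts and use Lucene to index only the selected rows (those whose Id appears in questions.txt). Because the XML file is very large (about 50 GB), query and indexing time matters to me.
Now the question is: how do I find all the rows in posts.xml whose Id also appears in questions.txt?
Here is my approach so far: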
SAXParserDemo.java:
import java.io.File;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

public class SAXParserDemo {
    public static void main(String[] args) {
        try {
            File inputFile = new File("D:\\University\\Information Retrieval 2\\Hws\\Hw1\\files\\Posts.xml");
            SAXParserFactory factory = SAXParserFactory.newInstance();
            SAXParser saxParser = factory.newSAXParser();
            // Handler is the DefaultHandler subclass defined in Handler.java
            Handler userhandler = new Handler();
            saxParser.parse(inputFile, userhandler);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Handler.java:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class Handler extends DefaultHandler {

    // Reads every question Id from the questions file and looks it up in posts.xml.
    public void getQuestionId() {
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader br = new BufferedReader(
                new FileReader("D:\\University\\Information Retrieval 2\\Hws\\Hw1\\files\\Q.txt"))) {
            br.readLine(); // skip the header line, assuming the file starts with Id,CreationDate,...
            String qId;
            while ((qId = br.readLine()) != null) {
                qId = qId.split(",")[0]; // this is the question id
                findAndIndexOnPost(qId); // find this id in posts.xml, then index it
            }
        } catch (IOException e) { // also covers FileNotFoundException
            e.printStackTrace();
        }
    }

    private void findAndIndexOnPost(String qID) {
        // TODO: not implemented yet
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes)
            throws SAXException {
        if (qName.equalsIgnoreCase("row")) {
            System.out.println(attributes.getValue("Id"));
            switch (attributes.getValue("PostTypeId")) {
                case "1": // question
                    String id = attributes.getValue("Id");
                    break;
                case "2": // answer
                    break;
                default:
                    break;
            }
        }
    }
}
Update
I need to keep a pointer into the XML file across iterations, but I don't know how to do that with SAX.
Answer 0 (score: 1)
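What you have to do is:
1. Read all Id values from questions.txt into a List<Integer> questionIds.
2. Read the rows of posts.xml one by one. You will have to parse them manually (with a regex or String.indexOf()).
3. For each row, check questionIds.contains(givenId).
4. If the Id is in the list, index that row with Lucene.
Ta-da! Your data is now indexed with Lucene.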
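The steps above can be sketched roughly as follows. This is only an illustration under a few assumptions: the class and method names (QuestionFilter, loadQuestionIds, findMatches) are hypothetical, a HashSet is swapped in for the List so that each contains() check takes constant time, the rows are read with the SAX parser the question already uses rather than with a regex, and the Lucene call is left as a placeholder comment.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class QuestionFilter {

    /** Reads the first CSV column of every data line into a set, skipping the header line. */
    public static Set<Integer> loadQuestionIds(Reader source) throws IOException {
        Set<Integer> ids = new HashSet<>();
        try (BufferedReader br = new BufferedReader(source)) {
            String line = br.readLine(); // header: Id,CreationDate,CreationDatesk,Score
            while ((line = br.readLine()) != null) {
                String first = line.split(",", 2)[0].trim();
                if (!first.isEmpty()) {
                    ids.add(Integer.parseInt(first));
                }
            }
        }
        return ids;
    }

    /** Streams the posts once and returns the Ids of the rows found in questionIds. */
    public static List<Integer> findMatches(InputStream posts, Set<Integer> questionIds)
            throws Exception {
        List<Integer> matched = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                if (!"row".equals(qName)) {
                    return;
                }
                String id = attrs.getValue("Id");
                if (id != null && questionIds.contains(Integer.valueOf(id))) {
                    // Placeholder: this is where the row would be handed to Lucene.
                    matched.add(Integer.valueOf(id));
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(posts, handler);
        return matched;
    }

    public static void main(String[] args) throws Exception {
        // Tiny in-memory stand-ins for questions.txt and posts.xml.
        String csv = "Id,CreationDate,CreationDatesk,Score\n4,2008-08-01 16:08:52,20080801,48\n";
        String xml = "<posts><row Id=\"4\" PostTypeId=\"1\"/><row Id=\"6\" PostTypeId=\"1\"/></posts>";
        Set<Integer> ids = loadQuestionIds(new StringReader(csv));
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(findMatches(in, ids)); // prints [4]
    }
}
```

Because the Id set is built once up front, posts.xml only needs to be read a single time, which is what matters for a file of this size.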
Also, change the way you pass data to the SAX parser. Instead of giving it a File, create an implementation of InputStream that you can pass to saxParser.parse(inputStream, userhandler);. Info on getting the position in a stream is here: Given a Java InputStream, how can I determine the current offset in the stream?.
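One possible shape for such a stream is a thin wrapper that counts the bytes read through it. The sketch below is an assumption about what that implementation could look like, not code from the linked question; the class name CountingInputStream and the method getCount are hypothetical. Note that XML parsers usually read ahead in blocks, so the count is likely to run somewhat past the row currently being handled and should be treated as approximate.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

public class CountingInputStream extends FilterInputStream {
    private long count; // bytes consumed from the wrapped stream so far

    public CountingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b != -1) {
            count++;
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) {
            count += n;
        }
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        long skipped = super.skip(n);
        count += skipped;
        return skipped;
    }

    @Override
    public boolean markSupported() {
        return false; // mark/reset would make the count wrong, so it is disabled
    }

    /** Returns the number of bytes read or skipped so far. */
    public long getCount() {
        return count;
    }

    public static void main(String[] args) throws IOException {
        CountingInputStream in = new CountingInputStream(new ByteArrayInputStream(new byte[10]));
        in.read(new byte[4]);
        System.out.println(in.getCount()); // prints 4
    }
}
```

It would be used by wrapping a FileInputStream for the posts file, passing the wrapper to saxParser.parse(inputStream, userhandler), and calling getCount() from inside the handler.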