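I downloaded a large wiki dump XML file from https://dumps.wikimedia.org/enwiki/20170520/
and I want to extract metadata from it, specifically each company's name and parent company. All company data sits in an infobox template embedded in the XML, like this: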
{{Infobox company
| name =
| logo =
| type =
| industry =
| fate =
| predecessor = <!-- or: | predecessors = -->
| successor = <!-- or: | successors = -->
| founded = <!-- if known: {{Start date and age|YYYY|MM|DD}} in [[city]], [[state]], [[country]] -->
| founder = <!-- or: | founders = -->
| defunct = <!-- {{End date|YYYY|MM|DD}} -->
| hq_location_city =
| hq_location_country =
| area_served = <!-- or: | areas_served = -->
| key_people =
| products =
| owner = <!-- or: | owners = -->
| num_employees =
| num_employees_year = <!-- Year of num_employees data (if known) -->
| parent =
| website = <!-- {{URL|example.com}} -->
}}
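I did some research and found the MediaWiki Parser. References: https://github.com/dkpro/dkpro-jwpl/blob/master/de.tudarmstadt.ukp.wikipedia.parser/src/main/java/de/tudarmstadt/ukp/wikipedia/parser/tutorial/T1_SimpleParserDemo.java
https://dkpro.github.io/dkpro-jwpl/JWPLParser/
I tried this parser, but it requires the file to be converted to a String. My wiki dump XML file is 60 GB, so I cannot read it into a single String and keep it in memory. In addition, the MediaWiki Parser documentation does not explain how to locate a specific element such as "Infobox company", step into it, and extract the name and other fields. Here is the sample code for the MediaWiki Parser: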
public static void main(String[] args) throws IOException {
    File file = new File("C:/Users/njaiswal/Downloads/accenture_data_from_wikidumps.xml");
    String str = FileUtils.readFileToString(file);
    // get a ParsedPage object
    MediaWikiParserFactory pf = new MediaWikiParserFactory();
    MediaWikiParser parser = pf.createParser();
    ParsedPage pp = parser.parse(str);
    // get the sections
    for (Section section : pp.getSections()) {
        System.out.println("section : " + section.getTitle());
        System.out.println(" nr of paragraphs : " + section.nrOfParagraphs());
        System.out.println(" nr of tables : " + section.nrOfTables());
        System.out.println(" nr of nested lists : " + section.nrOfNestedLists());
        System.out.println(" nr of definition lists: " + section.nrOfDefinitionLists());
        for (Link link : section.getLinks(Link.type.INTERNAL)) {
            System.out.println(" " + link.getTarget());
        }
    }
}
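Is there another parser that can solve my problem? Or can I use the same MediaWiki Parser to reach the "Infobox company" template and extract its fields? Any help is appreciated. Thanks.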
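On the memory concern: one option that avoids holding the whole file in a String is the JDK's own StAX streaming reader (`javax.xml.stream`), which hands over one element at a time. The following is only a minimal sketch, not a tested solution for the full dump. The class name `DumpStreamer` and the method `forEachPage` are hypothetical names made up for illustration, and the sketch assumes each `<page>` carries a `<title>` and a `<text>` element, as the sample in `main` does.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.function.BiConsumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical helper: streams pages so only one page's text is in memory at a time. */
class DumpStreamer {

    /** Calls handler.accept(title, wikitext) once for every <text> element found. */
    static void forEachPage(Reader xml, BiConsumer<String, String> handler) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(xml);
            String title = null;
            while (reader.hasNext()) {
                if (reader.next() != XMLStreamConstants.START_ELEMENT) {
                    continue; // only element starts are of interest
                }
                String tag = reader.getLocalName();
                if (tag.equals("title")) {
                    title = reader.getElementText(); // remember the current page title
                } else if (tag.equals("text")) {
                    handler.accept(title, reader.getElementText());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not read the XML stream", e);
        }
    }

    public static void main(String[] args) {
        // Tiny illustrative sample standing in for the real dump
        String sample = "<mediawiki><page><title>Audi</title><revision>"
                + "<text>{{Infobox company\n| name = Audi AG\n}}</text>"
                + "</revision></page></mediawiki>";
        forEachPage(new StringReader(sample), (title, text) -> {
            if (text.contains("{{Infobox company")) {
                System.out.println(title);
            }
        });
    }
}
```

For the real file, the `StringReader` would be replaced by a buffered reader over the dump, for example `Files.newBufferedReader(Paths.get(path))`.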
Update: I tried the wikiXMLj parser that Khalil suggested. I can get all of the infobox data, but I want to restrict it to "Infobox company" data only. Here are my code and its output:
import edu.jhu.nlp.wikipedia.*;

public class Test {
    public static void main(String[] args) throws Exception {
        WikiXMLParser parser = WikiXMLParserFactory.getSAXParser("C:/Users/njaiswal/Downloads/enwiki-20170520-pages-articles-multistream.xml/enwiki-20170520-pages-articles-multistream.xml");
        parser.setPageCallback(new PageCallbackHandler() {
            public void process(WikiPage page) {
                try {
                    InfoBox infobox = page.getInfoBox();
                    // Defensive guard: skip pages for which no infobox was found
                    if (infobox != null) {
                        System.out.println(infobox.dumpRaw());
                    }
                } catch (WikiTextParserException e) {
                    // TODO Auto-generated catch block
                    e.printStackTrace();
                }
                // do something with info box
            }
        });
        parser.parse();
    }
}
Output:
{{Infobox Monarch
| name = Attila
| title = [[List of Hunnic rulers|Ruler]] of the [[Hunnic Empire]]
| place of burial =
}}
{{Infobox sea
| name = Aegean Sea
| image = Aegean Sea map.png
| caption = Map of the Aegean Sea
| pushpin_map = World
| pushpin_map_alt = World
| pushpin_label_position = right
}}
{{Infobox company
| name = Audi AG
| logo = Audi-Logo 2016.svg
| logo_size = 235
| image = Audi Ingolstadt.jpg
| image_size = 265
}}
Answer 0 (score: 0)
I have used wikixmlj before; it is a very simple, no-frills parser. The following should pick out the company infoboxes:
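// dumpPath should look like "C:/your/path/articles.xml.bz2"
// needs: import java.util.regex.Matcher; import java.util.regex.Pattern;
// Both opening braces must be escaped; DOTALL lets '.' match newlines as well
final Pattern pattern = Pattern.compile("\\{\\{Infobox company.+", Pattern.DOTALL);
WikiXMLParser wxsp = WikiXMLParserFactory.getSAXParser(dumpPath);
wxsp.setPageCallback(new PageCallbackHandler() {
    @Override
    public void process(WikiPage page) {
        try {
            InfoBox infobox = page.getInfoBox();
            if (infobox == null) {
                return; // defensive guard: no infobox found on this page
            }
            // The matcher needs the raw wikitext, not the InfoBox object itself
            Matcher matcher = pattern.matcher(infobox.dumpRaw());
            while (matcher.find()) {
                System.out.println(matcher.group(0));
            }
        } catch (WikiTextParserException e) {
            e.printStackTrace();
        }
    }
});
wxsp.parse();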
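Once you have the raw "Infobox company" text, you still need to pull out the individual fields such as name and parent. Here is a rough sketch of one way to do that. It is not part of wikixmlj: the class `InfoboxFields` and its `parse` method are hypothetical names made up for this example, and it assumes each field sits on its own line in the form `| key = value`, as in the template from the question. Values that span several lines are not handled. The sample input in `main`, including the `parent` line, is illustrative only.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not part of wikixmlj): splits raw infobox wikitext into fields. */
class InfoboxFields {

    /**
     * Parses lines of the form "| key = value" into an ordered map.
     * Assumes one field per line; continuation lines are ignored.
     */
    static Map<String, String> parse(String rawInfobox) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : rawInfobox.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.startsWith("|")) {
                continue; // skips "{{Infobox company", "}}" and continuation lines
            }
            int eq = trimmed.indexOf('=');
            if (eq < 0) {
                continue; // no key/value separator on this line
            }
            String key = trimmed.substring(1, eq).trim();
            // Drop HTML comments such as <!-- or: | owners = -->
            String value = trimmed.substring(eq + 1).replaceAll("<!--.*?-->", "").trim();
            fields.put(key, value);
        }
        return fields;
    }

    public static void main(String[] args) {
        // Illustrative sample input, not taken from the dump
        String raw = String.join("\n",
                "{{Infobox company",
                "| name = Audi AG",
                "| logo = Audi-Logo 2016.svg",
                "| parent = [[Volkswagen Group]]",
                "}}");
        Map<String, String> fields = parse(raw);
        System.out.println("name   = " + fields.get("name"));
        System.out.println("parent = " + fields.get("parent"));
    }
}
```

Inside the page callback you would pass `infobox.dumpRaw()` to `parse` and read the `name` and `parent` keys. Values keep their wiki markup (for example `[[...]]` links), so further cleanup may be needed.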