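I want to create an ontology with Jena, based on the output of a Weka J48 decision tree. Before this output can be fed into Jena, it has to be mapped to RDF. Is there a way to do this mapping?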
EDIT1:
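A sample part of the J48 decision tree output before the mapping:
A sample part of the RDF that corresponds to the decision tree output:
These two screenshots are taken from this research paper (slide 4):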
Answer 0 (score: 2)
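There is probably no built-in method for doing this.
Disclaimer: I have never worked with Jena and RDF before, so this answer may be incomplete or may miss the intended conversion.
But first, a short rant: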
<rant>
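<sub>The snippets published in the paper (namely, the output of the Weka classifier and the RDF) are incomplete and obviously inconsistent. The conversion process is not fully described. Instead, they only mention:</sub>
The challenge we faced was mainly getting the J48 classification output into RDF and handing it to Jena
(sic!)
<sub>Now, they solved it somehow. They could have provided their conversion code in a public, open-source repository. This would have allowed others to contribute improvements, and it would have increased the visibility and verifiability of their approach. Instead, they wasted their time and the readers' time with screenshots of various websites that serve as page fillers, in a pitiful attempt to squeeze yet another publication out of their approach.</sub>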
</rant>
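The following is my best attempt to provide some of the building blocks that the conversion requires. It has to be taken with a grain of salt, because I am not familiar with the underlying methods and libraries. But I hope it can be considered "helpful" nevertheless.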
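The Weka Classifier implementations usually do not expose the structures that they use for their internal workings, so the internal tree structure cannot be accessed directly. However, there is a method prefix() that returns a string representation of the tree.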
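To make the format concrete: judging from the parsing code in this answer, the string returned by prefix() nests every node in square brackets. Inner nodes appear as "attribute: condition, condition" followed by their bracketed children, and leaves appear as "label (counts)". The sample string below is written by hand to match the Iris tree shown at the end of this answer; it is not copied from a Weka run, so its exact spacing is an assumption. The helper leafLabels is hypothetical and only extracts the leaf labels, as a quick check of that reading of the format.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PrefixFormatDemo
{
    // Hand-written sample in the bracketed format that the parser in this
    // answer expects. This is an assumption, not verbatim Weka output.
    static final String SAMPLE =
        "[petalwidth: <= 0.6, > 0.6"
        + "[Iris-setosa (50.0)]"
        + "[petalwidth: <= 1.7, > 1.7"
        + "[petallength: <= 4.9, > 4.9"
        + "[Iris-versicolor (48.0/1.0)]"
        + "[petalwidth: <= 1.5, > 1.5"
        + "[Iris-virginica (3.0)]"
        + "[Iris-versicolor (3.0/1.0)]]]"
        + "[Iris-virginica (46.0/1.0)]]]";

    // A bracket pair that contains no further brackets is a leaf
    private static final Pattern LEAF =
        Pattern.compile("\\[([^\\[\\]]*)\\]");

    // Hypothetical helper: returns the leaf labels from left to right
    static List<String> leafLabels(String prefix)
    {
        List<String> labels = new ArrayList<>();
        Matcher matcher = LEAF.matcher(prefix);
        while (matcher.find())
        {
            String content = matcher.group(1);
            // Drop the trailing instance counts, e.g. " (50.0)"
            int countsStart = content.lastIndexOf('(');
            String label = countsStart >= 0
                ? content.substring(0, countsStart) : content;
            labels.add(label.trim());
        }
        return labels;
    }

    public static void main(String[] args)
    {
        System.out.println(leafLabels(SAMPLE));
    }
}
```

If the real prefix() output deviates from this sample (for example for nominal attributes with more than two branches), both this helper and the parser further below would need to be adapted.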
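The code below contains a very pragmatic (and thus somewhat brittle...) method that parses this string and builds a tree structure containing the relevant information. This structure consists of TreeNode objects: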
static class TreeNode
{
    String label;
    String attribute;
    String relation;
    String value;
    ...
}
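label is the class label that the classifier assigns. It is non-null only for leaf nodes. For the example from the paper, this would be "0" or "1", indicating whether an email is spam or not.
attribute is the attribute that the decision is based on. For the example from the paper, such an attribute could be word_freq_remove.
relation and value are strings that represent the decision criterion. For example, these could be "<=" and "0.08".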
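As a rough illustration of how these four fields work together, the sketch below builds a two-leaf tree by hand and walks it for a single numeric value. The Node class mirrors the fields described above. The classify and holds methods, the Map-based instance, and the "not-setosa" label are hypothetical additions for this example, and the sketch assumes numeric attributes only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class TreeWalkDemo
{
    // Mirrors the fields of the TreeNode class described above
    static class Node
    {
        String label;      // class label, non-null only for leaves
        String attribute;  // attribute that an inner node tests
        String relation;   // e.g. "<=", applies to the parent's attribute
        String value;      // e.g. "0.6", applies to the parent's attribute
        List<Node> children = new ArrayList<>();
    }

    static Node leaf(String relation, String value, String label)
    {
        Node node = new Node();
        node.relation = relation;
        node.value = value;
        node.label = label;
        return node;
    }

    // Hand-built tree; the "not-setosa" label is made up for this sketch
    static Node sampleTree()
    {
        Node root = new Node();
        root.attribute = "petalwidth";
        root.children.add(leaf("<=", "0.6", "Iris-setosa"));
        root.children.add(leaf(">", "0.6", "not-setosa"));
        return root;
    }

    static boolean holds(double actual, String relation, double threshold)
    {
        switch (relation)
        {
            case "<":  return actual < threshold;
            case "<=": return actual <= threshold;
            case ">":  return actual > threshold;
            case ">=": return actual >= threshold;
            default:
                throw new IllegalArgumentException(
                    "Unknown relation: " + relation);
        }
    }

    // Follows the first child whose condition holds until a leaf is reached
    static String classify(Node node, Map<String, Double> instance)
    {
        if (node.label != null)
        {
            return node.label;
        }
        double actual = instance.get(node.attribute);
        for (Node child : node.children)
        {
            double threshold = Double.parseDouble(child.value);
            if (holds(actual, child.relation, threshold))
            {
                return classify(child, instance);
            }
        }
        throw new IllegalStateException(
            "No branch matched for " + node.attribute);
    }

    public static void main(String[] args)
    {
        Node root = sampleTree();
        System.out.println(classify(root, Map.of("petalwidth", 0.2)));
        System.out.println(classify(root, Map.of("petalwidth", 1.3)));
    }
}
```

Note that relation and value are stored on the child but refer to the attribute of the parent, which is why classify reads node.attribute and child.relation together.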
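Once such a tree structure has been created, it can be converted into an Apache Jena Model instance. The code contains such a conversion method, but since I am not familiar with RDF, I am not sure whether it makes sense conceptually. Adjustments may be necessary in order to create the "desired" RDF structure from this tree structure. But at a naive glance, the output looks plausible.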
import java.io.FileInputStream;
import java.util.ArrayList;
import java.util.List;

import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Property;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.rdf.model.Statement;

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ArffLoader;

public class WekaClassifierToRdf
{
    public static void main(String[] args) throws Exception
    {
        // Load the data set; the class attribute of Iris is at index 4
        String fileName = "./data/iris.arff";
        ArffLoader arffLoader = new ArffLoader();
        arffLoader.setSource(new FileInputStream(fileName));
        Instances instances = arffLoader.getDataSet();
        instances.setClassIndex(4);
        //System.out.println(instances);

        // Train the decision tree and print it in Weka's own format
        J48 classifier = new J48();
        classifier.buildClassifier(instances);
        System.out.println(classifier);

        // Parse the prefix string into the custom tree structure
        String prefixTreeString = classifier.prefix();
        TreeNode node = processPrefixTreeString(prefixTreeString);

        System.out.println("Tree:");
        System.out.println(node.createString());

        // Convert the tree structure into a Jena model and print it
        Model model = createModel(node);

        System.out.println("Model:");
        model.write(System.out, "RDF/XML-ABBREV");
    }
    // Recursively parses one bracketed node of the prefix string. This
    // assumes binary splits of the form "attribute: cond, cond[left][right]"
    private static TreeNode processPrefixTreeString(String inputString)
    {
        String string = inputString.replaceAll("\\n", "");
        //System.out.println("Input is " + string);

        // Strip the outermost pair of brackets
        int open = string.indexOf("[");
        int close = string.lastIndexOf("]");
        String part = string.substring(open + 1, close);
        //System.out.println("Part " + part);

        // No colon means that this is a leaf of the form "label (counts)"
        int colon = part.indexOf(":");
        if (colon == -1)
        {
            TreeNode node = new TreeNode();
            int openAfterLabel = part.lastIndexOf("(");
            String label = part.substring(0, openAfterLabel).trim();
            node.label = label;
            return node;
        }
        String attributeName = part.substring(0, colon);
        //System.out.println("attributeName " + attributeName);

        // Split the two conditions, e.g. "<= 0.6" and "> 0.6"
        int comma = part.indexOf(",", colon);
        int leftOpen = part.indexOf("[", comma);
        String leftCondition = part.substring(colon + 1, comma).trim();
        String rightCondition = part.substring(comma + 1, leftOpen).trim();

        int leftSpace = leftCondition.indexOf(" ");
        String leftRelation = leftCondition.substring(0, leftSpace).trim();
        String leftValue = leftCondition.substring(leftSpace + 1).trim();

        int rightSpace = rightCondition.indexOf(" ");
        String rightRelation = rightCondition.substring(0, rightSpace).trim();
        String rightValue = rightCondition.substring(rightSpace + 1).trim();
        //System.out.println("leftCondition " + leftCondition);
        //System.out.println("rightCondition " + rightCondition);

        // Cut out the bracketed substrings of the two children
        int leftClose = findClosing(part, leftOpen + 1);
        String left = part.substring(leftOpen, leftClose + 1);
        //System.out.println("left " + left);

        int rightOpen = part.indexOf("[", leftClose);
        int rightClose = findClosing(part, rightOpen + 1);
        String right = part.substring(rightOpen, rightClose + 1);
        //System.out.println("right " + right);

        // The condition that leads to a child is stored in the child
        TreeNode leftNode = processPrefixTreeString(left);
        leftNode.relation = leftRelation;
        leftNode.value = leftValue;

        TreeNode rightNode = processPrefixTreeString(right);
        rightNode.relation = rightRelation;
        rightNode.value = rightValue;

        TreeNode result = new TreeNode();
        result.attribute = attributeName;
        result.children.add(leftNode);
        result.children.add(rightNode);
        return result;
    }
    // Returns the index of the "]" that closes the bracket which was opened
    // just before startIndex, or -1 if there is none
    private static int findClosing(String string, int startIndex)
    {
        int stack = 0;
        for (int i=startIndex; i<string.length(); i++)
        {
            char c = string.charAt(i);
            if (c == '[')
            {
                stack++;
            }
            if (c == ']')
            {
                if (stack == 0)
                {
                    return i;
                }
                stack--;
            }
        }
        return -1;
    }
    static class TreeNode
    {
        // Class label; non-null only for leaf nodes
        String label;
        // Attribute that this node tests; null for leaf nodes
        String attribute;
        // Condition on the parent's attribute that leads to this node
        String relation;
        String value;
        List<TreeNode> children = new ArrayList<TreeNode>();

        String createString()
        {
            StringBuilder sb = new StringBuilder();
            createString("", sb);
            return sb.toString();
        }

        private void createString(String indent, StringBuilder sb)
        {
            if (children.isEmpty())
            {
                sb.append(indent + label);
            }
            sb.append("\n");
            for (TreeNode child : children)
            {
                sb.append(indent + "if " + attribute + " " + child.relation
                    + " " + child.value + ": ");
                child.createString(indent + " ", sb);
            }
        }

        @Override
        public String toString()
        {
            return "TreeNode [label=" + label + ", attribute=" + attribute
                + ", relation=" + relation + ", value=" + value + "]";
        }
    }
    // Encodes the condition of a node as a property name, e.g. "lte_0.6"
    private static String createPropertyString(TreeNode node)
    {
        if ("<".equals(node.relation))
        {
            return "lt_" + node.value;
        }
        if ("<=".equals(node.relation))
        {
            return "lte_" + node.value;
        }
        if (">".equals(node.relation))
        {
            return "gt_" + node.value;
        }
        if (">=".equals(node.relation))
        {
            return "gte_" + node.value;
        }
        System.err.println("Unknown relation: " + node.relation);
        return "UNKNOWN";
    }
    static Model createModel(TreeNode node)
    {
        Model model = ModelFactory.createDefaultModel();

        String baseUri = "http://www.example.com/example#";
        model.createResource(baseUri);
        model.setNsPrefix("base", baseUri);
        populateModel(model, baseUri, node, node.attribute);
        return model;
    }
    private static void populateModel(Model model, String baseUri,
        TreeNode node, String resourceName)
    {
        //System.out.println("Populate with " + resourceName);

        for (TreeNode child : node.children)
        {
            if (child.label != null)
            {
                // Leaf: the condition points to the class label literal
                Resource resource =
                    model.createResource(baseUri + resourceName);
                String propertyString = createPropertyString(child);
                Property property =
                    model.createProperty(baseUri, propertyString);
                Statement statement = model.createLiteralStatement(resource,
                    property, child.label);
                model.add(statement);
            }
            else
            {
                // Inner node: the condition points to the child resource,
                // which is named after the path of attributes leading to it
                Resource resource =
                    model.createResource(baseUri + resourceName);
                String propertyString = createPropertyString(child);
                Property property =
                    model.createProperty(baseUri, propertyString);

                String nextResourceName = resourceName + "_" + child.attribute;
                Resource childResource =
                    model.createResource(baseUri + nextResourceName);
                Statement statement =
                    model.createStatement(resource, property, childResource);
                model.add(statement);
            }
        }
        // Recurse into the children; leaves have no children of their own,
        // so the call is a no-op for them
        for (TreeNode child : node.children)
        {
            String nextResourceName = resourceName + "_" + child.attribute;
            populateModel(model, baseUri, child, nextResourceName);
        }
    }
}
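The program parses the well-known Iris data set from an ARFF file, runs the J48 classifier, builds the tree structure, and then generates and prints the RDF model. The output is shown here:
The classifier, as printed by Weka: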
J48 pruned tree
------------------

petalwidth <= 0.6: Iris-setosa (50.0)
petalwidth > 0.6
|   petalwidth <= 1.7
|   |   petallength <= 4.9: Iris-versicolor (48.0/1.0)
|   |   petallength > 4.9
|   |   |   petalwidth <= 1.5: Iris-virginica (3.0)
|   |   |   petalwidth > 1.5: Iris-versicolor (3.0/1.0)
|   petalwidth > 1.7: Iris-virginica (46.0/1.0)

Number of Leaves : 5

Size of the tree : 9
A string representation of the tree structure that is built internally:
Tree:

if petalwidth <= 0.6: Iris-setosa
if petalwidth > 0.6:
  if petalwidth <= 1.7:
    if petallength <= 4.9: Iris-versicolor
    if petallength > 4.9:
      if petalwidth <= 1.5: Iris-virginica
      if petalwidth > 1.5: Iris-versicolor
  if petalwidth > 1.7: Iris-virginica
The resulting RDF model:
Model:
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:base="http://www.example.com/example#">
  <rdf:Description rdf:about="http://www.example.com/example#petalwidth">
    <base:gt_0.6>
      <rdf:Description rdf:about="http://www.example.com/example#petalwidth_petalwidth">
        <base:gt_1.7>Iris-virginica</base:gt_1.7>
        <base:lte_1.7>
          <rdf:Description rdf:about="http://www.example.com/example#petalwidth_petalwidth_petallength">
            <base:gt_4.9>
              <rdf:Description rdf:about="http://www.example.com/example#petalwidth_petalwidth_petallength_petalwidth">
                <base:gt_1.5>Iris-versicolor</base:gt_1.5>
                <base:lte_1.5>Iris-virginica</base:lte_1.5>
              </rdf:Description>
            </base:gt_4.9>
            <base:lte_4.9>Iris-versicolor</base:lte_4.9>
          </rdf:Description>
        </base:lte_1.7>
      </rdf:Description>
    </base:gt_0.6>
    <base:lte_0.6>Iris-setosa</base:lte_0.6>
  </rdf:Description>
</rdf:RDF>
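The nested RDF/XML is only one serialization of a flat set of subject, predicate, object statements. As a way to inspect the mapping without Jena on the classpath, the following sketch reproduces the naming scheme of populateModel and createPropertyString with plain strings and collects one line per statement. The Node class, the leaf, inner, and collect helpers, and the hand-built Iris tree are hypothetical stand-ins for this example. The line format only resembles N-Triples and is not meant to be fed into an RDF parser.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TripleSketch
{
    static final String BASE = "http://www.example.com/example#";

    // Stand-in for the TreeNode class of the answer
    static class Node
    {
        String label, attribute, relation, value;
        List<Node> children = new ArrayList<>();
    }

    static Node leaf(String relation, String value, String label)
    {
        Node node = new Node();
        node.relation = relation;
        node.value = value;
        node.label = label;
        return node;
    }

    static Node inner(String relation, String value, String attribute,
        Node... children)
    {
        Node node = new Node();
        node.relation = relation;
        node.value = value;
        node.attribute = attribute;
        node.children.addAll(Arrays.asList(children));
        return node;
    }

    // Same encoding as createPropertyString, e.g. "<=" and "0.6" -> "lte_0.6"
    static String predicate(Node node)
    {
        switch (node.relation)
        {
            case "<":  return "lt_" + node.value;
            case "<=": return "lte_" + node.value;
            case ">":  return "gt_" + node.value;
            case ">=": return "gte_" + node.value;
            default:   return "UNKNOWN";
        }
    }

    // One statement per child: a literal for leaves, a resource otherwise
    static void collect(Node node, String resourceName, List<String> out)
    {
        String subject = "<" + BASE + resourceName + ">";
        for (Node child : node.children)
        {
            String pred = "<" + BASE + predicate(child) + ">";
            if (child.label != null)
            {
                out.add(subject + " " + pred + " \"" + child.label + "\" .");
            }
            else
            {
                String childName = resourceName + "_" + child.attribute;
                out.add(subject + " " + pred + " <" + BASE + childName + "> .");
                collect(child, childName, out);
            }
        }
    }

    // The Iris tree from the Weka output, built by hand
    static Node irisTree()
    {
        return inner(null, null, "petalwidth",
            leaf("<=", "0.6", "Iris-setosa"),
            inner(">", "0.6", "petalwidth",
                inner("<=", "1.7", "petallength",
                    leaf("<=", "4.9", "Iris-versicolor"),
                    inner(">", "4.9", "petalwidth",
                        leaf("<=", "1.5", "Iris-virginica"),
                        leaf(">", "1.5", "Iris-versicolor"))),
                leaf(">", "1.7", "Iris-virginica")));
    }

    public static void main(String[] args)
    {
        Node root = irisTree();
        List<String> triples = new ArrayList<>();
        collect(root, root.attribute, triples);
        triples.forEach(System.out::println);
    }
}
```

Unlike populateModel, this sketch recurses into a child immediately after emitting its statement, so the order of the lines differs, but the set of statements is meant to be the same. Seeing them flat also makes one property of the mapping visible: thresholds are baked into property names such as lte_0.6, which may or may not be what the "desired" ontology calls for.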