Question

I'm generating Turtle triples; the full dataset is already about 2 GB. For most testing I work on a small sample of a few K, then periodically attempt a test on the full dataset. It never loads all the way, but it tells me if there are errors.

My quick test is to load the ttl file into Protege. I'm using Protege 5.2 (the Windows version). There are no errors in the small samples. But when I load larger samples, Protege reads in the ttl file I generated and tells me there's an error.

import feedparser
import pandas as pd
import pymysql

rawrss = ['http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/front_page/rss.xml',
          'https://www.yahoo.com/news/rss/',
          'http://www.huffingtonpost.co.uk/feeds/index.xml',
          'http://feeds.feedburner.com/TechCrunch/',
         ]

# Collect (title, link, summary) for every entry of every feed
posts = []
for url in rawrss:
    feed = feedparser.parse(url)
    for post in feed.entries:
        posts.append((post.title, post.link, post.summary))
df = pd.DataFrame(posts, columns=['title', 'link', 'summary'])

# Open database connection (port and db name are redacted placeholders)
db = pymysql.connect(host="host", port=##, user="username", password="password", db="sql#######" )

# prepare a cursor object using cursor() method
cursor = db.cursor()

# Drop table if it already exists using execute() method.
cursor.execute("DROP TABLE IF EXISTS rsstracker")

# Create table as per requirement
create_sql = """CREATE TABLE rsstracker(
   article_title  varchar(255),
   article_url  varchar(1000),
   article_summary varchar(1000))"""

cursor.execute(create_sql)

# Insert the DataFrame rows through the same connection
insert_sql = """INSERT INTO rsstracker
   (article_title, article_url, article_summary)
   VALUES (%s, %s, %s)"""
cursor.executemany(insert_sql, list(df.itertuples(index=False, name=None)))
db.commit()

# disconnect from server
db.close()

It can take a very long time to load these sample files, and then it only tells me there was an error, without any indication of where the problem was. So my current method of debugging is binary search: generate a file half as large, see if there is an error, split the difference, check for an error again, and in that way narrow it down to a few lines in which I can easily spot the error. This is really tedious. Is there a way to get Protege to report the line where it puked?
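The manual halving described above amounts to a binary search over the sample size. Here is a minimal sketch of that loop, not a ready-made tool. It assumes a hypothetical `loads_ok(n)` check (for example, generating a file from the first n records and loading it in whatever tool reports the error), and it assumes that once a sample of n records fails, every larger sample fails too.

```python
from typing import Callable


def first_failing_size(total: int, loads_ok: Callable[[int], bool]) -> int:
    """Return the smallest n in 1..total for which loads_ok(n) is False.

    loads_ok is a hypothetical, user-supplied check. If even the full
    set of `total` records loads cleanly, total + 1 is returned.
    """
    lo, hi = 1, total + 1  # the answer always lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if loads_ok(mid):
            lo = mid + 1   # first mid records are fine, look further on
        else:
            hi = mid       # failure is at or before record mid
    return lo


# Stand-in check for illustration only: pretend record 7 is malformed.
def fake_loads_ok(n: int) -> bool:
    return n < 7


bad = first_failing_size(10, fake_loads_ok)
print("first failing sample size:", bad)
```

Each call to the check costs a full load, so the number of loads grows with the logarithm of the record count rather than with the count itself.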

If not, perhaps there is another tool I can use to check the syntax of the triples I generate?

Answer 1

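Out-of-memory errors are not raised in the parser, so there is no line number to report. The number of lines that can be loaded within your memory limit can only be guessed at by successive attempts.

The best workaround is to increase the value of the -Xmx parameter.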

Can Protege (ontology tool) report the line number of an error when reading a Turtle file?
