I'm using the GUI program, the Weka KnowledgeFlow environment, to set up my pipeline for training a classification model. Here's a small sample of my data. The only attribute is the value from the text field. Since this is supervised learning, each tweet/document also has its label/category there.
[
  {
    "id": 8.7361726140328e+17,
    "text": "The Joki's on you! Unless you take advantage of 25% off Scarlet Court Chests - on sale now! https:\/\/t.co\/vc1ttPxJWm",
    "category": [
      "dont_care"
    ]
  },
  {
    "id": 8.7329941695388e+17,
    "text": "Don't be a drag - dress like a queen! Scarlet Court Chest Rolls are 25% off! https:\/\/t.co\/O0Ig5bEZdD",
    "category": [
      "dont_care"
    ]
  },
  {
    "id": 8.7328034547452e+17,
    "text": "Join @Inukii and @MezmoreyezTV for Top 5 Console Plays! https:\/\/t.co\/3JmreXSTWp",
    "category": [
      "dont_care"
    ]
  }
]
The exception I get in the log:
11:16:12: [Low] FlowRunner$1697181913|FlowRunner: Launching start point: JSONLoader
11:16:12: [Basic] JSONLoader$17081058|Loading /home/j/_Github-Projects/GameMediaBot/SmiteGame_classified_data.json
11:16:12: [ERROR] JSONLoader$17081058|java.lang.Exception: Can't recover from previous error(s)
weka.core.WekaException: java.lang.Exception: Can't recover from previous error(s)
at weka.knowledgeflow.steps.Loader.start(Loader.java:178)
at weka.knowledgeflow.StepManagerImpl.startStep(StepManagerImpl.java:1035)
at weka.knowledgeflow.BaseExecutionEnvironment$3.run(BaseExecutionEnvironment.java:440)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: java.lang.Exception: Can't recover from previous error(s)
at weka.core.converters.JSONLoader.getStructure(JSONLoader.java:242)
at weka.core.converters.JSONLoader.getDataSet(JSONLoader.java:267)
at weka.knowledgeflow.steps.Loader.start(Loader.java:172)
... 7 more
Caused by: java.lang.Exception: Can't recover from previous error(s)
at java_cup.runtime.lr_parser.report_fatal_error(lr_parser.java:392)
at java_cup.runtime.lr_parser.unrecovered_syntax_error(lr_parser.java:539)
at java_cup.runtime.lr_parser.parse(lr_parser.java:731)
at weka.core.json.JSONNode.read(JSONNode.java:634)
at weka.core.converters.JSONLoader.getStructure(JSONLoader.java:234)
... 9 more
11:16:12: [Low] JSONLoader$17081058|Interrupted
My pipeline: [KnowledgeFlow screenshot not reproduced here]
Answer 0 (score: 0)
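On the error itself: Weka's JSONLoader is not a general-purpose JSON reader; it parses only Weka's own JSON dataset representation (the format its JSONSaver writes), which is why the parser gives up on arbitrary JSON like the tweet dump above. A rough, from-memory sketch of that representation follows; the field names here are an assumption, so export any dataset with JSONSaver to see the authoritative schema:

{
  "header": {
    "relation": "tweets",
    "attributes": [
      {"name": "text", "type": "string", "class": false, "weight": 1.0},
      {"name": "class", "type": "nominal", "class": true, "weight": 1.0,
       "labels": ["dont_care"]}
    ]
  },
  "data": [
    {"sparse": false, "weight": 1.0, "values": ["some tweet text", "dont_care"]}
  ]
}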
Anyway, I just wrote a script to transform my data (JSON) into ARFF. I'm not sure what the convention is for deriving attributes from text data; I simply used the 40 most frequent words from the tweets in the categories I care about. At the end I added an attribute named class, which acts like an enum and seems to be the convention for training a model.
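To illustrate, the generated ARFF has roughly this shape. The attribute words are invented for the example (the real ones are the top word counts), the relation name comes from the --dont-care-category option, and a made-up second category "news" stands in for whatever other labels the data contains:

@relation dont_care

@attribute scarlet INTEGER
@attribute chest INTEGER
@attribute sale INTEGER
@attribute class {dont_care, news}

@data
1, 2, 1, dont_care
0, 0, 0, news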
See the code on GitHub, https://github.com/jtara1/GameMediaBot/blob/master/transform_to_arff.py, or the same code here:

import re
import json
from os.path import join, dirname, abspath, basename
from collections import Counter, OrderedDict
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import arff
import click

@click.command()
@click.argument('file')
@click.option('--dont-care-category',
              type=click.STRING,
              default='dont_care')
@click.option('-a',
              type=click.INT,
              default=40,
              help='Number of attributes. Attrs are the most frequent words '
                   'in the text of the target category')
def transform(file, dont_care_category, a):
    """input example
        [ {id: 123, text: "this is text body", category: ["dont_care"]} ]

    output example
        @relation dont_care
        @attribute <word> INTEGER
        ...
        @attribute class {dont_care, ...}
        @data
        0, 1, ..., dont_care
    """
    classes = set()
    data = json.load(open(file, 'r'))

    # count word frequencies across every tweet outside the "don't care" category
    master_vector = Counter()
    for tweet in data:
        classes.add(tweet['category'][0])
        if tweet['category'][0] != dont_care_category:
            master_vector += get_word_vector(tweet)
    print(master_vector)

    # most common words in the text of the target category become the attributes
    attrs = [(word, 'INTEGER') for word, _ in master_vector.most_common(a)]
    attrs.append(('class', [value for value in classes]))

    arff_data = {
        'attributes': attrs,
        'data': [],
        'description': '',
        'relation': '{}'.format(dont_care_category)
    }

    # one row per tweet: a count for each attribute word, then the class label
    for tweet in data:
        word_vector = get_word_vector(tweet)
        tweet_data = [word_vector[attr[0]] for attr in attrs[:-1]]
        tweet_data.append(tweet['category'][0])
        arff_data['data'].append(tweet_data)

    out_file = file.replace('.json', '.arff')
    data = arff.dumps(arff_data)
    with open(out_file, 'w') as f:
        f.write(data)

def get_word_vector(tweet):
    """Count the filtered, lowercased word tokens in a tweet's text."""
    stop_words = stopwords.words('english')
    stop_words += ['!', ':', ',', '-', 'https', '/', '\u2026', "'s", "n't",
                   '#', '.', ';', ')', '(', "'re", '&', '?', '%', '@', "'",
                   '...']
    uri = re.compile(r'(https)?:?//t\.co/.*')
    # split the text into a list of word tokens
    words = word_tokenize(tweet['text'])
    # make each word lowercase
    words = [word.lower() for word in words]
    # drop stop words, punctuation tokens, and t.co link fragments
    words = list(
        filter(
            lambda word: word not in stop_words and not uri.match(word),
            words
        )
    )
    return Counter(words)

if __name__ == '__main__':
    transform()
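A run would look something like this (the file name is taken from the log above; both options shown are the defaults, so they can be omitted):

python transform_to_arff.py SmiteGame_classified_data.json --dont-care-category dont_care -a 40

The resulting SmiteGame_classified_data.arff can then be fed into the KnowledgeFlow by swapping the JSONLoader step for an ArffLoader.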