I have transaction (basket) data in a comma-separated itemset format:
citrus fruit,semi-finished,bread,margarine
tropical fruit,yogurt,coffee,milk
yogurt,cream,cheese,meat spreads
etc
where each line represents the items purchased in a single transaction. I loaded this file into RapidMiner with the Read CSV operator, but I cannot find any operator to transform this data for FP-Growth and association rule mining.
Is there a way to read this kind of file in RapidMiner for association rule mining?
Answer 0 (score: 0)
I finally understand what you mean (sorry for being slow). This can be done using operators from the Text Processing extension, which you have to install from the RapidMiner Marketplace. Once you have it, you can try this process.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="7.0.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.0.000" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="read_csv" compatibility="7.0.000" expanded="true" height="68" name="Read CSV" width="90" x="246" y="85">
<parameter key="csv_file" value="C:\Temp\is.txt"/>
<parameter key="column_separators" value="\r\n"/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations"/>
<parameter key="encoding" value="windows-1252"/>
<list key="data_set_meta_data_information">
<parameter key="0" value="att1.true.polynominal.attribute"/>
</list>
</operator>
<operator activated="true" class="nominal_to_text" compatibility="7.0.000" expanded="true" height="82" name="Nominal to Text" width="90" x="380" y="85"/>
<operator activated="true" class="text:process_document_from_data" compatibility="7.0.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="514" y="85">
<parameter key="vector_creation" value="Term Occurrences"/>
<list key="specify_weights"/>
<process expanded="true">
<operator activated="true" class="text:tokenize" compatibility="7.0.000" expanded="true" height="68" name="Tokenize" width="90" x="45" y="34">
<parameter key="mode" value="specify characters"/>
<parameter key="characters" value=","/>
</operator>
<connect from_port="document" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<connect from_op="Read CSV" from_port="output" to_op="Nominal to Text" to_port="example set input"/>
<connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
<connect from_op="Process Documents from Data" from_port="example set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
The trick is to use Read CSV to read the original file, but with the end-of-line characters as the column separator. This reads each whole line in as a single polynominal attribute. From there, you have to convert it to text so that the text-processing operators can do their work. The Process Documents from Data operator then builds the final example set. The important part is the Tokenize operator inside it, which splits each line into words separated by commas.
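For comparison, here is a minimal sketch of the same idea outside RapidMiner, assuming the pandas and mlxtend libraries and a hypothetical input path baskets.txt (not part of the original answer): each line is split on commas, one-hot encoded into a term-occurrence style table, and then fed to FP-Growth and association rule mining.

# Minimal sketch; assumes pandas and mlxtend are installed, baskets.txt is a placeholder path
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth, association_rules

# Read each line as one transaction and split it on commas (the "Tokenize" step)
with open("baskets.txt", encoding="windows-1252") as f:
    transactions = [line.strip().split(",") for line in f if line.strip()]

# One-hot encode the items (roughly what "Process Documents from Data" produces)
encoder = TransactionEncoder()
onehot = encoder.fit(transactions).transform(transactions)
basket_df = pd.DataFrame(onehot, columns=encoder.columns_)

# Mine frequent itemsets with FP-Growth and derive association rules
frequent_itemsets = fpgrowth(basket_df, min_support=0.01, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.5)
print(rules[["antecedents", "consequents", "support", "confidence"]].head())

Inside RapidMiner itself, the example set produced by the process above can then be connected to the FP-Growth and Create Association Rules operators to finish the association rule mining the question asks about.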