Question

I am getting this error:

Using Python version 3.5.2+ (default, Sep 22 2016 12:18:14)
SparkSession available as 'spark'.
Traceback (most recent call last):
  File "/home/saria/PycharmProjects/TfidfLDA/main.py", line 30, in <module>
    corpus = indexed_data.select(col("KeyIndex",str).cast("long"), "features").map(list)
  File "/home/saria/tf27/lib/python3.5/site-packages/pyparsing.py", line 956, in col
    return 1 if 0<loc<len(s) and s[loc-1] == '\n' else loc - s.rfind("\n", 0, loc)
TypeError: unorderable types: int() < str()

Process finished with exit code 1

It happens when I run the code below. To be precise, the error occurs on this line:

corpus = indexed_data.select(col("KeyIndex",str).cast("long"), "features").map(list)
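The traceback points into pyparsing.py rather than into PySpark, which suggests that the name col was imported from the wrong package. The sketch below reproduces the failing comparison in plain Python. The function body is copied from the line shown in the traceback; the name pyparsing_style_col is a hypothetical stand-in, not the library itself.

```python
def pyparsing_style_col(loc, strg):
    """Stand-in mirroring the line shown in the traceback (hypothetical name)."""
    s = strg
    # This function expects loc to be an integer position inside the string s.
    return 1 if 0 < loc < len(s) and s[loc - 1] == '\n' else loc - s.rfind("\n", 0, loc)


def reproduce():
    """Call the stand-in the way the failing line does and return the error."""
    try:
        # "KeyIndex" lands in loc, so 0 < "KeyIndex" compares int with str.
        pyparsing_style_col("KeyIndex", str)
    except TypeError as exc:
        return exc
    return None


if __name__ == "__main__":
    print(type(reproduce()).__name__)
```

On Python 3.5 the message reads "unorderable types: int() < str()"; newer Python versions word the same TypeError differently.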

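I looked at other questions about this error, but they are about converting between int and str, particularly when reading input. Here I am not reading any input.

To explain the code: it uses DataFrames to run TF-IDF followed by LDA.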

# I used alias to avoid confusion with the mllib library
from pyparsing import col
from pyspark.ml.clustering import LDA
from pyspark.ml.feature import HashingTF as MLHashingTF, Tokenizer, HashingTF, IDF, StringIndexer
from pyspark.ml.feature import IDF as MLIDF
from pyspark.python.pyspark.shell import sqlContext, sc

from pyspark.sql.types import DoubleType, StructField, StringType, StructType
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

dbURL = "hdfs://en.wikipedia.org/wiki/Music"
file = sc.textFile("1.txt")
#Define data frame schema
fields = [StructField('key',StringType(),False),StructField('content',StringType(),False)]
schema = StructType(fields)
#Data in format <key>,<listofwords>
file_temp = file.map(lambda l : l.split(","))
file_df = sqlContext.createDataFrame(file_temp, schema)
#Extract TF-IDF From https://spark.apache.org/docs/1.5.2/ml-features.html
tokenizer = Tokenizer(inputCol='content', outputCol='words')
wordsData = tokenizer.transform(file_df)
hashingTF = HashingTF(inputCol='words',outputCol='rawFeatures',numFeatures=1000)
featurizedData = hashingTF.transform(wordsData)
idf = IDF(inputCol='rawFeatures',outputCol='features')
idfModel = idf.fit(featurizedData)
rescaled_data = idfModel.transform(featurizedData)
indexer = StringIndexer(inputCol='key',outputCol='KeyIndex')
indexed_data = indexer.fit(rescaled_data).transform(rescaled_data).drop('key').drop('content').drop('words').drop('rawFeatures')
corpus = indexed_data.select(col("KeyIndex",str).cast("long"), "features").map(list)
model = LDA.train(corpus, k=2)
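A possible corrected version, as a sketch rather than a verified fix: it assumes Spark 2.x or later (the log shows a SparkSession), imports col from pyspark.sql.functions instead of pyparsing, and swaps LDA.train (which belongs to pyspark.mllib.clustering) for the DataFrame-based pyspark.ml.clustering.LDA with fit. That swap also avoids calling .map on a DataFrame, which is not available in Spark 2.x. The helper names parse_line and build_lda_model are hypothetical.

```python
def parse_line(line):
    """Split one '<key>,<words>' line into exactly two fields."""
    # partition splits on the first comma only, so commas in the text are kept.
    key, _, content = line.partition(",")
    return key.strip(), content.strip()


def build_lda_model(spark, path, k=2):
    """Run TF-IDF then LDA on DataFrames; assumes Spark 2.x or later."""
    # Imported here so parse_line can be used without Spark installed.
    from pyspark.ml.clustering import LDA
    from pyspark.ml.feature import HashingTF, IDF, StringIndexer, Tokenizer
    from pyspark.sql.functions import col  # not pyparsing.col

    rows = spark.sparkContext.textFile(path).map(parse_line)
    df = spark.createDataFrame(rows, ["key", "content"])

    # TF-IDF: tokenize, hash term frequencies, then rescale by IDF.
    words = Tokenizer(inputCol="content", outputCol="words").transform(df)
    tf = HashingTF(inputCol="words", outputCol="rawFeatures",
                   numFeatures=1000).transform(words)
    tfidf = IDF(inputCol="rawFeatures", outputCol="features").fit(tf).transform(tf)

    # Index the string key and keep only the columns LDA needs.
    indexed = StringIndexer(inputCol="key", outputCol="KeyIndex").fit(tfidf).transform(tfidf)
    corpus = indexed.select(col("KeyIndex").cast("long"), "features")

    # The DataFrame-based LDA is fitted directly on the 'features' column.
    return LDA(k=k, featuresCol="features").fit(corpus)
```

From the PySpark shell this would be called as build_lda_model(spark, "1.txt"). Splitting on the first comma only is a design choice made here so that each row always has the two fields the schema expects.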

Please let me know what you think.

When I remove str from the failing line:

corpus = indexed_data.select(col("KeyIndex",str).cast("long"), "features").map(list)

it throws a new error: TypeError: col() missing 1 required positional argument: 'strg'
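This second message fits the same diagnosis: the col in the traceback takes two positional arguments, a location and a string named strg, so it is not a column selector. Below is a small standard-library sketch for checking where a name comes from and what it expects. two_arg_col is a hypothetical stand-in that only borrows the parameter names from the error message.

```python
import inspect


def two_arg_col(loc, strg):
    """Hypothetical stand-in with the parameter names from the error message."""
    return loc


def describe(func):
    """Return the defining module and the signature of a callable."""
    return func.__module__, str(inspect.signature(func))


def call_with_one_argument():
    """Call the stand-in with a single argument, as the edited line does."""
    try:
        two_arg_col("KeyIndex")
    except TypeError as exc:
        return str(exc)
    return ""


if __name__ == "__main__":
    print(describe(two_arg_col))
    print(call_with_one_argument())
```

Calling describe(col) on the name actually imported in the script should show whether it is defined in pyparsing or in pyspark.sql.functions.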

Update: my main goal is to run this code:

tfidf then lda

error TypeError: unorderable types: int() < str()

0 answers: