How to correctly use a lookup table in PySpark

Asked: 2017-02-23 17:43:35

Tags: python-3.x apache-spark dataframe pyspark

My Spark job, written in Python, is as follows:

from pyspark import SparkConf, SparkContext
import collections
import pandas as pd

#Set up Spark and the broadcast lookup table:
df = pd.read_csv("/Users/luca/Desktop/unk", sep="\t")
conf = SparkConf().setMaster("local").setAppName("NLP-TAG")
sc = SparkContext(conf=conf)
unkWords = sc.broadcast(dict.fromkeys(set(df["word"]), 0))

#Helper function:
def parseline(line):
    fields = line.split(',')
    #Handling UNK
    base = str(fields[0])
    tag = str(fields[1])
    if base in unkWords.value:
        base = "<UNK>"
    return((base, tag),1)

lines = sc.textFile("/Users/luca/Desktop/train-brown.txt") #Creating the RDD
results = lines.map(parseline).reduceByKey(lambda x,y: x+y).sortByKey().collect()

for result in results:
    print(str(result[0][0])+"\t"+str(result[0][1])+"\t"+str(result[1]))

In my Spark job I want a dictionary holding a set of words, so I create a broadcast dictionary with the following syntax:

unkWords = sc.broadcast(dict.fromkeys(set(df["word"]), 0))
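
As an aside: since parseline only tests membership (`base in unkWords.value`) and never reads the zero values, a plain set would presumably serve just as well as the dict built with dict.fromkeys. A quick local sketch, no Spark required:

```python
# Membership tests behave identically for the dict built with
# dict.fromkeys(...) and for a plain set of the same words.
words = ["the", "a", "the", "of"]

unk_dict = dict.fromkeys(set(words), 0)  # what the post broadcasts
unk_set = set(words)                     # a lighter alternative

for w in ["the", "hotel"]:
    assert (w in unk_dict) == (w in unk_set)
```

Broadcasting the set directly would shrink the serialized payload shipped to each executor, since the unused zero values are dropped.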

In train-brown.txt each line is a word,tag pair, for example: 旅店 世界,NN

In my parseline function I want to change words that exist in my dictionary to UNK. However, I get an error like the following:

Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/Users/luca/Apps/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 174, in main
    process()
  File "/Users/luca/Apps/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 169, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/Users/luca/Apps/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 2407, in pipeline_func
  File "/Users/luca/Apps/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 2407, in pipeline_func
  File "/Users/luca/Apps/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 346, in func
  File "/Users/luca/Apps/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 1828, in combineLocally
  File "/Users/luca/Apps/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/shuffle.py", line 236, in mergeValues
    for k, v in iterator:
  File "/Users/luca/git/spark-python/src/basic/nlp-tag.py", line 17, in parseline
    tag = str(fields[1])
IndexError: list index out of range

    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:390)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    ... 1 more

If I remove these lines, the error goes away:

if base in unkWords.value:
    base = "<UNK>"
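
Note, though, that the traceback points at `fields[1]`, not at the broadcast lookup, which suggests the actual trigger may be an input line without a comma (a blank line, say): such a line splits into a single field, so indexing `fields[1]` raises IndexError. A hedged, Spark-free sketch of a defensive parseline; here the hypothetical parameter `unk_words` stands in for `unkWords.value`:

```python
def parseline(line, unk_words):
    """Parse one 'word,tag' line; return None for malformed lines."""
    fields = line.split(',')
    if len(fields) < 2:        # blank line or missing tag
        return None
    base, tag = fields[0], fields[1]
    if base in unk_words:      # membership test against the lookup table
        base = "<UNK>"
    return ((base, tag), 1)

# Malformed lines no longer raise; the None results could then be
# dropped (e.g. with an RDD filter) before reduceByKey.
assert parseline("hotel,NN", {"hotel"}) == (("<UNK>", "NN"), 1)
assert parseline("world,NN", {"hotel"}) == (("world", "NN"), 1)
assert parseline("", {"hotel"}) is None
```

This is only a sketch of one plausible cause; it does not explain why removing the broadcast check appeared to suppress the error in the original run.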

How do I use a broadcast variable correctly?

0 answers:

No answers yet