How to search for a substring within a string using PySpark

Date: 2017-02-05 07:38:29

Tags: string substring pyspark spark-dataframe

The attached image contains a sample of the input and the desired output.

For example, if a sentence contains "John" and "drives", that means John has a car and drives to work. I have attached the code I used, but it does not work correctly and is overly complicated. I would greatly appreciate your help.


1 Answer:

Answer 0 (score: 0)

I would do it like this:

import os
import socket

class SparkUtil(object):
    @staticmethod
    def get_spark_context (host, venv, framework_name, parts):
        os.environ['PYSPARK_PYTHON'] = "{0}/bin/python".format (venv)
        from pyspark import SparkConf, SparkContext
        ip = socket.gethostbyname(socket.gethostname())
        sparkConf = (SparkConf()
                     .setMaster(host)
                     .setAppName(framework_name))
        return SparkContext(conf = sparkConf)

input_txt = [
    [ "John", "John usually drives to work. He usually gets up early and drinks coffee. Mary usually joining him." ],
    [ "Sam",  "As opposed to John, Sam doesn't like to drive. Sam usually walks there." ],
    [ "Mary", "Mary doesn't have driving license. Mary usually coming with John which picks her up from home." ]
]

def has_car (text):
    # True if the text mentions driving, which implies owning a car.
    return "drives" in text

def get_method (text):
    method = None
    for m in [ "drives", "walks", "coming with" ]:
        if m in text:
            method = m
            break
    return method

def process_row (row):
    return [ row[0], has_car(row[1]), get_method(row[1]) ]

sc = SparkUtil.get_spark_context (host           = "local[2]",
                                  venv           = "../starshome/venv",
                                  framework_name = "app",
                                  parts          = 2)

print (sc.parallelize (input_txt).map (process_row).collect ())

You can ignore the SparkUtil class. I'm not using a notebook; this is just a plain Spark application.