Question

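I'm trying to process URLs in a PySpark DataFrame using a class I wrote and a UDF. I'm aware of urllib and other URL-parsing libraries, but in this case I need to use my own code.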

To get the TLD of a URL, I cross-check it against the IANA public suffix list.

Here is a simplified version of my code:

class Parser:

    # list of available public suffixes for extracting top level domains
    file = open("public_suffix_list.txt", 'r')

    data = []
    for line in file:
        if line.startswith("//") or line == '\n':
            pass
        else:
            data.append(line.strip('\n'))

    def __init__(self, url):
        self.url = url

        #the code here extracts port,protocol,query etc.

        #I think this bit below is causing the error
        matches = [r for r in self.data if r in self.hostname] 

        #extra functionality in my actual class

        i = matches.index(self.string)

        try:
            self.tld = matches[i]

        # logic to find tld if no match

The class works in pure Python; for example, I can run

import Parser

x = Parser("www.google.com")
x.tld #returns ".com"

However, when I try

import Parser
from pyspark.sql.functions import udf

parse = udf(lambda x: Parser(x).url)

df = sqlContext.table("tablename").select(parse("column"))

I get the following error:

  File "<stdin>", line 3, in <lambda>
  File "<stdin>", line 27, in __init__
TypeError: 'in <string>' requires string as left operand

So my guess is that it isn't interpreting the data as a list of strings?

I have also tried using

file = sc.textFile("my_file.txt")\
         .filter(lambda x: not x.startswith("//") and x != "")\
         .collect()

data = sc.broadcast(file)

to open my file, but this results in

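Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.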

Any ideas?

Thanks in advance

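EDIT: Apologies, I didn't have my code to hand, so my test code didn't explain the problem I'm having very well. The error I originally reported was a result of the test data I was using.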

I've updated my question to better reflect the challenge I'm facing.

Answer 1

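Why do you need a class in this case? The code defining your class is incorrect: self.data is never declared before it is used in the __init__ method. The only relevant line of the class that affects the output you want is self.string=string, so you are basically passing the identity function as the UDF.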

The UnicodeDecodeError is caused by an encoding issue in your file; it has nothing to do with how you defined the class.
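
If an encoding problem is suspected, opening the file with an explicit encoding is a common first step. The sketch below assumes the list is UTF-8 encoded and that comment lines start with `//`, as in the question; `load_suffixes` is a hypothetical helper name, not part of the original code.

```python
import io


def load_suffixes(path, encoding="utf-8"):
    """Return the non-comment, non-blank lines of a suffix list file."""
    suffixes = []
    # io.open accepts an explicit encoding on both Python 2 and Python 3.
    with io.open(path, "r", encoding=encoding) as handle:
        for line in handle:
            entry = line.strip()
            if not entry or entry.startswith("//"):
                continue  # skip blank lines and comments
            suffixes.append(entry)
    return suffixes
```

Calling `load_suffixes("public_suffix_list.txt")` would then return a list of text strings regardless of the platform's default encoding.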

The second error occurs at the line sc.broadcast(file); details can be found here: Spark: Broadcast variables: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transforamtion
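
One way to sidestep this kind of error is to build the suffix list on the driver as a plain Python list, broadcast it once, and let the UDF capture only the broadcast handle. The following is a sketch under assumptions: `sc` is an existing SparkContext, `make_suffix_udf` and `first_suffix` are hypothetical names, and the `endswith` check is a simplification of real suffix matching.

```python
def first_suffix(hostname, suffixes):
    """Return the first suffix that `hostname` ends with, or None."""
    if hostname is None:
        return None
    for suffix in suffixes:
        if hostname == suffix or hostname.endswith("." + suffix):
            return suffix
    return None


def make_suffix_udf(sc, suffixes):
    """Wrap `first_suffix` in a UDF backed by a broadcast variable.

    pyspark is imported lazily so that `first_suffix` stays usable
    without a Spark installation.
    """
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    # Created on the driver; the broadcast payload is a plain list and
    # holds no reference to the SparkContext.
    suffixes_bc = sc.broadcast(list(suffixes))

    def lookup(hostname):
        # Runs on the workers; only the broadcast handle is captured.
        return first_suffix(hostname, suffixes_bc.value)

    return udf(lookup, StringType())
```

With a DataFrame `df` that has a `hostname` column (an assumed name), this could be applied as `df.select(make_suffix_udf(sc, suffixes)("hostname"))`.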

Edit 1

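I would redefine your class structure as follows. You basically need to create the instance attribute self.data by assigning self.data = data before you can use it. Also, anything written before the __init__ method is executed regardless of whether you ever call the class, so moving the file-parsing part out of the class will not have any effect.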
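
The point about code placed before `__init__` can be checked in plain Python. The sketch below uses a hypothetical `Demo` class and an `events` list to record when each part runs; it illustrates that the class body runs once, when the class is defined, while `__init__` runs once per instance.

```python
events = []


class Demo:
    # The class body executes once, when the class statement is evaluated.
    events.append("class body")
    data = ["com", "org"]

    def __init__(self, url):
        # __init__ executes once for every instance that is created.
        events.append("init " + url)
        self.url = url


first = Demo("a.com")
second = Demo("b.org")
print(events)  # ['class body', 'init a.com', 'init b.org']
```

Names bound in the class body, such as `data` here, are created once and shared by all instances.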

# list of available public suffixes for extracting top level domains
data = []
with open("public_suffix_list.txt", 'r') as file:
    for line in file:
        # skip comment lines and blank lines
        if line.startswith("//") or line == '\n':
            continue
        data.append(line.strip('\n'))

class Parser:
    def __init__(self, url):
        self.url = url
        self.data = data

        #the code here extracts port,protocol,query etc.
        #(it is assumed to set self.hostname and self.string)

        #I think this bit below is causing the error
        matches = [r for r in self.data if r in self.hostname]

        #extra functionality in my actual class

        i = matches.index(self.string)

        try:
            self.tld = matches[i]
        except IndexError:
            # logic to find tld if no match
            pass
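
The matching step is elided in the original, so the following is only an illustrative sketch of one possible approach, not the author's logic. It picks the longest listed suffix that the hostname ends with on a label boundary, and falls back to the last label when nothing matches. `find_tld` is a hypothetical name, and the wildcard and exception rules of the public suffix list are ignored.

```python
def find_tld(hostname, suffixes):
    """Return the longest suffix matching `hostname`, else its last label."""
    host = hostname.lower().rstrip(".")
    best = None
    for suffix in suffixes:
        # Match whole labels only, so "om" does not match "google.com".
        if host == suffix or host.endswith("." + suffix):
            if best is None or len(suffix) > len(best):
                best = suffix
    if best is not None:
        return best
    # Fallback: treat the last dot-separated label as the TLD.
    return host.rsplit(".", 1)[-1]
```

Unlike the substring test `r in self.hostname`, this only accepts a suffix that lines up with the end of the hostname, which avoids matches in the middle of a name.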

Using a Python class with a Spark DataFrame to parse URLs
