colRegex returns an error in PySpark 3.0 (Python 3.7)

Time: 2019-12-11 21:45:07

Tags: python-3.x pyspark colregex

I have a PySpark DataFrame in which some columns have the suffix _24.

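The columns are:

df.columns = ['timestamp',
 'air_temperature_median_24',
 'air_temperature_median_6',
 'wind_direction_mean_24',
 'wind_speed',
 'building_id']

PySpark otherwise runs fine, so this is most likely a syntax error.

I tried to select the _24 columns with the colRegex method, but the code below raises an exception. The traceback shows the call made through ashrae.select; this df.select variant fails as well:

df.select(ashrae.colRegex(".+'_24'")).show()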

    ---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-103-a8189f0298e6> in <module>
----> 1 ashrae.select(ashrae.colRegex(".+'_24'")).show()

C:\spark\spark-3.0.0-preview-bin-hadoop2.7\python\pyspark\sql\dataframe.py in colRegex(self, colName)
    957         if not isinstance(colName, basestring):
    958             raise ValueError("colName should be provided as string")
--> 959         jc = self._jdf.colRegex(colName)
    960         return Column(jc)
    961 

C:\spark\spark-3.0.0-preview-bin-hadoop2.7\python\lib\py4j-0.10.8.1-src.zip\py4j\java_gateway.py in __call__(self, *args)
   1284         answer = self.gateway_client.send_command(command)
   1285         return_value = get_return_value(
-> 1286             answer, self.gateway_client, self.target_id, self.name)
   1287 
   1288         for temp_arg in temp_args:

C:\spark\spark-3.0.0-preview-bin-hadoop2.7\python\pyspark\sql\utils.py in deco(*a, **kw)
     96     def deco(*a, **kw):
     97         try:
---> 98             return f(*a, **kw)
     99         except py4j.protocol.Py4JJavaError as e:
    100             converted = convert_exception(e.java_exception)

C:\spark\spark-3.0.0-preview-bin-hadoop2.7\python\lib\py4j-0.10.8.1-src.zip\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name)
    326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
--> 328                     format(target_id, ".", name), value)
    329             else:
    330                 raise Py4JError(

Py4JJavaError: An error occurred while calling o151.colRegex.
: java.lang.StringIndexOutOfBoundsException: String index out of range: -1
    at java.lang.String.charAt(Unknown Source)
    at scala.collection.immutable.StringOps$.apply$extension(StringOps.scala:41)
    at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.parseAttributeName(unresolved.scala:202)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:121)
    at org.apache.spark.sql.Dataset.resolve(Dataset.scala:259)
    at org.apache.spark.sql.Dataset.colRegex(Dataset.scala:1364)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Unknown Source)

What is causing the exception, and how can I correct the code?

1 answer:

Answer 0 (score: 2)

Try wrapping the regular expression in backticks. When you use colRegex, the column name must be quoted with backticks.
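As a minimal sketch of the backtick quoting described in this answer: the helper names below (backtick_quote, select_by_regex, columns_matching) are hypothetical and not part of PySpark; only DataFrame.select and DataFrame.colRegex are real PySpark calls. The pattern .+_24 is an assumption based on the suffix mentioned in the question. The last helper is a plain-Python alternative that filters df.columns with the re module, which may be useful for checking which names a pattern should pick up.

```python
import re


def backtick_quote(pattern):
    """Wrap a regex in backticks, the quoting colRegex expects (hypothetical helper)."""
    return "`" + pattern + "`"


def select_by_regex(df, pattern):
    """Select columns of a PySpark DataFrame via colRegex with a backtick-quoted pattern."""
    return df.select(df.colRegex(backtick_quote(pattern)))


def columns_matching(columns, pattern):
    """Plain-Python alternative: keep the names that fully match the regex."""
    return [c for c in columns if re.fullmatch(pattern, c)]


# Column names taken from the question.
columns = [
    "timestamp",
    "air_temperature_median_24",
    "air_temperature_median_6",
    "wind_direction_mean_24",
    "wind_speed",
    "building_id",
]

suffix_24 = columns_matching(columns, r".+_24")
print(suffix_24)  # ['air_temperature_median_24', 'wind_direction_mean_24']
```

With a real DataFrame, select_by_regex(df, ".+_24").show() would correspond to df.select(df.colRegex("`.+_24`")).show(), and df.select(*columns_matching(df.columns, r".+_24")) avoids colRegex altogether.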