Question

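I'm new to PySpark and want to translate my existing pandas / Python code to PySpark.

I want to subset my dataframe so that it returns only the rows that contain specific keywords I'm looking for in the 'original_problem' field.

Below is the Python code I tried in PySpark: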

def pilot_discrep(input_file):

    df = input_file 

    searchfor = ['cat', 'dog', 'frog', 'fleece']

    df = df[df['original_problem'].str.contains('|'.join(searchfor))]

    return df

When I try to run the above, I get the following error:

AnalysisException: u"Can't extract value from original_problem#207: need struct type but got string;"

Answer 1

In PySpark, try this:

df = df[df['original_problem'].rlike('|'.join(searchfor))]

Or equivalently:

import pyspark.sql.functions as F
df = df.where(F.col('original_problem').rlike('|'.join(searchfor)))

Alternatively, you could use a udf, but the DataFrame methods are preferred because they will be faster.
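The rlike filter and the udf alternative mentioned above can be sketched as reusable functions. This is only a minimal sketch, not code from the original answer: the helper names (build_pattern, contains_any, filter_with_rlike, filter_with_udf) are hypothetical, and escaping the keywords with re.escape is an added assumption so that keywords containing regex metacharacters such as '.' or '+' are matched literally.

```python
import re


def build_pattern(keywords):
    """Join keywords into one regex alternation, escaping metacharacters."""
    return "|".join(re.escape(k) for k in keywords)


def contains_any(text, keywords):
    """Plain-Python substring check; None (a null cell) counts as no match."""
    return text is not None and any(k in text for k in keywords)


def filter_with_rlike(df, column, keywords):
    """Keep rows whose `column` matches any keyword, using Column.rlike."""
    # Imported lazily so the pure helpers above work without Spark installed.
    import pyspark.sql.functions as F
    return df.where(F.col(column).rlike(build_pattern(keywords)))


def filter_with_udf(df, column, keywords):
    """Same filter through a Python udf; usually slower than rlike."""
    import pyspark.sql.functions as F
    from pyspark.sql.types import BooleanType
    # The udf runs contains_any row by row in a Python worker process.
    match_udf = F.udf(lambda s: contains_any(s, keywords), BooleanType())
    return df.where(match_udf(F.col(column)))
```

With the names from the question, the call would look like df = filter_with_rlike(df, 'original_problem', searchfor). Both functions drop rows where the column is null.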

PySpark: Search for substrings in text and subset dataframe
