Question

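I need to write some user-defined functions in Python for a Pig data-transformation job. To describe the scenario: the data is parsed and fed in, and the Pig script will essentially call this Python UDF for every data field in a column.

Most of the UDFs are similar in nature: I basically need to match a string against 'something + wildcard'. I know regex and have used it so far, but before going further I want to make sure it is an efficient way to match strings, since the script will iterate and call the UDF thousands of times.

Example: say we have a field in which we need to match sales. The possible values of this field could be anything, because the source data might go haywire in the future, randomly append something, and spit out saleslol. Other possible values are sales., salessales, sales.yes.

Whatever follows sales doesn't matter; if the value starts with sales, I want to catch it.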

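Is the following approach efficient? The word variable is the input, i.e. the value in the sales column. The @outputSchema line is for the Pig script.

import re

@outputSchema("num:int")
def rule2(word):
  sales_match = re.match('sales', word, flags=re.IGNORECASE)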

  if sales_match:
    return 1
  else:
    return 0

I have another scenario where I need to match four exact, known strings. Is this also efficient?

@outputSchema("num:int")
def session1(word):
  if word in ['first', 'second', 'third', 'fourth']:
    return 1
  else:
    return 0

Answer 1

You can use str.startswith():

>>> [s for s in 'saleslol. Other possible values are sales. salessales sales.yes'.split()
...  if s.lower().startswith('sales')]
['saleslol.', 'sales.', 'salessales', 'sales.yes']

You also don't need to do this in Python:

if word in ['first', 'second', 'third', 'fourth']:
    return 1
else:
    return 0

Instead, it is better to return the result of the membership test directly:

def session1(word):
    return word in {'first', 'second', 'third', 'fourth'}

(Note the set literal rather than a list; the syntax would be the same with a list.)
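Since the Pig schema in the question declares num:int, you may want an int back rather than a bool. A minimal sketch, assuming the set of known strings is fixed; the constant name KNOWN_SESSIONS is made up for illustration, and the @outputSchema decorator is left out so the snippet runs outside Pig:

```python
# Built once at import time rather than inside the function body.
KNOWN_SESSIONS = frozenset(['first', 'second', 'third', 'fourth'])


def session1(word):
    """Return 1 for an exact match against the known strings, else 0."""
    # int(True) == 1 and int(False) == 0, matching the num:int schema.
    return int(word in KNOWN_SESSIONS)


print(session1('third'))     # 1
print(session1('thirdlol'))  # 0
```

Whether hoisting the set out of the function makes a measurable difference depends on the interpreter, so treat it as a readability choice first.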

For the prefix test, your function would be:

def f(word):
    return word.startswith('sales')    # returns True or False
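Note that f is case-sensitive, whereas the regex in the question used re.IGNORECASE. A sketch of a closer replacement for rule2 follows. It assumes a null field reaches the UDF as None, and the name starts_with_sales is hypothetical; add the @outputSchema("num:int") decorator back when registering it with Pig.

```python
def starts_with_sales(word):
    """Return 1 if word starts with 'sales' in any letter case, else 0."""
    # Assumption: a null field arrives as None, which has no .lower().
    if word is None:
        return 0
    # lower() makes the prefix check case-insensitive, like re.IGNORECASE.
    return int(word.lower().startswith('sales'))


print(starts_with_sales('SalesLOL'))  # 1
print(starts_with_sales('resales'))   # 0
```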

If you want to test several possible words, use any:

>>> def test(tgt, words):
...    return any(word.startswith(tgt) for word in words)
>>> test('sales', {'boom', 'blast', 'saleslol'})
True
>>> test('boombang', {'sales', 'boom', 'blast'})
False

Conversely, if you want to test several prefixes, use the tuple form of startswith:

>>> 'tenthhaha'.startswith(('first', 'second', 'third', 'fourth'))
False
>>> 'firstlol'.startswith(('first', 'second', 'third', 'fourth'))
True
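One thing to watch: startswith accepts a string or a tuple of strings, but not a list. If your prefixes live in a list, convert them first. A small sketch, with PREFIXES and starts_with_any as illustrative names only:

```python
# Hypothetical list of prefixes, e.g. assembled elsewhere in the script.
PREFIXES = ['first', 'second', 'third', 'fourth']


def starts_with_any(word, prefixes=PREFIXES):
    """Return True if word starts with any of the given prefixes."""
    # str.startswith takes a str or a tuple of str, so convert the list.
    return word.startswith(tuple(prefixes))


print(starts_with_any('firstlol'))   # True
print(starts_with_any('tenthhaha'))  # False
```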

Answer 2

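Actually, for some reason function A seems to be faster. I ran 1 million loops on each function and, if my measurements are correct, the first one is about 20% faster.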


from pythonbenchmark import compare, measure

def session1_A(word):
  if word in ['first', 'second', 'third', 'fourth']:
    return 1
  else:
    return 0

def session1_B(word):
    return word in {'first', 'second', 'third', 'fourth'}

compare(session1_A, session1_B, 1000000, "fourth")


https://github.com/Karlheinzniebuhr/pythonbenchmark/
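If you would rather not install a third-party package, the same comparison can be sketched with the standard library's timeit module. The helper best_time is a made-up name, and no particular result is implied; numbers will vary with interpreter and machine.

```python
import timeit


def session1_A(word):
    if word in ['first', 'second', 'third', 'fourth']:
        return 1
    else:
        return 0


def session1_B(word):
    return word in {'first', 'second', 'third', 'fourth'}


def best_time(func, arg, number=1000000, repeat=3):
    """Return the fastest of `repeat` runs, each calling func(arg) `number` times."""
    timer = timeit.Timer(lambda: func(arg))
    # Taking the minimum reduces noise from other processes.
    return min(timer.repeat(repeat=repeat, number=number))


if __name__ == '__main__':
    for func in (session1_A, session1_B):
        print(func.__name__, best_time(func, 'fourth'))
```

'fourth' is the last element of the list, so it is the slowest hit for the list version. It may be worth timing a word that is not present as well, since with only four items the difference between a list and a set is likely to be small either way.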

What is the most efficient way to match strings in Python?
