Question

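I've been trying to use Python to build a simple account-manager-style application for myself. It reads the SMS messages from my phone and extracts information from them based on some regex patterns.

I wrote a complex regex pattern and tested it on https://pythex.org/. For example: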

Text: 1.00 is debited from ******1234  for food

Pattern: (account|a\/c|ac|from|acct|savings|credit in|ac\/|sb\-|acc|a\/c)(\s|\.|\-)*(no|number)*(\.|\s|:)*\s*(ending)*\s*(((n{1,}|x{1,}|[0-9]+|\*{1,}))+)\-*((n{1,}|x{1,}|[0-9]+|\*{1,}|\s))*\-*([0-9]*)

Result: from ******1234

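However, when I try to do the same thing in Python with the str.extract() method, instead of a single result I get a DataFrame with one column per group.

The Python code looks like this: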

all_sms=pd.read_csv("all_sms.csv")

pattern = '(account|a\/c|ac|from|acct|savings|credit in|ac\/|sb\-|acc|a\/c)(\s|\.|\-)*(no|number)*(\.|\s|:)*\s*(ending)*\s*(((n{1,}|x{1,}|[0-9]+|\*{1,}))+)\-*((n{1,}|x{1,}|[0-9]+|\*{1,}|\s))*\-*([0-9]*)'

test = all_sms.extract(pattern, expand = False)

Output of the Python code for the message above:

0           from
1               
2            NaN
3            NaN
4            NaN
5     ******1234
6           1234
7           1234
8               
9               
10

I'm new to Python and trying to learn through hands-on practice. It would be very helpful if someone could point out where I'm going wrong.

Answer 1

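Before digging into your regex pattern, you should understand why you are using pandas. Pandas is well suited to data analysis (and therefore to your problem), but it seems like overkill here.

If you are a beginner, I'd suggest sticking with pure Python. That is not because pandas is complicated, but because knowing the Python standard library will help you in the long run. Skipping the basics now may hurt you later.

Assuming you are going to use Python 3 (without pandas), I would proceed as follows: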

# Needed imports from the standard library.
import csv
import re

# Declare the constants of my tiny program.
# A raw string keeps the backslashes in the pattern intact.
PATTERN = r'(account|a\/c|ac|from|acct|savings|credit in|ac\/|sb\-|acc|a\/c)(\s|\.|\-)*(no|number)*(\.|\s|:)*\s*(ending)*\s*(((n{1,}|x{1,}|[0-9]+|\*{1,}))+)\-*((n{1,}|x{1,}|[0-9]+|\*{1,}|\s))*\-*([0-9]*)'
COMPILED_REGEX = re.compile(PATTERN)

# This list will store the matched text.
found_regexes = list()

# Do the necessary loading to enable searching for the regex.
with open('mysmspath.csv', newline='') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=' ', quotechar='"')
    # Iterate over rows in your csv file.
    for row in csv_reader:
        # csv.reader yields each row as a list of fields, so join them
        # back into a single string before searching.
        match = COMPILED_REGEX.search(' '.join(row))
        if match:
            # group(0) is the full text matched by the pattern.
            found_regexes.append(match.group(0))

print(found_regexes)

This won't necessarily solve your copy-paste problem, but it may give you an idea of a simpler way to approach it.
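On the output in the question itself: as far as I know, str.extract returns one column per capturing group, whereas pythex highlights the whole match. The sketch below uses only the standard library `re` module on the sample message from the question to show the difference between the full match (`group(0)`) and the per-group values (`groups()`). The names `TEXT`, `full_match`, `per_group`, and `WRAPPED` are just illustrative. Wrapping the pattern in one extra outer group is one possible way to get the whole match as the first group; I'd expect the same trick to put the whole match in the first column of str.extract, but treat that as an assumption to verify.

```python
import re

# Pattern copied from the question, written as a raw string.
PATTERN = r'(account|a\/c|ac|from|acct|savings|credit in|ac\/|sb\-|acc|a\/c)(\s|\.|\-)*(no|number)*(\.|\s|:)*\s*(ending)*\s*(((n{1,}|x{1,}|[0-9]+|\*{1,}))+)\-*((n{1,}|x{1,}|[0-9]+|\*{1,}|\s))*\-*([0-9]*)'

# Sample message from the question (note the two spaces after the digits).
TEXT = "1.00 is debited from ******1234  for food"

match = re.search(PATTERN, TEXT)

# group(0) is the whole matched substring; strip() drops the trailing
# whitespace that the pattern also consumes.
full_match = match.group(0).strip()

# groups() holds one value per capturing group, in order. These are the
# values that end up spread over separate columns.
per_group = match.groups()

# Illustrative alternative: add one outer capturing group so that the
# whole match is also available as group 1.
WRAPPED = '(' + PATTERN + ')'
outer = re.search(WRAPPED, TEXT).group(1)

print(full_match)
print(per_group)
print(outer == match.group(0))
```

On the sample message, `full_match` should come out as `from ******1234`, and `per_group` has 11 entries, which lines up with the 11 rows (0 to 10) in the question's output.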

How do I extract only a single string from a regex match in Python?
