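Trying to differentiate numbers using Python

Time: 2017-12-27 20:43:50

Tags: python regex

I'm pretty sure the answer to this question lies in regular expressions, but I can't get it right.

I have a list of text. In that text I'm parsing out a bunch of different things, but I specifically need to tell certain numbers apart from one another. I have customer numbers that are sequential but have gaps (1, 2, 4, 5 ... 1900, 1901, 1905), I have year numbers (2001, 2015, 2016), and finally I have billing amounts (0.00, 43.24, 1,925.00, 10,324.95).

I need to be able to tell billing amounts from year numbers. Billing amounts include commas when the amount is $1,000 or more and always have two digits to the right of the decimal point; year numbers use no commas and have no decimal point. I can detect customer numbers by where they are positioned.

I have been trying to do this by testing against a regular expression: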

currency_matcher = re.compile(r'\d*[,]*\d*\d[.]\d*')
if currency_matcher.search(line) is not None:
    pass  # assume currency

I haven't specifically tried to find the years yet, but I assume I would do something similar:

year_matcher = re.compile(r'\d\d\d\d')

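I also considered using try and except, but I have only just started experimenting with that as I typed this.

Any help or suggestions are appreciated.

Edit: trying to add some clarity. I have a text document that has been split into a Python list like this: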

[
  "1", "Alice Alwen", "12345 Oak Street", "Anytown, US 12345", "0.00", "123.45", "2007", "Pontiac",
  "2", "Bob Bobberson", "1919 Elm Road", "Metropolis, US 11111", "123.45", "0.00", "2016", "Sherman Tank", "2105", "Bradley Fighting Vehicle",
  "5", "Carl Carlson", "9854 Willow Way #1", "Gotham City, GA 34567", "1,001.00", "2,300.00", "2015", "Batmobile - used"
]

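I can always tell where the customer ID is, because it is the first item. After the personal information, there are two "currencies" in my example (my dataset has four, but the idea is the same). I want to be able to identify these and extract them. At the end of each row you will also see year numbers associated with vehicles. I don't need those, but I need to make sure I don't accidentally grab them when grabbing the currencies.

I have already noted (through code) where each customer sits in my dataset, so I can do something like: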

for cust in customers:
    currency_list = []
    for line in lines[begin_line : end_line]:
        if {magical regex here}:
            currency_list.append(line)
    {pseudo code to extract currency into my DataFrame}

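Hopefully this is more helpful; if not, I'm happy to add more.

Edit 2: actual code. I figured that while I'm here, this is what I have written. It is erroring out, and it may also be logically inaccurate: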

life_total_dict = {}
currency_matcher = re.compile('\d*[,]*\d*\d[.]\d*')
for index, row in customers.itterrows():
    start_row = row["Line Number"]
    end_row = row["End Line Number"]

    currency_counter = 0
    for line in workable_lines[range(start_row, end_row)]:
        #if re.search(currency_matcher,line) != None:
        if currency_matcher.search(line) != None:
            if currency_counter == 1:
                life_total_dict[index] = line.strip()
                currency_counter += 1
            else:
                currency_counter += 1
print(life_total_dict)
customer = customer.append(life_total_dict, ignore_index=True)

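In case you're wondering about the odd counter: I actually only need the second currency amount. The first, third, and fourth are just noise to me.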
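Two things in the code above would raise errors regardless of the regex: the pandas method is spelled `iterrows`, and a plain Python list cannot be indexed with `range(...)` (a slice `workable_lines[start_row:end_row]` is the usual form). The following is only a sketch of the "take the second currency per customer" logic without pandas. The helper `second_currency` and the `customer_ranges` dictionary are hypothetical stand-ins for the "Line Number" / "End Line Number" columns, not part of the original code.

```python
import re

# Same pattern as in the question, written as a raw string.
currency_matcher = re.compile(r'\d*[,]*\d*\d[.]\d*')

def second_currency(lines, start_row, end_row):
    """Return the second currency-looking line in lines[start_row:end_row], or None."""
    matches = [line.strip() for line in lines[start_row:end_row]
               if currency_matcher.search(line) is not None]
    return matches[1] if len(matches) > 1 else None

workable_lines = [
    "1", "Alice Alwen", "12345 Oak Street", "Anytown, US 12345",
    "0.00", "123.45", "2007", "Pontiac",
    "2", "Bob Bobberson", "1919 Elm Road", "Metropolis, US 11111",
    "123.45", "0.00", "2016", "Sherman Tank",
]

# Hypothetical (start, end) positions for each customer index.
customer_ranges = {0: (0, 8), 1: (8, 16)}

life_total_dict = {index: second_currency(workable_lines, start, end)
                   for index, (start, end) in customer_ranges.items()}
print(life_total_dict)  # {0: '123.45', 1: '0.00'}
```

Collecting all matches first and indexing into them replaces the manual counter; with four amounts per customer, `matches[1]` is still the second one.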

3 Answers:

Answer 0 (score: 2)

This can be done with a regex:

currency_matcher = re.compile(r"^(\d+,)*\d+\.(\d{2})$")
...
if currency_matcher.search(line.strip()) is not None:
    pass
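As a quick check, the pattern can be run against a few values like those in the question. This is only a sketch: it uses `re.fullmatch` (available since Python 3.4) in place of the explicit `^` and `$` anchors, and the helper name `is_currency` and the `samples` list are made up for illustration.

```python
import re

# Pattern from the answer, without ^ and $ because fullmatch anchors both ends.
currency_matcher = re.compile(r"(\d+,)*\d+\.(\d{2})")

def is_currency(text):
    """True if the whole (stripped) string looks like a billing amount."""
    return currency_matcher.fullmatch(text.strip()) is not None

samples = ["0.00", "1,001.00", "10,324.95", "2007", "5", "12345 Oak Street", "43.2"]
amounts = [s for s in samples if is_currency(s)]
print(amounts)  # ['0.00', '1,001.00', '10,324.95']
```

Years and customer IDs are rejected because they contain no decimal point, and `43.2` is rejected because the pattern requires exactly two digits after the point.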

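Answer 1 (score: 2)

Assuming you are only asking for the regex (the rest of your code looks fine), below are two regular expressions that work on your sample dataset.

Note that the results for Year also include the results for the IDs. The OP stated that they have logic to tell the two apart, so I did not feel it necessary to add that logic to my answer.

Code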

\d+(,\d+)*\.\d+$       # Currency
\d{4}$                 # Year

Usage

See code in use here

import re

array = ["1", "Alice Alwen", "12345 Oak Street", "Anytown, US 12345", "0.00", "123.45", "2007", "Pontiac", "2", "Bob Bobberson", "1919 Elm Road", "Metropolis, US 11111", "123.45", "0.00", "2016", "Sherman Tank", "2105", "Bradley Fighting Vehicle", "5", "Carl Carlson", "9854 Willow Way #1", "Gotham City, GA 34567", "1,001.00", "2,300.00", "2015", "Batmobile - used"]
r1 = r"\d+(,\d+)*\.\d+$"
r2 = r"\d{4}$"

for s in array:
    if re.match(r1, s):
        print("Currency: " + s)
    if re.match(r2, s):
        print("Year: " + s)

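Explanation

Currency

  • re.match(): anchors the match at the start of the string (like ^).
  • \d+ matches one or more digits
  • (,\d+)* matches the following any number of times
    • , matches the comma character literally
    • \d+ matches one or more digits
  • \. matches the dot character literally
  • \d+ matches one or more digits
  • $ asserts position at the end of the line

Year

  • re.match(): anchors the match at the start of the string (like ^).
  • \d{4} matches any digit exactly 4 times
  • $ asserts position at the end of the line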
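Both patterns are fairly permissive: the currency pattern accepts any number of digits between commas and after the decimal point, and the year pattern also accepts any four-digit customer ID. A stricter sketch is shown below. It assumes amounts always use three-digit thousands groups and exactly two decimals (as the question describes) and that years fall in a 1900-2099 range; that range and the `classify` helper are assumptions, not part of the answer.

```python
import re

# Assumption: 1-3 leading digits, then comma-separated groups of 3, then 2 decimals.
CURRENCY = re.compile(r"\d{1,3}(,\d{3})*\.\d{2}")
# Assumption: plausible years are 1900-2099. The sample value "2105" falls outside this.
YEAR = re.compile(r"(19|20)\d{2}")

def classify(token):
    """Label a token; four-digit IDs still need position-based logic to separate from years."""
    if CURRENCY.fullmatch(token):
        return "currency"
    if YEAR.fullmatch(token):
        return "year or id"
    return "other"

for token in ["0.00", "1,925.00", "10,324.95", "2016", "1905", "5", "Pontiac"]:
    print(token, "->", classify(token))
```

A value such as `1905` is still ambiguous between a year and a customer ID, which is why the position-based logic mentioned in the question remains necessary.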

Answer 2 (score: 0)

One way to avoid regular expressions is to use type conversion. It assumes that year and ID values do not overlap.

minyear = 2000
new_data = []
for x in data:
    try:
        float_val = float(x)
        int_val = int(float_val)
        if float_val == int_val:
            if int_val >= minyear:
                new_data.append((int_val, "year"))
            else:
                new_data.append((int_val, "id"))
        else:
            new_data.append((float_val, "amount"))
    except ValueError:
        new_data.append((x, "string"))

The results are collected in new_data as (value, label) tuples.
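Note that `float("1,001.00")` raises `ValueError`, and `"0.00"` converts to a whole number, so on the question's sample data this approach would label comma-grouped amounts as strings and `0.00` as an id. A hedged variation follows. It assumes that any numeric token containing a decimal point is an amount and that commas inside numbers are only thousands separators; the `classify` helper and the shortened `data` list are illustrative.

```python
MIN_YEAR = 2000  # same threshold as minyear in the answer

def classify(token):
    """Return a (value, label) tuple for one list item."""
    cleaned = token.strip().replace(",", "")  # assumption: commas are thousands separators
    try:
        if "." in cleaned:
            # Assumption: numeric tokens with a decimal point are amounts.
            return (float(cleaned), "amount")
        int_val = int(cleaned)
    except ValueError:
        return (token, "string")
    return (int_val, "year" if int_val >= MIN_YEAR else "id")

data = ["1", "Alice Alwen", "Anytown, US 12345", "0.00", "1,001.00", "2015", "Batmobile - used"]
new_data = [classify(x) for x in data]
print(new_data)
```

As with the original, a customer ID of 2000 or above would still be labelled as a year, so the non-overlap assumption continues to apply.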