Question

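I am trying to scrape interest rates from several websites. The data is very unstructured, but the format is close enough from site to site. What I want to capture:

x.xx% to xx.xx%

Example of what the data looks like:

All loans made by WebBank, Member FDIC. Your actual rate depends upon credit score, loan amount, loan term, and credit usage & history. The APR ranges from 5.98% to 35.89%. For example, you could receive a loan of $6,000 with an interest rate of 7.99% and a 5.00% origination fee of $300 for an APR of 11.51%. In this example, you will receive $5,700 and will make 36 monthly payments of $187.99. The total amount repayable will be $6,767.64. Your APR will be determined based on your credit at time of application. The origination fee ranges from 1% to 6% and the average origination fee is 5.49% as of Q1 2017. There is no down payment and there is never a prepayment penalty. Closing of your loan is contingent upon your agreement of all the required agreements and disclosures on the www.lendingclub.com website. All loans via LendingClub have a minimum repayment term of 36 months or longer.

3.09% – 14.24%*

Fixed rates: 6.99% to 24.99% APR Lock in your rate. Your monthly payment will never change.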

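What I want to capture are the rate ranges (5.98% to 35.89%, 3.09% – 14.24%, and 6.99% to 24.99%). My current regex is as follows: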

re.findall(r'(?i)(\d\.\d\d% (?:to|-) \d\d\.\d\d%)', n.string)

The actual code is as follows:

# Imports assumed from the aliases used below (r = requests, bs = BeautifulSoup)
import datetime
import re

import requests as r
from bs4 import BeautifulSoup as bs

plcompetitors = ['https://www.lendingclub.com/loans/personal-loans',
                'https://www.marcus.com/us/en/personal-loans',
                'https://www.discover.com/personal-loans/',
                'https://www.lightstream.com/',
                'https://www.prosper.com/']

# Cycle through the links and look for APR ranges (fixed or variable) using regex
for link in plcompetitors:
    cdate = datetime.date.today()
    l = r.get(link)
    l.encoding = 'utf-8'
    data = l.text
    soup = bs(data, 'html.parser')
    # Text nodes that contain a digit followed by %
    paragraph = soup.find_all(text=re.compile('[0-9]%'))
    for n in paragraph:
        matches = []
        matches.extend(re.findall(r'(?i)(\d\.\d\d% (?:to|-) \d\d\.\d\d%)', n.string))
        matches.append(cdate.isoformat())
        matches.append(link)
        print(matches)
    paragraph.append(cdate.isoformat())
    paragraph.append(link)

New output:

['5.98% to 35.89%', '2018-06-22', 'https://www.lendingclub.com/loans/personal-loans']
['2018-06-22', 'https://www.lendingclub.com/loans/personal-loans']
['6.99% to 24.99%', '6.99% to 24.99%', '6.99% to 24.99%', '6.99% to 24.99%', '2018-06-22', 'https://www.marcus.com/us/en/personal-loans']
['2018-06-22', 'https://www.marcus.com/us/en/personal-loans']
['2018-06-22', 'https://www.discover.com/personal-loans/']
['2018-06-22', 'https://www.discover.com/personal-loans/']
['2018-06-22', 'https://www.discover.com/personal-loans/']
['6.99% to 24.99%', '2018-06-22', 'https://www.discover.com/personal-loans/']
['2018-06-22', 'https://www.discover.com/personal-loans/']
['2018-06-22', 'https://www.discover.com/personal-loans/']
['2018-06-22', 'https://www.discover.com/personal-loans/']
['2018-06-22', 'https://www.discover.com/personal-loans/']
['2018-06-22', 'https://www.lightstream.com/']
['2018-06-22', 'https://www.lightstream.com/']
['2018-06-22', 'https://www.lightstream.com/']
['2018-06-22', 'https://www.lightstream.com/']
['2018-06-22', 'https://www.lightstream.com/']
['2018-06-22', 'https://www.lightstream.com/']
['2018-06-22', 'https://www.lightstream.com/']
['2018-06-22', 'https://www.prosper.com/']
['2018-06-22', 'https://www.prosper.com/']
['2018-06-22', 'https://www.prosper.com/']
['2018-06-22', 'https://www.prosper.com/']
['2018-06-22', 'https://www.prosper.com/']
['2018-06-22', 'https://www.prosper.com/']
['2018-06-22', 'https://www.prosper.com/']
['2018-06-22', 'https://www.prosper.com/']
['2018-06-22', 'https://www.prosper.com/']
['2018-06-22', 'https://www.prosper.com/']
['2018-06-22', 'https://www.prosper.com/']

Answer 1

Edit: Per your comment, run the following in Python 3, which by default should handle the example string as ASCII:

Input

import re

input = '''All loans made by WebBank, Member FDIC. Your actual rate depends upon credit score, loan amount, loan term, and credit usage & history. The APR ranges from 5.98% to 35.89%. For example, you could receive a loan of $6,000 with an interest rate of 7.99% and a 5.00% origination fee of $300 for an APR of 11.51%. In this example, you will receive $5,700 and will make 36 monthly payments of $187.99. The total amount repayable will be $6,767.64. Your APR will be determined based on your credit at time of application. The origination fee ranges from 1% to 6% and the average origination fee is 5.49% as of Q1 2017. There is no down payment and there is never a prepayment penalty. Closing of your loan is contingent upon your agreement of all the required agreements and disclosures on the www.lendingclub.com website. All loans via LendingClub have a minimum repayment term of 36 months or longer.

3.09% – 14.24%*

Fixed rates: 6.99% to 24.99% APR Lock in your rate. Your monthly payment will never change.'''
# Non-specific regex (I'm cheating): allows up to 5 arbitrary characters between the two percentages
output = re.findall(r'\d{1,3}\.\d+%[\S\s]{0,5}\d{1,3}\.\d+%', input)
print('output:')
print(output)

# More specific -- you can edit this in several ways
output_1 = re.findall(r'\d{1,3}\.\d+%[to\-\s]+\d{1,3}\.\d+%', input)
print('\noutput_1:')
print(output_1)

# What you need if you copy+paste from Stack into Python2.7.X,
# where the en dash shows up as the bytes \xe2\x80\x93
output_2 = re.findall(r'\d{1,3}\.\d+%\s*(?:to|-|\xe2\x80\x93)\s*\d{1,3}\.\d+%', input)
print('\noutput_2 (Python2.X):')
print(output_2)

Output

output:
['5.98% to 35.89%', '3.09% - 14.24%', '6.99% to 24.99%']

output_1:
['5.98% to 35.89%', '3.09% - 14.24%', '6.99% to 24.99%']

output_2 (Python2.X):
['5.98% to 35.89%', '3.09% \xe2\x80\x93 14.24%', '6.99% to 24.99%']
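
On Python 3, str patterns match Unicode code points rather than UTF-8 bytes, so the en dash can be written as \u2013 instead of the byte escapes. The following is only a minimal sketch: it assumes the separator is always "to", a hyphen, or an en dash, and the names RANGE_RE and sample are made up for this illustration.

```python
import re

# Hypothetical names; the set of separators is an assumption.
RANGE_RE = re.compile(r'\d{1,3}\.\d+%\s*(?:to|[-\u2013])\s*\d{1,3}\.\d+%')

# Shortened version of the example text, with an en dash in the middle range.
sample = (
    'The APR ranges from 5.98% to 35.89%. '
    '3.09% \u2013 14.24%* '
    'Fixed rates: 6.99% to 24.99% APR'
)

ranges = RANGE_RE.findall(sample)
print(ranges)
```

Because the pattern has no capturing groups, findall returns each full range as one string.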

Answer 2

The line paragraph = soup.find_all(text=re.compile('(?i)(\d\.\d\d% (?:to|-) \d\d\.\d\d%)')) gets all the nodes whose text matches your pattern. You still need to extract the matches from those paragraphs.

Use something like

# pattern is your regex (a string or a compiled pattern)
matches = []
for n in paragraph:
    matches.extend(re.findall(pattern, n.string))
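
The new output in the question shows that many text nodes contain a % sign but no rate range, so they produce rows holding only the date and link. One possible way to tidy that up is sketched below, with plain strings standing in for the n.string values. The helper name collect_ranges and the sample nodes are hypothetical, and the pattern is the question's own with the outer capturing group dropped.

```python
import datetime
import re

PATTERN = re.compile(r'(?i)\d\.\d\d% (?:to|-) \d\d\.\d\d%')

def collect_ranges(node_texts, link, today=None):
    """Return one row [range..., date, link], or None if nothing matched."""
    today = today or datetime.date.today()
    found = []
    for text in node_texts:
        # n.string can be None for some nodes, so fall back to ''
        for match in PATTERN.findall(text or ''):
            if match not in found:  # keep only the first occurrence
                found.append(match)
    if not found:
        return None
    return found + [today.isoformat(), link]

# Stand-ins for the strings of the nodes returned by soup.find_all(...)
nodes = [
    'Fixed rates: 6.99% to 24.99% APR',
    'a 5.00% origination fee of $300',   # has a % but no range
    'Rates from 6.99% to 24.99% APR',    # same range again
]
row = collect_ranges(nodes, 'https://www.marcus.com/us/en/personal-loans',
                     today=datetime.date(2018, 6, 22))
print(row)
```

Returning None for pages without any range lets the caller decide whether to skip the site or log it.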

As for the pattern itself, you can use

(?i)\d+(?:\.\d+)?%\s*(?:to|-)\s*\d+(?:\.\d+)?%

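See the regex demo. Details:

(?i) - case-insensitive matching is enabled
\d+(?:\.\d+)? - 1 or more digits, optionally followed by . and 1 or more digits
% - a % sign
\s* - 0 or more whitespace characters
(?:to|-) - to or -
\s*\d+(?:\.\d+)?% - see above (in short, whitespace, an int or float value, followed by %).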
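
To get numbers rather than strings, the same pattern can be given two named groups. This is only a sketch: the group names low and high and the helper parse_ranges are introduced here for illustration, and the pattern still recognizes only "to" and a plain hyphen as separators.

```python
import re

# Same pattern as above, with the two numbers wrapped in named groups.
RATE_RANGE = re.compile(
    r'(?i)(?P<low>\d+(?:\.\d+)?)%\s*(?:to|-)\s*(?P<high>\d+(?:\.\d+)?)%'
)

def parse_ranges(text):
    """Return a list of (low, high) float pairs found in text."""
    return [(float(m.group('low')), float(m.group('high')))
            for m in RATE_RANGE.finditer(text)]

print(parse_ranges('The APR ranges from 5.98% TO 35.89%.'))
print(parse_ranges('The origination fee ranges from 1%-6%'))
```

Note that this looser pattern also picks up integer ranges such as the origination fee, so it may need a keyword check (for example, APR nearby) if only interest rates are wanted.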

Regex capture specific percent/decimals
