Using RegEx

Time: 2017-06-24 23:22:08

Tags: python regex string

I have the following string:

{"$deletedFields":["day"],"month":8,"year":2003,"$type":"com.linkedin.common.Date","$id":"urn:li:fs_position:(ACoAAAGiKv0BjXc8aE9HZLXpUnNcxQD4CoB1mKg,164599768),timePeriod,startDate"},

What I want is to use the key to search in the opposite direction (backwards from the key) to get the month and year.

key = 'ACoAAAGiKv0BjXc8aE9HZLXpUnNcxQD4CoB1mKg,164599768'

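I am actually scraping this data from a file, so the key is the only thing I have to tell the records apart.

I already have a regex, but I want it to search in the opposite direction. Let's say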

re.findall(r''+key+'.*?),\$deletedFields', page_html)

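If there were some negation or reverse option, it would grab the data back to $deletedFields.

I don't want to reverse the string of the whole file.

Required output

year: 2003, month: 8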

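2 answers:

Answer 0: (score: 1)

Edit
"keys which have different order so i just want to search in opposite direction till the $deletedfield"
After re-reading your question, it looks like you don't know where the beginning of the record is.

When records share a generic beginning and only the end (the key) is specific, it does not work well to specify the common record beginning and then match anything up to the key in the middle. The match would start at the first record beginning and run all the way to the key, possibly swallowing other keys along the way.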
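A minimal sketch of that problem, assuming two shortened records in which `KEY_A` and `KEY_B` are hypothetical stand-ins for real keys. The naive pattern anchors at the first record beginning, so it reports the month of the wrong record:

```python
import re

# Two shortened, hypothetical records; KEY_A and KEY_B stand in for real keys.
text = (
    '{"$deletedFields":["day"],"month":2,"year":2003,"$id":"(KEY_A)"},\n'
    '{"$deletedFields":["day"],"month":6,"year":2012,"$id":"(KEY_B)"},\n'
)

# Naive: start at a record beginning, then lazily match anything up to the key.
naive = re.search(r'(?s)"\$deletedFields":.*?"month":(\d+).*?KEY_B', text)

# The match begins at the FIRST record, so the captured month belongs to KEY_A.
print(naive.group(1))             # '2', although KEY_B's month is 6
print('KEY_A' in naive.group(0))  # True: the match spans both records
```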

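However, you can still search forward by resetting the start each time a new record beginning is encountered.

The regex below treats the date parts as unordered and optional. It also captures the key, since it is needed to identify the record.

Another feature: just by adding all the keys to the alternation, you can collect every key together with its date into an array of records.

So the regex model is $deletedfield + date parts + any of these keys, while making sure we never cross a record boundary along the way.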

(?s)"\$deletedFields":(?:"day":(?P<day>\d+),|"month":(?P<month>\d+),|"year":(?P<year>\d+),|(?!"\$deletedFields":).)*?(?P<key>ACoAAAGiKv0BjXc8aE9HZLXpUnNcxQD4CoB1mKg,164599768|BCoAAAGiKv0BjXc8aE9HZLXpUnNcxQD4CoB1mKg,264599768|CCoAAAGiKv0BjXc8aE9HZLXpUnNcxQD4CoB1mKg,364599768|DCoAAAGiKv0BjXc8aE9HZLXpUnNcxQD4CoB1mKg,464599768)

Expanded

 (?s)                          # Dot-All modifier

 "\$deletedFields":            # Beginning of record
 (?:
      "day":
      (?P<day> \d+ )                # (1), day
      ,
   |                              # or,
      "month":
      (?P<month> \d+ )              # (2), month
      ,
   |                              # or,
      "year": 
      (?P<year> \d+ )               # (3), year
      , 
   |                              # or,
      (?! "\$deletedFields": )      # any character, but not the beginning of record
      .     
 )*?

 (?P<key>                      # (4 start), Keys to find
      ACoAAAGiKv0BjXc8aE9HZLXpUnNcxQD4CoB1mKg,164599768
   |  BCoAAAGiKv0BjXc8aE9HZLXpUnNcxQD4CoB1mKg,264599768
   |  CCoAAAGiKv0BjXc8aE9HZLXpUnNcxQD4CoB1mKg,364599768
   |  DCoAAAGiKv0BjXc8aE9HZLXpUnNcxQD4CoB1mKg,464599768
 )                             # (4 end)
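The expanded layout can also be used directly in Python. The following is a sketch, assuming the flags are passed as `re.DOTALL | re.VERBOSE` instead of the inline `(?s)`; the name `RECORD_RX` is made up for illustration, and the key alternation is cut down to a single key to keep the block short.

```python
import re

# RECORD_RX is a hypothetical name; the body mirrors the expanded regex,
# with only one key in the alternation for brevity.
RECORD_RX = re.compile(r'''
    "\$deletedFields":                 # beginning of record
    (?:
          "day":   (?P<day>   \d+ ) ,
       |  "month": (?P<month> \d+ ) ,
       |  "year":  (?P<year>  \d+ ) ,
       |  (?! "\$deletedFields": ) .   # any character that does not start a new record
    )*?
    (?P<key> ACoAAAGiKv0BjXc8aE9HZLXpUnNcxQD4CoB1mKg,164599768 )
''', re.DOTALL | re.VERBOSE)

# The string from the question.
sample = '{"$deletedFields":["day"],"month":8,"year":2003,"$type":"com.linkedin.common.Date","$id":"urn:li:fs_position:(ACoAAAGiKv0BjXc8aE9HZLXpUnNcxQD4CoB1mKg,164599768),timePeriod,startDate"},'

m = RECORD_RX.search(sample)
print(m.group('month'), m.group('year'))  # day is None here: the record has no "day" value
```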

Python
http://rextester.com/XXH80293

import re

# Sample records; only the last one carries a "day" value.
text = (
  r'{"$deletedFields":"month":2,"year":2003,"$type":"com.linkedin.common.Date","$id":"urn:li:fs_position:(ACoAAAGiKv0BjXc8aE9HZLXpUnNcxQD4CoB1mKg,164599768),timePeriod,startDate"},' + "\n"
  r'{"$deletedFields":"month":12,"year":2001,"$type":"com.linkedin.common.Date","$id":"urn:li:fs_position:(DCoAAAGiKv0BjXc8aE9HZLXpUnNcxQD4CoB1mKg,464599768),timePeriod,startDate"},' + "\n"
  r'{"$deletedFields":"month":6,"year":2012,"$type":"com.linkedin.common.Date","$id":"urn:li:fs_position:(BCoAAAGiKv0BjXc8aE9HZLXpUnNcxQD4CoB1mKg,264599768),timePeriod,startDate"},' + "\n"
  r'{"$deletedFields":"day":30,"month":8,"year":2009,"$type":"com.linkedin.common.Date","$id":"urn:li:fs_position:(CCoAAAGiKv0BjXc8aE9HZLXpUnNcxQD4CoB1mKg,364599768),timePeriod,startDate"},' + "\n"
)
keys = ['ACoAAAGiKv0BjXc8aE9HZLXpUnNcxQD4CoB1mKg,164599768',
        'BCoAAAGiKv0BjXc8aE9HZLXpUnNcxQD4CoB1mKg,264599768',
        'CCoAAAGiKv0BjXc8aE9HZLXpUnNcxQD4CoB1mKg,364599768',
        'DCoAAAGiKv0BjXc8aE9HZLXpUnNcxQD4CoB1mKg,464599768']

# Escape each key so it is matched literally, then join them into one alternation.
rx_keys = '(?P<key>' + '|'.join(re.escape(k) for k in keys) + ')'

rx = r'(?s)"\$deletedFields":(?:"day":(?P<day>\d+),|"month":(?P<month>\d+),|"year":(?P<year>\d+),|(?!"\$deletedFields":).)*?' + rx_keys

# Each result is a tuple (day, month, year, key); a missing part is ''.
print(re.findall(rx, text))

Output

[('', '2', '2003', 'ACoAAAGiKv0BjXc8aE9HZLXpUnNcxQD4CoB1mKg,164599768'), ('', '12', '2001', 'DCoAAAGiKv0BjXc8aE9HZLXpUnNcxQD4CoB1mKg,464599768'), ('', '6', '2012', 'BCoAAAGiKv0BjXc8aE9HZLXpUnNcxQD4CoB1mKg,264599768'), ('30', '8', '2009', 'CCoAAAGiKv0BjXc8aE9HZLXpUnNcxQD4CoB1mKg,364599768')]
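To get the output the question asks for, for one key, the named groups can be read from a single match. This is only a sketch: `find_date` and `record_rx` are hypothetical names, and the sample text is the string from the question.

```python
import re

# Record start + unordered, optional date parts, never crossing into the next record.
record_rx = r'(?s)"\$deletedFields":(?:"day":(?P<day>\d+),|"month":(?P<month>\d+),|"year":(?P<year>\d+),|(?!"\$deletedFields":).)*?'

def find_date(key, text):
    """Return (year, month) as strings for the record containing key, or None."""
    m = re.search(record_rx + '(?P<key>' + re.escape(key) + ')', text)
    if m is None:
        return None
    return m.group('year'), m.group('month')

page_html = '{"$deletedFields":["day"],"month":8,"year":2003,"$type":"com.linkedin.common.Date","$id":"urn:li:fs_position:(ACoAAAGiKv0BjXc8aE9HZLXpUnNcxQD4CoB1mKg,164599768),timePeriod,startDate"},'
key = 'ACoAAAGiKv0BjXc8aE9HZLXpUnNcxQD4CoB1mKg,164599768'

result = find_date(key, page_html)
if result:
    print('year: {}, month: {}'.format(*result))  # year: 2003, month: 8
```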

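Answer 1: (score: 0)

I don't think you need a regex, since you know the key. (However, once the record has been identified, you may want to use a regex to parse it.) You can just search for the key, then search for the start-of-record and end-of-record markers, like this: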

one_line='''
{{"$deletedFields":["day"],"month":8,"year":2003,"$type":"com.linkedin.common.Date","$id":"urn:li:fs_position:(ACoAAAGiKv0BjXc8aE9HZLXpUnNcxQD4CoB1mKg,1645997{}),timePeriod,startDate"}},
'''

buncha_records = ''.join(one_line.strip().format(n) for n in range(100) if n % 2)


def find_record(key, text):
    # str.index raises ValueError if the key is not in the text.
    in_record = text.index(key)

    open_brace = text.rfind('{', 0, in_record)
    close_brace = text.find('}', in_record)

    return text[open_brace:close_brace+1]

import random

n = random.randrange(100)
random_key = "ACoAAAGiKv0BjXc8aE9HZLXpUnNcxQD4CoB1mKg,1645997{}".format(n)

try:
    print("Searching for key:", random_key)
    record = find_record(random_key, buncha_records)

    print("Got record:")
    print(record)
except ValueError:
    # str.index raises ValueError (not IndexError) when the key is absent.
    print("Key '{}' was not found in records.".format(random_key))
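Once the record is isolated, one possible way to pull out the month and year is `json.loads` rather than a regex. This is a sketch under two assumptions: each record is a valid JSON object (as the string in the question is), and it contains no nested braces. The second record in `text` is a made-up variation for illustration.

```python
import json

def find_record(key, text):
    # Same approach as above: locate the key, then the enclosing braces.
    in_record = text.index(key)  # raises ValueError if the key is absent
    open_brace = text.rfind('{', 0, in_record)
    close_brace = text.find('}', in_record)
    return text[open_brace:close_brace + 1]

# First record is the string from the question; the second is hypothetical.
text = (
    '{"$deletedFields":["day"],"month":8,"year":2003,"$type":"com.linkedin.common.Date","$id":"urn:li:fs_position:(ACoAAAGiKv0BjXc8aE9HZLXpUnNcxQD4CoB1mKg,164599768),timePeriod,startDate"},'
    '{"$deletedFields":["day"],"month":6,"year":2012,"$type":"com.linkedin.common.Date","$id":"urn:li:fs_position:(BCoAAAGiKv0BjXc8aE9HZLXpUnNcxQD4CoB1mKg,264599768),timePeriod,startDate"},'
)
key = 'ACoAAAGiKv0BjXc8aE9HZLXpUnNcxQD4CoB1mKg,164599768'

# Assumes the isolated record parses as JSON; month and year come back as ints.
record = json.loads(find_record(key, text))
print('year: {}, month: {}'.format(record['year'], record['month']))
```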