I have the following string:
{"$deletedFields":["day"],"month":8,"year":2003,"$type":"com.linkedin.common.Date","$id":"urn:li:fs_position:(ACoAAAGiKv0BjXc8aE9HZLXpUnNcxQD4CoB1mKg,164599768),timePeriod,startDate"},
What I want is to use the key to search in the opposite (backward) direction and get the month and year.
key = 'ACoAAAGiKv0BjXc8aE9HZLXpUnNcxQD4CoB1mKg,164599768'
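I am actually scraping data from a file, so the key is the only thing I can rely on to tell the records apart.
I have already written a regular expression, but I want to search in the opposite direction. Let's say: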
re.findall(r''+key+'.*?),\$deletedFields', page_html)
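If there were some negative or reverse option, it would grab the data backwards until $deletedFields.
I do not want to reverse the string, because that would mean changing the whole file.
Required output:
year: 2003, month: 8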
Answer 0 (score: 1)
Edit
"keys which have different order, so I just want to search in the opposite direction until the $deletedFields"
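After rereading your question, it looks like you don't know where the beginning of the record is.
For example, if records share a general beginning but have a definite end, it does no good to specify the common record beginning and then match anything up to the key in the middle: that matches from the first beginning all the way to the key, possibly grabbing other keys along the way.
However, you can still search forward by resetting the beginning each time you encounter a new one.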
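To make the difference concrete, here is a minimal sketch comparing a plain lazy match with one that refuses to step over a new record start. The two records and the keys KEY_A and KEY_B are shortened, hypothetical stand-ins for the real data.

```python
import re

# Two hypothetical records, reduced from the question's data for illustration.
text = (
    '{"$deletedFields":["day"],"month":8,"year":2003,"$id":"(KEY_A),startDate"},\n'
    '{"$deletedFields":["day"],"month":6,"year":2012,"$id":"(KEY_B),startDate"},\n'
)

start = r'"\$deletedFields":'

# Naive: the lazy dot still begins at the FIRST record start it finds,
# so the match for KEY_B swallows the whole KEY_A record as well.
naive = re.search(r'(?s)' + start + r'.*?KEY_B', text)

# Tempered: a character is consumed only if a new record does not start
# there, so the match cannot extend across a record boundary.
tempered = re.search(r'(?s)' + start + r'(?:(?!' + start + r').)*?KEY_B', text)

print('KEY_A' in naive.group(0))     # True
print('KEY_A' in tempered.group(0))  # False
```

The second pattern is still a forward search; the negative lookahead is what effectively resets the beginning whenever a new record starts.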
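The regex below handles date parts that are optional and appear in any order. It also captures the key, because it is needed.
Another feature is that, by simply adding all the keys to the alternation, every key and its date parts are captured together into an array of records.
So the regex model is: $deletedFields + date parts + any of these keys.
It also makes sure we never cross a record boundary along the way.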
(?s)"\$deletedFields":(?:"day":(?P<day>\d+),|"month":(?P<month>\d+),|"year":(?P<year>\d+),|(?!"\$deletedFields":).)*?(?P<key>ACoAAAGiKv0BjXc8aE9HZLXpUnNcxQD4CoB1mKg,164599768|BCoAAAGiKv0BjXc8aE9HZLXpUnNcxQD4CoB1mKg,264599768|CCoAAAGiKv0BjXc8aE9HZLXpUnNcxQD4CoB1mKg,364599768|DCoAAAGiKv0BjXc8aE9HZLXpUnNcxQD4CoB1mKg,464599768)
Expanded
(?s) # Dot-All modifier
"\$deletedFields": # Beginning of record
(?:
"day":
(?P<day> \d+ ) # (1), day
,
| # or,
"month":
(?P<month> \d+ ) # (2), month
,
| # or,
"year":
(?P<year> \d+ ) # (3), year
,
| # or,
(?! "\$deletedFields": ) # any character, but not the beginning of record
.
)*?
(?P<key> # (4 start), Keys to find
ACoAAAGiKv0BjXc8aE9HZLXpUnNcxQD4CoB1mKg,164599768
| BCoAAAGiKv0BjXc8aE9HZLXpUnNcxQD4CoB1mKg,264599768
| CCoAAAGiKv0BjXc8aE9HZLXpUnNcxQD4CoB1mKg,364599768
| DCoAAAGiKv0BjXc8aE9HZLXpUnNcxQD4CoB1mKg,464599768
) # (4 end)
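The expanded form can be used directly in Python if it is compiled with re.VERBOSE, which ignores whitespace and "#" comments inside the pattern. The following is only a sketch: the name RECORD_RX is made up for illustration, and a single key is listed to keep it short.

```python
import re

# The expanded pattern, compiled with re.VERBOSE so that whitespace and
# "#" comments inside the pattern are ignored. Only one key is listed here.
RECORD_RX = re.compile(r'''
    "\$deletedFields":                  # beginning of record
    (?:
        "day":   (?P<day>   \d+ ) ,     # optional day
      | "month": (?P<month> \d+ ) ,     # optional month
      | "year":  (?P<year>  \d+ ) ,     # optional year
      | (?! "\$deletedFields": ) .      # any character that does not start a new record
    )*?
    (?P<key> ACoAAAGiKv0BjXc8aE9HZLXpUnNcxQD4CoB1mKg,164599768 )
''', re.VERBOSE | re.DOTALL)

record = ('{"$deletedFields":["day"],"month":8,"year":2003,'
          '"$type":"com.linkedin.common.Date","$id":"urn:li:fs_position:'
          '(ACoAAAGiKv0BjXc8aE9HZLXpUnNcxQD4CoB1mKg,164599768),timePeriod,startDate"},')

m = RECORD_RX.search(record)
print(m.group('year'), m.group('month'), m.group('day'))  # 2003 8 None
```

The day group is None here because the record only mentions "day" inside the $deletedFields list, not as a "day":N field.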
Python
http://rextester.com/XXH80293
import re
# Sample records; each one starts at "$deletedFields" and contains its key
text = (
    r'{"$deletedFields":"month":2,"year":2003,"$type":"com.linkedin.common.Date","$id":"urn:li:fs_position:(ACoAAAGiKv0BjXc8aE9HZLXpUnNcxQD4CoB1mKg,164599768),timePeriod,startDate"},' + "\n"
    r'{"$deletedFields":"month":12,"year":2001,"$type":"com.linkedin.common.Date","$id":"urn:li:fs_position:(DCoAAAGiKv0BjXc8aE9HZLXpUnNcxQD4CoB1mKg,464599768),timePeriod,startDate"},' + "\n"
    r'{"$deletedFields":"month":6,"year":2012,"$type":"com.linkedin.common.Date","$id":"urn:li:fs_position:(BCoAAAGiKv0BjXc8aE9HZLXpUnNcxQD4CoB1mKg,264599768),timePeriod,startDate"},' + "\n"
    r'{"$deletedFields":"day":30,"month":8,"year":2009,"$type":"com.linkedin.common.Date","$id":"urn:li:fs_position:(CCoAAAGiKv0BjXc8aE9HZLXpUnNcxQD4CoB1mKg,364599768),timePeriod,startDate"},' + "\n"
)
keys = ['ACoAAAGiKv0BjXc8aE9HZLXpUnNcxQD4CoB1mKg,164599768',
        'BCoAAAGiKv0BjXc8aE9HZLXpUnNcxQD4CoB1mKg,264599768',
        'CCoAAAGiKv0BjXc8aE9HZLXpUnNcxQD4CoB1mKg,364599768',
        'DCoAAAGiKv0BjXc8aE9HZLXpUnNcxQD4CoB1mKg,464599768']
# Alternation of all keys to look for (these keys contain no regex metacharacters)
rx_keys = '(' + '|'.join( keys ) + ')'
Rx = r'(?s)"\$deletedFields":(?:"day":(?P<day>\d+),|"month":(?P<month>\d+),|"year":(?P<year>\d+),|(?!"\$deletedFields":).)*?' + rx_keys
# Each result tuple is (day, month, year, key); a missing date part is an empty string
print(re.findall(Rx, text))
Output
[('', '2', '2003', 'ACoAAAGiKv0BjXc8aE9HZLXpUnNcxQD4CoB1mKg,164599768'), ('', '12', '2001', 'DCoAAAGiKv0BjXc8aE9HZLXpUnNcxQD4CoB1mKg,464599768'), ('', '6', '2012', 'BCoAAAGiKv0BjXc8aE9HZLXpUnNcxQD4CoB1mKg,264599768'), ('30', '8', '2009', 'CCoAAAGiKv0BjXc8aE9HZLXpUnNcxQD4CoB1mKg,364599768')]
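To get from a result list like this to the "year: 2003, month: 8" form the question asks for, one option is to iterate with finditer and index the matches by key. The sketch below assumes shortened, hypothetical keys ('KEY_A,1', 'KEY_B,2') and records, and gives the key group a name so it can be read from groupdict.

```python
import re

# Hypothetical, shortened keys and records, used only to keep the sketch small.
keys = ['KEY_A,1', 'KEY_B,2']
text = (
    '{"$deletedFields":["day"],"month":8,"year":2003,"$id":"urn:(KEY_A,1),startDate"},\n'
    '{"$deletedFields":[],"day":30,"year":2009,"month":5,"$id":"urn:(KEY_B,2),startDate"},\n'
)

rx = re.compile(
    r'(?s)"\$deletedFields":'
    r'(?:"day":(?P<day>\d+),|"month":(?P<month>\d+),|"year":(?P<year>\d+),'
    r'|(?!"\$deletedFields":).)*?'
    r'(?P<key>' + '|'.join(re.escape(k) for k in keys) + r')'
)

# Map each key to the date parts captured in the same record.
dates = {m.group('key'): m.groupdict() for m in rx.finditer(text)}

wanted = dates['KEY_A,1']
print('year: {year}, month: {month}'.format(**wanted))  # year: 2003, month: 8
```

With groupdict, a date part that is absent from a record comes back as None rather than the empty string that findall returns.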
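Answer 1 (score: 0)
I don't think you need a regular expression, since you know the key. (However, you may want to use a regex to parse the record once it has been identified.) You can just search for the key, then search for the start and end record markers, like this: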
one_line='''
{{"$deletedFields":["day"],"month":8,"year":2003,"$type":"com.linkedin.common.Date","$id":"urn:li:fs_position:(ACoAAAGiKv0BjXc8aE9HZLXpUnNcxQD4CoB1mKg,1645997{}),timePeriod,startDate"}},
'''
buncha_records = ''.join(one_line.strip().format(n) for n in range(100) if n % 2)
def find_record(key, text):
    # str.index raises ValueError when the key is absent.
    # Note: a key that is a prefix of a longer key will also match.
    in_record = text.index(key)
    # Search backwards for the opening brace, forwards for the closing one
    open_brace = text.rfind('{', 0, in_record)
    close_brace = text.find('}', in_record)
    return text[open_brace:close_brace+1]
import random
try:
    n = random.randrange(100)
    random_key = "ACoAAAGiKv0BjXc8aE9HZLXpUnNcxQD4CoB1mKg,1645997{}".format(n)
    print("Searching for key:", random_key)
    record = find_record(random_key, buncha_records)
    print("Got record:")
    print(record)
except ValueError:
    print("Key '{}' was not found in records.".format(random_key))
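Once the record has been cut out this way, the month and year still have to be extracted from it. As an alternative to the regex parse mentioned above, the sketch below uses json.loads, assuming each record is a flat, valid JSON object with no nested braces, as in the sample string from the question. The helper name find_date is made up for illustration.

```python
import json

def find_date(key, text):
    """Return (year, month) for the record containing key, or None if absent."""
    pos = text.find(key)             # -1 when the key is missing
    if pos == -1:
        return None
    start = text.rfind('{', 0, pos)  # search backwards for the record start
    end = text.find('}', pos)        # search forwards for the record end
    record = json.loads(text[start:end + 1])
    return record.get('year'), record.get('month')

text = ('{"$deletedFields":["day"],"month":8,"year":2003,'
        '"$type":"com.linkedin.common.Date","$id":"urn:li:fs_position:'
        '(ACoAAAGiKv0BjXc8aE9HZLXpUnNcxQD4CoB1mKg,164599768),timePeriod,startDate"},')

print(find_date('ACoAAAGiKv0BjXc8aE9HZLXpUnNcxQD4CoB1mKg,164599768', text))  # (2003, 8)
print(find_date('no-such-key', text))  # None
```

If the scraped records turn out not to be valid JSON, a re.search for "month" and "year" on the extracted slice would be the fallback.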