Get URLs from a line of a file in Python

Time: 2014-04-24 23:35:46

Tags: python regex python-2.7 python-3.x

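This is one line of a file. I only want the URL that follows uri and the URL that follows smallPictureUrl so that I can use them later, but I can't find a proper way to get them.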

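The asterisks stand for letters, digits, or a mix of both. They differ on every line, so they are no help: there is no pattern in them that I could exploit.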

{"bigPictureUrl":"https://fbcdn-profile-a.akamaihd.net/*-*-*/*.*.*.*/*/*.jpg", 
"timelineCoverPhoto":"{\"focus\":{\"x\":0.5,\"y\":0.49137931034483},\"photo\":{\"__type__

\":{\"name\":\"Photo\"},\"image_lowres\":{\"uri\":\"https://fbcdn-*-*-*.*.*/*-*-*/*.jpg
\",\"width\":180,\"height\":135}}}",
    "subscribeStatus":"IS_SUBSCRIBED","smallPictureUrl":"https://fbcdn-profile-a.akamaihd.net/*-*-*/*.*.*.*/*/*.jpg",

Something simpler:

{"displayName":"Jim Test","firstName":"*","lastName":"*"} 

I managed to get the name Jim Test that follows displayName with re.search('(?<="displayName":")(\w+) (\w+)',line), but the other case is much more complicated. I would appreciate any pointers or advice.

A full line looks like this:

{"bigPictureUrl":"https://fbcdn-profile-a.akamaihd.net/hprofile-ak-prn2/*.*.*.*/s200x200/*_*_*_*.jpg","timelineCoverPhoto":"{\"focus\":{\"x\":0.5,\"y\":0.40652557319224},\"photo\":{\"__type__\":{\"name\":\"Photo\"},\"image_lowres\":{\"uri\":\"https://fbcdn-photos-h-a.akamaihd.net/hphotos-ak-prn2/*_*_*_a.jpg\",\"width\":180,\"height\":120}}}","subscribeStatus":"IS_SUBSCRIBED","smallPictureUrl":"https://fbcdn-profile-a.akamaihd.net/hprofile-ak-prn2/*.*.*.*/s100x100/*_*_*_a.jpg","contactId":"**==","contactType":"USER","friendshipStatus":"ARE_FRIENDS","graphApiWriteId":"contact_*:*:*","hugePictureUrl":"https://fbcdn-profile-a.akamaihd.net/hprofile-ak-prn2/*.*.*.*/s720x720/*_*_*_*.jpg","profileFbid":"*","isMobilePushable":"NO","lookupKey":null,"name":{"displayName":"* *","firstName":"*","lastName":"*"},"nameSearchTokens":["*","*"],"phones":[],"phoneticName":{"displayName":null,"firstName":null,"lastName":null},"isMemorialized":false,"communicationRank":0.4183731,"canViewerSendGift":false,"canMessage":true}

3 answers:

Answer 0: (score: 2)

The value associated with timelineCoverPhoto appears to be stringified JSON, so you can do something ugly like this:

import json

# One record from the file, already parsed into a dict.
s = {
    "subscribeStatus": "IS_SUBSCRIBED",
    "bigPictureUrl": "https://fbcdn-profile-a.akamaihd.net/*-*-*/*.*.*.*/*/*.jpg",
    "timelineCoverPhoto": "{\"focus\":{\"x\":0.5,\"y\":0.49137931034483},\"photo\":{\"__type__\":{\"name\":\"Photo\"},\"image_lowres\":{\"uri\":\"https://fbcdn-*-*-*.*.*/*-*-*/*.jpg \",\"width\":180,\"height\":135}}}",
    "smallPictureUrl": "https://fbcdn-profile-a.akamaihd.net/*-*-*/*.*.*.*/*/*.jpg"
}

# timelineCoverPhoto holds JSON encoded as a string, so decode it a second time.
j = json.loads(s.get('timelineCoverPhoto'))
print("uri: " + j.get('photo').get('image_lowres').get('uri'))

uri: https://fbcdn-*-*-*.*.*/*-*-*/*.jpg 
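The same idea can be applied to a raw line read from the file, assuming each line is one complete JSON object like the full example in the question. The sketch below is only an illustration: the function name `extract_urls` and the example.com URLs are made up, and it assumes every record contains both timelineCoverPhoto and smallPictureUrl.

```python
import json


def extract_urls(line):
    """Return (uri, smallPictureUrl) for one JSON line (hypothetical helper)."""
    record = json.loads(line)                          # outer JSON object
    cover = json.loads(record["timelineCoverPhoto"])   # nested JSON stored as a string
    uri = cover["photo"]["image_lowres"]["uri"].strip()  # drop stray whitespace
    return uri, record["smallPictureUrl"]


# Build a sample line with the same nesting as the data in the question.
# The URLs are placeholders, not real data.
cover_photo = {
    "focus": {"x": 0.5, "y": 0.4},
    "photo": {"image_lowres": {"uri": "https://example.com/lowres.jpg",
                               "width": 180, "height": 120}},
}
sample_line = json.dumps({
    "bigPictureUrl": "https://example.com/big.jpg",
    "timelineCoverPhoto": json.dumps(cover_photo),
    "smallPictureUrl": "https://example.com/small.jpg",
})

uri, small = extract_urls(sample_line)
print(uri)
print(small)
```

Indexing with square brackets raises KeyError when a field is missing, which may be preferable to silently getting None if some lines are malformed.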

Answer 1: (score: 2)

# See: http://daringfireball.net/2010/07/improved_regex_for_matching_urls
import re
from urllib.request import urlopen

# Pattern from the article above for finding URLs embedded in free text.
GRUBER_URLINTEXT_PAT = re.compile(r'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?\xab\xbb\u201c\u201d\u2018\u2019]))')

for raw_line in urlopen("http://daringfireball.net/misc/2010/07/url-matching-regex-test-data.text"):
    line = raw_line.decode("utf-8", errors="replace")
    # findall() returns one tuple of groups per match; the first group is the whole URL.
    print([mgroups[0] for mgroups in GRUBER_URLINTEXT_PAT.findall(line)])
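For lines shaped like the one in the question, a much simpler pattern may be enough. The sketch below is not the pattern from the article; it is a deliberately naive substitute that assumes every URL starts with http:// or https:// and ends at the next double quote, backslash, or whitespace. The names `SIMPLE_URL_PAT` and `line` and the example.com URLs are illustrative only.

```python
import re

# Naive URL pattern: scheme, then everything up to a quote, backslash or whitespace.
SIMPLE_URL_PAT = re.compile(r'https?://[^\s"\\]+')

# A shortened line with the same escaping as the data in the question.
line = r'{"bigPictureUrl":"https://example.com/big.jpg","timelineCoverPhoto":"{\"photo\":{\"image_lowres\":{\"uri\":\"https://example.com/lowres.jpg\"}}}","smallPictureUrl":"https://example.com/small.jpg"}'

urls = SIMPLE_URL_PAT.findall(line)
print(urls)
```

This returns every URL on the line in order of appearance, so the caller still has to know which position corresponds to which key.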

Answer 2: (score: 1)

If you can't use json, how about this?

>>> import re
>>> print(mytext)

{"bigPictureUrl":"https://fbcdn-profile-a.akamaihd.net/*-*-*/*.*.*.*/*/*.jpg", 
"timelineCoverPhoto":"{"focus":{"x":0.5,"y":0.49137931034483},"photo":{"__type__

":{"name":"Photo"},"image_lowres":{"uri":"https://fbcdn-*-*-*.*.*/*-*-*/*.jpg
","width":180,"height":135}}}",
    "subscribeStatus":"IS_SUBSCRIBED","smallPictureUrl":"https://fbcdn-profile-a.akamaihd.net/*-*-*/*.*.*.*/*/*.jpg",

>>> uri = re.findall(r'uri\"\:\"[\'"]?([^\'" >]+)', mytext) #gets the uri
>>> smallpicurl = re.findall(r'smallPictureUrl\"\:\"[\'"]?([^\'" >]+)', mytext) # gets the smallPictureUrl
>>> ''.join(uri).rstrip()
'https://fbcdn-*-*-*.*.*/*-*-*/*.jpg' # uri
>>> ''.join(smallpicurl).rstrip()
'https://fbcdn-profile-a.akamaihd.net/*-*-*/*.*.*.*/*/*.jpg' # smallPictureUrl
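In the raw file the nested quotes are escaped as \", which the patterns above do not account for. A variant that tolerates an optional backslash before each quote might look like the sketch below. The helper name `url_after` and the example.com URLs are hypothetical, and the sketch assumes the value never contains a quote, backslash, or whitespace.

```python
import re


def url_after(key, text):
    """Return the string value that follows "key": in text, or None (hypothetical helper)."""
    # \\?" matches a quote with or without a preceding backslash.
    pattern = r'\\?"' + re.escape(key) + r'\\?"\s*:\s*\\?"([^"\\\s]+)'
    match = re.search(pattern, text)
    return match.group(1) if match else None


# A shortened line with the same escaping as the data in the question.
raw = r'{"bigPictureUrl":"https://example.com/big.jpg","timelineCoverPhoto":"{\"photo\":{\"image_lowres\":{\"uri\":\"https://example.com/lowres.jpg\"}}}","smallPictureUrl":"https://example.com/small.jpg"}'

print(url_after("uri", raw))
print(url_after("smallPictureUrl", raw))
```

Returning None for a missing key lets the caller skip incomplete lines without catching an exception.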