How do I get URLs from a text file using Python?

Date: 2017-04-09 13:54:26

Tags: python python-2.7 search text-files

I want to get all the hostPageDisplayUrl values from a text file I have. A few lines of it are listed below:

{"instrumentation": {"pageLoadPingUrl": "https://www.bingapis.com/api/ping/pageload?IG=D4C9291ACBE848AE9C3C4C6EAED1E6AC&CID=0FD7B016F5926C780285BA44F4806D9E&Type=Event.CPT&DATA=0"}, "_type": "Images", "displayRecipeSourcesBadges": true, "value": [{"contentUrl": "http://www.bing.com/cr?IG=D4C9291ACBE848AE9C3C4C6EAED1E6AC&CID=0FD7B016F5926C780285BA44F4806D9E&rd=1&h=QWSSSaNP6OdarVmpdZ2TGGupNBCF0-Ue_w2zKVqczwk&v=1&r=http%3a%2f%2fphotos.wikimapia.org%2fp%2f00%2f02%2f91%2f36%2f73_big.jpg&p=DevEx,5008.1", "accentColor": "2B3C71", "height": 375, "hostPageDisplayUrl": "wikimapia.org/1649944/Bahawalpur-Railway-Station", "name": "Bahawalpur Railway Station - Bahawalpur (\u0628\u06c1\u0627\u0648\u0644\u067e\u0648\u0631)", "width": 500, "imageId": "5464C96913992D44983D02E302F166C57BC6DA26", "imageInsightsToken": "ccid_CUojXAsn*mid_5464C96913992D44983D02E302F166C57BC6DA26*simid_608054236795568956", "datePublished": "2010-02-21T22:19:00", "encodingFormat": "jpeg", "webSearchUrl": "https://www.bing.com/cr?IG=D4C9291ACBE848AE9C3C4C6EAED1E6AC&CID=0FD7B016F5926C780285BA44F4806D9E&rd=1&h=Fbz9jxTPMT44aF3aWlDgNwU7Zhr3qYbOco653N9vnIc&v=1&r=https%3a%2f%2fwww.bing.com%2fimages%2fsearch%3fview%3ddetailv2%26FORM%3dOIIRPO%26q%3dBahawalpurRailwayStation%26id%3d5464C96913992D44983D02E302F166C57BC6DA26%26simid%3d608054236795568956&p=DevEx,5006.1", "hostPageUrl": "http://www.bing.com/cr?IG=D4C9291ACBE848AE9C3C4C6EAED1E6AC&CID=0FD7B016F5926C780285BA44F4806D9E&rd=1&h=MVElDiTqkKkcRJKEQxgr1yxRbwh-DpMNfT7lA6g1ivg&v=1&r=http%3a%2f%2fwikimapia.org%2f1649944%2fBahawalpur-Railway-Station&p=DevEx,5007.1", "thumbnailUrl": "https://tse1.mm.bing.net/th?id=OIP.CUojXAsnV5KRBVF6-RIlLwEsDh&pid=Api", "thumbnail": {"width": 300, "height": 225}, "contentSize": "38571 B"}, {"contentUrl": 
"http://www.bing.com/cr?IG=D4C9291ACBE848AE9C3C4C6EAED1E6AC&CID=0FD7B016F5926C780285BA44F4806D9E&rd=1&h=yrOFma0zG8eUzUVY0l7jt_KfBAXPyuTyuXa9jJjeFR0&v=1&r=http%3a%2f%2fstatic.panoramio.com%2fphotos%2flarge%2f84118355.jpg&p=DevEx,5014.1", "accentColor": "A36728", "height": 768, "hostPageDisplayUrl": "panoramio.com/photo/84118355", "name": "Panoramio - Photo of Bahawalpur railway station", "width": 1024, "imageId": "FE04EA82163F27DC0A8449CF2086E4DA4F359DF7", "imageInsightsToken": "ccid_1683LeSg*mid_FE04EA82163F27DC0A8449CF2086E4DA4F359DF7*simid_608010054465029867", "datePublished": "2013-01-01T12:00:00", "encodingFormat": "jpeg", "webSearchUrl": "https://www.bing.com/cr?IG=D4C9291ACBE848AE9C3C4C6EAED1E6AC&CID=0FD7B016F5926C780285BA44F4806D9E&rd=1&h=0NEX-sC8BaLrZ9HDkSbA_7kztZ1BoVoihkkvnL2tGiQ&v=1&r=https%3a%2f%2fwww.bing.com%2fimages%2fsearch%3fview%3ddetailv2%26FORM%3dOIIRPO%26q%3dBahawalpurRailwayStation%26id%3dFE04EA82163F27DC0A8449CF2086E4DA4F359DF7%26simid%3d608010054465029867&p=DevEx,5012.1", "hostPageUrl": "http://www.bing.com/cr?IG=D4C9291ACBE848AE9C3C4C6EAED1E6AC&CID=0FD7B016F5926C780285BA44F4806D9E&rd=1&h=l9wqPINQPoe9u5N_qiFUtBQ6PrxdwEPiwObrCwBTQ2U&v=1&r=http%3a%2f%2fpanoramio.com%2fphoto%2f84118355&p=DevEx,5013.1", "thumbnailUrl": "https://tse2.mm.bing.net/th?id=OIP.1683LeSgJHoFhxX-tKhGSAEsDh&pid=Api", "thumbnail": {"width": 300, "height": 225}, "contentSize": "125011 B"}, {"contentUrl": "http://www.bing.com/cr?IG=D4C9291ACBE848AE9C3C4C6EAED1E6AC&CID=0FD7B016F5926C780285BA44F4806D9E&rd=1&h=1OS0LXGeQJbC9gOsRy00e-ae0535j7iNl4qiaNTTG0I&v=1&r=http%3a%2f%2fphotos.wikimapia.org%2fp%2f00%2f05%2f21%2f47%2f89_big.jpg&p=DevEx,5020.1", "accentColor": "5B4F36", "height": 361, "hostPageDisplayUrl": "wikimapia.org/1649944/Bahawalpur-Railway-Station", "name": "Bahawalpur Railway Station - Bahawalpur (\u0628\u06c1\u0627\u0648\u0644\u067e\u0648\u0631)", "width": 500, "imageId": "5464C96913992D44983D6D8CBD36CB6E679FEA3C", "imageInsightsToken": 
"ccid_JhLSwAc0*mid_5464C96913992D44983D6D8CBD36CB6E679FEA3C*simid_607998234704153808", "datePublished": "2016-12-09T20:58:00", "encodingFormat": "jpeg", "webSearchUrl": "https://www.bing.com/cr?IG=D4C9291ACBE848AE9C3C4C6EAED1E6AC&CID=0FD7B016F5926C780285BA44F4806D9E&rd=1&h=IJTtTeRFNBA0xr1DyZcz6AMb43pJFV25m3WrDfLhQls&v=1&r=https%3a%2f%2fwww.bing.com%2fimages%2fsearch%3fview%3ddetailv2%26FORM%3dOIIRPO%26q%3dBahawalpurRailwayStation%26id%3d5464C96913992D44983D6D8CBD36CB6E679FEA3C%26simid%3d607998234704153808&p=DevEx,5018.1", "hostPageUrl": "http://www.bing.com/cr?IG=D4C9291ACBE848AE9C3C4C6EAED1E6AC&CID=0FD7B016F5926C780285BA44F4806D9E&rd=1&h=MVElDiTqkKkcRJKEQxgr1yxRbwh-DpMNfT7lA6g1ivg&v=1&r=http%3a%2f%2fwikimapia.org%2f1649944%2fBahawalpur-Railway-Station&p=DevEx,5019.1", "thumbnailUrl": "https://tse1.mm.bing.net/th?id=OIP.JhLSwAc0HwFeWsHjAUYStgEsDY&pid=Api", "thumbnail": {"width": 300, "height": 216}, "contentSize": "28945 B"}, {"contentUrl": "http://www.bing.com/cr?IG=D4C9291ACBE848AE9C3C4C6EAED1E6AC&CID=0FD7B016F5926C780285BA44F4806D9E&rd=1&h=t6oOsr-23sNP-TFFzn39BVuagjYmXknVGiIWYD_tJv0&v=1&r=http%3a%2f%2fnativepakistan.com%2fwp-content%2fuploads%2fPhoto-of-Bahawalpur-RailwayS-tation-Photos-of-Bahawalpur.jpg&p=DevEx,5026.1", "accentColor": "49418A", "height": 347, "hostPageDisplayUrl": "nativepakistan.com/photos-of-bahawalpur", "name": "Photo of Bahawalpur Railway Station - Photos of Bahawalpur", "width": 500, "imageId": "7A05E50C94144666BFEB7BEECE6FB3DFC3313E18", "imageInsightsToken": "ccid_wS0pep46*mid_7A05E50C94144666BFEB7BEECE6FB3DFC3313E18*simid_607992170213084482", "datePublished": "2012-09-21T23:07:00", "encodingFormat": "jpeg", "webSearchUrl": 
"https://www.bing.com/cr?IG=D4C9291ACBE848AE9C3C4C6EAED1E6AC&CID=0FD7B016F5926C780285BA44F4806D9E&rd=1&h=2kFu0Xn07bcJKuZI03iY3Ihq99ZiKFOvd0PXvVWqt94&v=1&r=https%3a%2f%2fwww.bing.com%2fimages%2fsearch%3fview%3ddetailv2%26FORM%3dOIIRPO%26q%3dBahawalpurRailwayStation%26id%3d7A05E50C94144666BFEB7BEECE6FB3DFC3313E18%26simid%3d607992170213084482&p=DevEx,5024.1", "hostPageUrl": "http://www.bing.com/cr?IG=D4C9291ACBE848AE9C3C4C6EAED1E6AC&CID=0FD7B016F5926C780285BA44F4806D9E&rd=1&h=ht8SkbUIRgMkFq4yXvbHpmsINok4VTcxu0FiwMayk9A&v=1&r=http%3a%2f%2fnativepakistan.com%2fphotos-of-bahawalpur%2f&p=DevEx,5025.1", "thumbnailUrl": "https://tse3.mm.bing.net/th?id=OIP.wS0pep46eEsGSSY39RNxLQEsDQ&pid=Api", "thumbnail": {"width": 300, "height": 20

I am using this code but not getting accurate results:

start = 0
while True:                                                       
  p = data[start:].find('hostPageDisplayUrl')                         
  if p == -1: buffer                                            
  q = data[start+p+12:].find('hostPageDisplayUrl')                           
  r = data[start+p+q+12:].find('.')                             
  print (data[start+p+q+12:start+p+q+r+12] , file = log)        
  start = start+p+q+r+12

2 answers:

Answer 0 (score: 0)

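As mentioned, your data appears to be a JSON file, but the excerpt you posted is not complete, valid JSON. Once you have confirmed with a JSON validator that the file really is valid JSON, you can do the following: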

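import json

def _finditems(obj, key):  # adapted from http://stackoverflow.com/a/14962509/2585092
    """Recursively collect every value stored under `key` in nested dicts and lists."""
    found = []
    if isinstance(obj, dict):
        for k, v in obj.items():
            if k == key:
                found.append(v)
            else:
                found.extend(_finditems(v, key))
    elif isinstance(obj, list):
        # The entries in your data sit inside the "value" list, so lists must be searched too.
        for item in obj:
            found.extend(_finditems(item, key))
    return found

def get_urls(file_name):
    try:
        with open(file_name) as file:
            data = json.load(file)
    except FileNotFoundError:
        return None

    return _finditems(data, 'hostPageDisplayUrl')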
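
The validity check mentioned above can also be done in Python before parsing the file. This is only a minimal sketch; the helper name check_json is hypothetical and not part of the original answer. It relies on json.loads raising ValueError (json.JSONDecodeError is a subclass) when the text is malformed, which is what happens with a truncated excerpt like the one in the question.

```python
import json

def check_json(text):
    """Return None if `text` parses as JSON, else a short description of the first error."""
    try:
        json.loads(text)
    except ValueError as err:  # json.JSONDecodeError subclasses ValueError
        return str(err)
    return None

# A complete document passes; a truncated one yields an error message.
print(check_json('{"value": [{"hostPageDisplayUrl": "panoramio.com/photo/84118355"}]}'))
print(check_json('{"thumbnail": {"width": 300, "height": 20'))
```

If the check reports an error, the JSON-based approach will fail on that file, while a regular expression can still pull values out of the raw text.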

Alternatively, use a regular expression:

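import re

def find_urls(text):
    # Capture the quoted value that follows the "hostPageDisplayUrl" key.
    pattern = r'\"hostPageDisplayUrl\":\s*"([^"]*)"'
    return re.findall(pattern, text)

print(find_urls(test))  # `test` is the content of your text file as a string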

Result for your sample:
['wikimapia.org/1649944/Bahawalpur-Railway-Station', 'panoramio.com/photo/84118355', 'wikimapia.org/1649944/Bahawalpur-Railway-Station', 'nativepakistan.com/photos-of-bahawalpur']

Warning: this only works if your URLs do not contain (escaped) double quotes "!
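
If escaped quotes might occur, one possible workaround is a pattern that skips over backslash escapes and then decodes the match as a JSON string. This is a sketch under that assumption, not part of the original answer; the names PATTERN and find_urls_with_escapes and the sample string are made up for illustration.

```python
import json
import re

# Matches the key, then a JSON string body that may contain
# backslash escapes such as \" or \uXXXX.
PATTERN = re.compile(r'"hostPageDisplayUrl":\s*"((?:[^"\\]|\\.)*)"')

def find_urls_with_escapes(text):
    # Re-quoting the raw match and passing it to json.loads turns
    # escapes like \" back into the plain characters they stand for.
    return [json.loads('"' + raw + '"') for raw in PATTERN.findall(text)]

# Hypothetical input containing an escaped double quote inside the value.
sample = r'{"hostPageDisplayUrl": "example.com/a\"b", "name": "x"}'
print(find_urls_with_escapes(sample))
```

For values without escapes this returns the same strings as the simpler pattern.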

Edit: for the base URLs:

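import re

def find_urls(text):
    # Capture the quoted value that follows the "hostPageDisplayUrl" key.
    pattern = r'\"hostPageDisplayUrl\":\s*"([^"]*)"'
    return re.findall(pattern, text)

def base_url(url):
    # Group 3 is everything before the first "/", after an optional scheme and "www.".
    # .group(3) also works on Python versions where match objects are not subscriptable.
    return re.search(r'(https?://)?(www\.)?([^/]*)', url).group(3)

print([base_url(u) for u in find_urls(test)])  # `test` is the file content as a string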

Result for your sample:
['wikimapia.org', 'panoramio.com', 'wikimapia.org', 'nativepakistan.com']

Regex explanation

\"hostPageDisplayUrl\":\s*"([^"]*)"

  

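We search for a string enclosed in a leading and trailing " and capture its contents in a group: "([^"]*)"
Before it, separated by any amount of whitespace \s*, we require the literal string "hostPageDisplayUrl":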

(https?://)?(www\.)?([^/]*)

  

Ignoring any leading http(s)://www., we want the part of the URL before the first / and capture it in a group: ([^/]*)
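
As an alternative to the second pattern, the standard library's URL parser can extract the host. This is a minimal sketch assuming Python 3 (urllib.parse); the helper name base_url_via_urlsplit is hypothetical. Because the display URLs carry no scheme, the sketch prefixes // so that the parser treats the first component as the host.

```python
from urllib.parse import urlsplit

def base_url_via_urlsplit(url):
    # Without a scheme, urlsplit would put the host into the path,
    # so mark the start of the network location explicitly.
    if not url.startswith(("http://", "https://")):
        url = "//" + url
    host = urlsplit(url).netloc
    # Drop a leading "www." to mirror the optional (www\.)? group.
    return host[4:] if host.startswith("www.") else host

print(base_url_via_urlsplit("wikimapia.org/1649944/Bahawalpur-Railway-Station"))
print(base_url_via_urlsplit("https://www.bing.com/images/search"))
```

Unlike the regex, netloc keeps any port or user info that appears in the URL, so the two approaches can differ on such inputs.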

Answer 1 (score: 0)

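From your comments I understand that the file contains JSON saved as a text file. So you can load the JSON data directly from the text file and read the values. Your code should look like this: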

import json

json_data = json.loads(open("json_file.txt").read())
for data in json_data["value"]:  # the entries live in the top-level "value" list
    print(data["hostPageDisplayUrl"])  # this will print all the URLs

I posted this because the language lets you do this more efficiently and in fewer lines of code.