Cleaning up a link URL string in Python

Time: 2015-03-07 20:14:09

Tags: python regex web-scraping beautifulsoup python-requests

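So I have Beautiful Soup code that visits the main page of a website and scrapes the links there. However, once I have a link in Python, I can't seem to clean it up (after it has been converted to a string) so that it can be concatenated with the root URL.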

import re
import requests
import bs4

list1=[]

def get_links():

    regex3= re.compile(r'/[a-z\-]+/[a-z\-]+')
    response = requests.get('http://noisetrade.com')
    soup = bs4.BeautifulSoup(response.text)
    links=  soup.select('div.grid_info  a[href]')
    for link in links:
       lk= link.get('href')
       prtLk= regex3.findall(lk)
       list1.append(prtLk)


def visit_pages():
    url1=str(list1[1])
    print(url1)

get_links()
visit_pages()

Output produced: "['/stevevantinemusic/unsolicited-material']"

Desired output: "/stevevantinemusic/unsolicited-material"

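I have tried .strip() and .replace() as well as re.sub/match/etc. I can't seem to isolate the characters [, ' and ], which are the ones I need to remove. I was going to iterate through the string with substrings, but that feels inefficient. I'm sure I'm missing something obvious.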

3 Answers:

Answer 0: (score: 1)

findall returns a list of results, so you can write:

for link in links:
    lk = link.get('href')    
    urls = regex3.findall(lk)   
    if urls:
        prtLk = urls[0]
        list1.append(prtLk)

Or better, use the search method, which returns a single match object (or None) instead of a list:

for link in links:
    lk = link.get('href')    
    m = regex3.search(lk)
    if m:
        prtLk = m.group()
        list1.append(prtLk)
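
The difference between the two approaches can be checked without touching the network. The following is a minimal sketch, assuming Python 3 and a couple of made-up href values in place of the scraped ones; `sample_hrefs` and `extract_paths` are illustrative names, not part of the original code:

```python
import re

# Same pattern as in the question, written as a raw string.
regex3 = re.compile(r'/[a-z\-]+/[a-z\-]+')

def extract_paths(hrefs):
    """Return the first two-segment path found in each href, as plain strings."""
    paths = []
    for href in hrefs:
        m = regex3.search(href)      # a match object, or None if nothing matches
        if m:
            paths.append(m.group())  # the matched text itself, not a list
    return paths

# Hypothetical hrefs standing in for the scraped links.
sample_hrefs = ['/stevevantinemusic', '/stevevantinemusic/unsolicited-material']

paths = extract_paths(sample_hrefs)
print(paths)                             # a flat list of strings
print(regex3.findall(sample_hrefs[1]))   # findall wraps the same text in a list
```

Because each entry in `paths` is already a string, it can be concatenated with the root URL directly and no bracket or quote stripping is needed.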

Those brackets are the result of converting a list containing one element to a string. For example:

l = ['text']
str(l)

Result:

"['text']"

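Applied to the value from the question, the same effect looks like this. This is only a sketch, assuming the list below is what findall returned for the second link; the variable names are illustrative:

```python
# Assumed to be what regex3.findall(lk) returned for the second link.
matches = ['/stevevantinemusic/unsolicited-material']

as_text = str(matches)   # the repr of the whole list: brackets and quotes included
first = matches[0]       # the element itself, a plain string

root_url = 'http://noisetrade.com'
print(as_text)
print(root_url + first)
```

Indexing into the list (or never building a list in the first place) avoids having to strip characters from the string afterwards.
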
Answer 1: (score: 0)

Here I use the regexp r'[\[\'\]]' to replace any unwanted characters with an empty string:

$ cat pw.py
import re

def visit_pages():
    url1="['/stevevantinemusic/unsolicited-material']"
    url1 = re.sub(r'[\[\'\]]','',url1)
    print(url1)

visit_pages()

$ python pw.py
/stevevantinemusic/unsolicited-material

Answer 2: (score: 0)

Here is an example of what I think you are trying to do:

>>> import bs4
>>> with open('noise.html', 'r') as f:
...     lines = f.read()
... 
>>> soup = bs4.BeautifulSoup(lines)
>>> root_url = 'http://noisetrade.com'
>>> for link in soup.select('div.grid_info a[href]'):
...     print(root_url + link.get('href'))
... 
http://noisetrade.com/stevevantinemusic
http://noisetrade.com/stevevantinemusic/unsolicited-material
http://noisetrade.com/jessicarotter
http://noisetrade.com/jessicarotter/winter-sun
http://noisetrade.com/geographermusic
http://noisetrade.com/geographermusic/live-from-the-el-rey-theatre
http://noisetrade.com/kaleo
http://noisetrade.com/kaleo/all-the-pretty-girls-ep
http://noisetrade.com/aviddancer
http://noisetrade.com/aviddancer/an-introduction
http://noisetrade.com/thinkr
http://noisetrade.com/thinkr/quiet-kids-ep
http://noisetrade.com/timcaffeemusic
http://noisetrade.com/timcaffeemusic/from-conversations
http://noisetrade.com/pearl
http://noisetrade.com/pearl/hello
http://noisetrade.com/staceyrandolmusic
http://noisetrade.com/staceyrandolmusic/fables-noisetrade-sampler
http://noisetrade.com/sleepyholler
http://noisetrade.com/sleepyholler/sleepy-holler
http://noisetrade.com/sarahmcgowanmusic
http://noisetrade.com/sarahmcgowanmusic/indian-summer
http://noisetrade.com/briandunne
http://noisetrade.com/briandunne/songs-from-the-hive
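
Plain string concatenation works here because every href starts with a slash. As a hedged alternative, assuming Python 3, the standard library's `urllib.parse.urljoin` also copes with hrefs that are already absolute. The `absolute_links` helper and the sample hrefs below are illustrative, not taken from the page:

```python
from urllib.parse import urljoin

root_url = 'http://noisetrade.com'

def absolute_links(hrefs, base=root_url):
    """Resolve each href against the base URL."""
    return [urljoin(base, href) for href in hrefs]

# Hypothetical hrefs: two relative paths and one already-absolute URL.
sample_hrefs = [
    '/stevevantinemusic',
    '/stevevantinemusic/unsolicited-material',
    'http://noisetrade.com/kaleo',
]

for url in absolute_links(sample_hrefs):
    print(url)
```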

Keep in mind that bs4 has its own types as well.

A good way to debug a script is to put a breakpoint in it:

for link in links:
   import pdb;pdb.set_trace() # the script will stop for debugging here
   lk= link.get('href')
   prtLk= regex3.findall(lk)
   list1.append(prtLk)

You can put this line anywhere you want to start debugging.

Then you can run commands like these in pdb:

next
l
print(type(lk))
print(links)
dir()
dir(links)
dir(lk)