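I have some Beautiful Soup code that visits a site's home page and scrapes the links there. However, once I have a link in Python, I can't seem to clean it up (after converting it to a string) so that I can concatenate it with the root URL.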
import re
import requests
import bs4

list1 = []

def get_links():
    regex3 = re.compile(r'/[a-z\-]+/[a-z\-]+')
    response = requests.get('http://noisetrade.com')
    soup = bs4.BeautifulSoup(response.text)
    links = soup.select('div.grid_info a[href]')
    for link in links:
        lk = link.get('href')
        prtLk = regex3.findall(lk)
        list1.append(prtLk)

def visit_pages():
    url1 = str(list1[1])
    print(url1)

get_links()
visit_pages()
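Output produced: "['/stevevantinemusic/unsolicited-material']"
Desired output: "/stevevantinemusic/unsolicited-material"
I've tried .strip() and .replace(), as well as re.sub, re.match and so on. I can't seem to isolate the characters [, ' and ], which are the ones I need to remove. I got it working by iterating over the string with substrings, but that feels inefficient. I'm sure I'm missing something obvious.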
Answer 0 (score: 1)
findall returns a list of results, so you can write:
for link in links:
    lk = link.get('href')
    urls = regex3.findall(lk)
    if urls:
        prtLk = urls[0]
        list1.append(prtLk)
Or better, use the search method:
for link in links:
    lk = link.get('href')
    m = regex3.search(lk)
    if m:
        prtLk = m.group()
        list1.append(prtLk)
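To see why the `if m:` guard matters, here is a small sketch that runs the question's pattern over a few hrefs without touching the network. The `hrefs` list is made up for illustration; only `regex3` comes from the question.

```python
import re

regex3 = re.compile(r'/[a-z\-]+/[a-z\-]+')  # pattern from the question

# Hypothetical sample hrefs: one with two path segments, one with a
# single segment, and an empty string.
hrefs = ['/stevevantinemusic/unsolicited-material', '/kaleo', '']

list1 = []
for lk in hrefs:
    m = regex3.search(lk)      # a match object, or None when nothing matches
    if m:
        list1.append(m.group())  # m.group() is a plain str

no_match = regex3.findall('/kaleo')  # findall gives an empty list instead of None

print(list1)
print(no_match)
```

With search you store strings directly, so nothing needs to be stripped later when you build the full URL.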
Those brackets are the result of converting a list that contains one element to a string. For example:
l = ['text']
str(l)
Result:
"['text']"
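Applied to the question's data, the difference looks roughly like this. It is only a sketch: `root_url`, `wrong` and `right` are names picked for the example, and `prtLk` is hard-coded to have the same shape as what findall returns.

```python
root_url = 'http://noisetrade.com'  # root URL from the question

# Same shape as the value findall returns: a list holding one string.
prtLk = ['/stevevantinemusic/unsolicited-material']

wrong = root_url + str(prtLk)  # str() of a list keeps the brackets and quotes
right = root_url + prtLk[0]    # indexing gives the string itself

print(wrong)
print(right)
```

So the fix is to take the element out of the list rather than to clean up the list's string form afterwards.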
Answer 1 (score: 0)
Here I use the regexp r'[\[\'\]]' to replace any unwanted characters with an empty string:
$ cat pw.py
import re
def visit_pages():
    url1 = "['/stevevantinemusic/unsolicited-material']"
    url1 = re.sub(r'[\[\'\]]', '', url1)
    print(url1)

visit_pages()
$ python pw.py
/stevevantinemusic/unsolicited-material
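If all you have is the string form of the list, a different option (not what this answer does) is to parse it back into a list with ast.literal_eval from the standard library and then index it. A minimal sketch, assuming the string really is a valid Python list literal; `text` and `items` are names chosen for the example:

```python
import ast

text = "['/stevevantinemusic/unsolicited-material']"

# literal_eval parses Python literals only; it does not run arbitrary code.
items = ast.literal_eval(text)
url1 = items[0]

print(url1)
```

This avoids accidentally deleting a quote or bracket that is a legitimate part of a path, which the character-class substitution would also remove.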
Answer 2 (score: 0)
Here is an example of what I think you are trying to do:
>>> import bs4
>>> with open('noise.html', 'r') as f:
...     lines = f.read()
...
>>> soup = bs4.BeautifulSoup(lines)
>>> root_url = 'http://noisetrade.com'
>>> for link in soup.select('div.grid_info a[href]'):
...     print(root_url + link.get('href'))
...
http://noisetrade.com/stevevantinemusic
http://noisetrade.com/stevevantinemusic/unsolicited-material
http://noisetrade.com/jessicarotter
http://noisetrade.com/jessicarotter/winter-sun
http://noisetrade.com/geographermusic
http://noisetrade.com/geographermusic/live-from-the-el-rey-theatre
http://noisetrade.com/kaleo
http://noisetrade.com/kaleo/all-the-pretty-girls-ep
http://noisetrade.com/aviddancer
http://noisetrade.com/aviddancer/an-introduction
http://noisetrade.com/thinkr
http://noisetrade.com/thinkr/quiet-kids-ep
http://noisetrade.com/timcaffeemusic
http://noisetrade.com/timcaffeemusic/from-conversations
http://noisetrade.com/pearl
http://noisetrade.com/pearl/hello
http://noisetrade.com/staceyrandolmusic
http://noisetrade.com/staceyrandolmusic/fables-noisetrade-sampler
http://noisetrade.com/sleepyholler
http://noisetrade.com/sleepyholler/sleepy-holler
http://noisetrade.com/sarahmcgowanmusic
http://noisetrade.com/sarahmcgowanmusic/indian-summer
http://noisetrade.com/briandunne
http://noisetrade.com/briandunne/songs-from-the-hive
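Plain concatenation works here because every href starts with a slash. If some hrefs might already be absolute URLs, urljoin from the standard library is one way to handle both cases. A rough sketch; the `hrefs` list is invented for the example, and the example.com link is hypothetical:

```python
from urllib.parse import urljoin

root_url = 'http://noisetrade.com'

# Hypothetical hrefs: a site-relative path and an already-absolute URL.
hrefs = ['/kaleo/all-the-pretty-girls-ep', 'http://example.com/other']

# urljoin resolves relative paths against root_url and leaves absolute URLs alone.
full_urls = [urljoin(root_url, href) for href in hrefs]

for url in full_urls:
    print(url)
```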
Keep in mind that bs4 has its own types as well.
A good way to debug a script is to put:
for link in links:
    import pdb; pdb.set_trace()  # the script will stop for debugging here
    lk = link.get('href')
    prtLk = regex3.findall(lk)
    list1.append(prtLk)
wherever you want to debug.
Then you can do things like this in pdb:
next
l
print(type(lk))
print(links)
dir()
dir(links)
dir(lk)