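I have code that lets me extract links from certain news sites. I only want to show links that contain the name of the city, Gdańsk. However, the URLs do not always use the correct spelling, so I need to cover gdańsk, gdansk, and so on. I also want to extract links from other sites. I can add more words and sites, but that means writing more and more for loops. Could you show me how to make the code shorter and more efficient?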
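Second question: I export the links I get to a CSV file. I want to collect them there so that I can analyze them later. I found out that if I replace "w" with "a" in csv = open(plik, "a"), it should append to the file. Instead, nothing happens. When it is just "w", it overwrites the file, but that is what I need.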
import requests
from bs4 import BeautifulSoup as bs
from datetime import datetime
def data(timedateformat='complete'):
    formatdaty = timedateformat.lower()
    if timedateformat == 'rokmscdz':
        return (str(datetime.now())).split(' ')[0]
    elif timedateformat == 'dzmscrok':
        return ((str(datetime.now())).split(' ')[0]).split('-')[2] + '-' + ((str(datetime.now())).split(' ')[0]).split('-')[1] + '-' + ((str(datetime.now())).split(' ')[0]).split('-')[0]
a = requests.get('http://www.dziennikbaltycki.pl')
b = requests.get('http://www.trojmiasto.pl')
zupa = bs(a.content, 'lxml')
zupka = bs(b.content, 'lxml')
rezultaty1 = [item['href'] for item in zupa.select(" [href*='Gdansk']")]
rezultaty2 = [item['href'] for item in zupa.select("[href*='gdansk']")]
rezultaty3 = [item['href'] for item in zupa.select("[href*='Gdańsk']")]
rezultaty4 = [item['href'] for item in zupa.select("[href*='gdańsk']")]
rezultaty5 = [item['href'] for item in zupka.select("[href*='Gdansk']")]
rezultaty6 = [item['href'] for item in zupka.select("[href*='gdansk']")]
rezultaty7 = [item['href'] for item in zupka.select("[href*='Gdańsk']")]
rezultaty8 = [item['href'] for item in zupka.select("[href*='gdańsk']")]
s = set()
plik = "dupa.csv"
csv = open(plik,"a")
for item in rezultaty1:
    s.add(item)
for item in rezultaty2:
    s.add(item)
for item in rezultaty3:
    s.add(item)
for item in rezultaty4:
    s.add(item)
for item in rezultaty5:
    s.add(item)
for item in rezultaty6:
    s.add(item)
for item in rezultaty7:
    s.add(item)
for item in rezultaty8:
    s.add(item)
for item in s:
    print('Data wpisu: ' + data('dzmscrok'))
    print('Link: ' + item)
    print('\n')
    csv.write('Data wpisu: ' + data('dzmscrok') + '\n')
    csv.write(item + '\n'+'\n')
Answer 0 (score: 0)
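Ideally, to improve performance and trim the code so that you do not loop several times, you could parse the page results and normalize them by replacing all special characters with their ASCII equivalents (Replacing special characters with ASCII equivalent).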
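As a rough illustration of that normalization idea, here is a minimal sketch that uses only the standard library. The function names `to_ascii_lower` and `filter_links` and the sample hrefs are hypothetical, made up for this example; they are not part of the original code.

```python
import unicodedata


def to_ascii_lower(text):
    """Lower-case text and strip accents, e.g. 'Gdańsk' becomes 'gdansk'."""
    # NFKD splits an accented letter into its base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks, keeping only the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


def filter_links(hrefs, keyword="gdansk"):
    """Return the unique hrefs whose normalized form contains the keyword."""
    return {href for href in hrefs if keyword in to_ascii_lower(href)}


# Hypothetical hrefs, for illustration only.
sample = [
    "/wiadomosci/Gdańsk-nowy-most",
    "/sport/gdansk-lechia",
    "/wiadomosci/Gdansk-port",
    "/wiadomosci/gdynia-port",
]
print(sorted(filter_links(sample)))
```

With this approach a single keyword covers every spelling, so the list of variations is no longer needed. Two caveats: letters without a Unicode decomposition, such as ł, pass through unchanged, and percent-encoded hrefs would probably need to be decoded (for example with `urllib.parse.unquote`) before normalizing.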
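You can avoid the repetition by changing your code to loop over the variations of Gdansk and then merging the results into a single set. I have modified your code below and split it into a few functions.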
import requests
from bs4 import BeautifulSoup as bs
from datetime import datetime
def extract_links(content):
    # Return a list of hrefs that mention any variation of the city Gdansk
    variations = ['Gdansk', 'gdansk', 'Gdańsk', 'gdańsk']
    result = []
    for x in variations:
        # Quote the attribute value so the selector stays valid for any spelling
        result = [*result, *[item['href'] for item in content.select(f"[href*='{x}']")]]
    return result
def data(timedateformat='complete'):
    # Return today's date as YYYY-MM-DD ('rokmscdz') or DD-MM-YYYY ('dzmscrok');
    # any other value returns None
    formatdaty = timedateformat.lower()
    if formatdaty == 'rokmscdz':
        return datetime.now().strftime('%Y-%m-%d')
    elif formatdaty == 'dzmscrok':
        return datetime.now().strftime('%d-%m-%Y')
def get_links_from_urls(*urls):
    # Request webpages then loop over the results to
    # create a set of links that we will write to our file.
    result = []
    for rv in [requests.get(url) for url in urls]:
        zupa = bs(rv.content, 'lxml')
        result = [*result, *extract_links(zupa)]
    return set(result)
def main():
    # Use Python's context manager to open 'dupa.csv' in append mode and write out the CSV rows
    plik = "dupa.csv"
    with open(plik, 'a') as f:
        for item in get_links_from_urls('http://www.dziennikbaltycki.pl', 'http://www.trojmiasto.pl'):
            print('Data wpisu: ' + data('dzmscrok'))
            print('Link: ' + item)
            print('\n')
            f.write(f'Data wpisu: {data("dzmscrok")},{item}\n')

main()
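On the second question: the script in the question never closes the file, so unflushed writes are one possible cause worth ruling out (this is a guess, not something confirmed in the question). The sketch below shows the append step on its own, using the standard `csv` module inside a `with` block. The function name `append_links` and the two-column date/link layout are assumptions made for illustration.

```python
import csv
from datetime import datetime


def append_links(path, links):
    """Append one (date, link) row per link to the CSV file at path."""
    today = datetime.now().strftime("%d-%m-%Y")
    # Mode "a" adds to the end of the file instead of overwriting it;
    # newline="" is what the csv module expects when opening a file.
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for link in sorted(links):
            writer.writerow([today, link])
    # Leaving the with-block closes the file, which flushes the rows to disk.
```

Calling `append_links("dupa.csv", get_links_from_urls(...))` on successive runs should then grow the file rather than replace it. Links that were already saved by an earlier run would be written again, so de-duplicating against the existing file is left as a separate step.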
Hope this helps. If you have any questions, let me know in the comments.