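I have a text file containing the URLs of some RSS feeds. I want to find out which URLs have a title or description (or any other tag) that contains certain strings (or a list of words).
So far I am able to get the URL, the title, and the headline (among others), but I am not sure how to continue. I think I would check the tags with a regular expression. If I check a URL's title and find a word match, how do I get back to that URL? The URL needs to stay linked to the tags, e.g. in a .csv. I am a bit confused here. Maybe someone can point me in the right direction?
My approach so far: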
import requests
from bs4 import BeautifulSoup

rssfeed = open('input.txt')
rss_source = rssfeed.read()
rss_sources = rss_source.split()
i=0
while i<len(rss_sources):
    get_rss = requests.get(rss_sources[i])
    rss_soup = BeautifulSoup(get_rss.text, 'html.parser')
    rss_urls = rss_soup.find_all('link')
    i=i+1
    for url in rss_urls:
        rss_all_urls = url.text
        open_urls = requests.get(rss_all_urls)
        target_urls_soup = BeautifulSoup(open_urls.text, 'html.parser')
        urls_titles = target_urls_soup.title
        urls_headlines = target_urls_soup.h1
        print (rss_all_urls, urls_titles, urls_headlines)
Answer 0 (score: 0)
So you want an array of URLs that contains only the URLs meeting a condition: the title of the page behind the URL matches one of the strings in a given array.
So first you need your arrays:
titlesToMatch = ['title1', 'title2', 'title3']
finalArrayWithURLs = []
Then, once you have rss_all_urls, urls_titles, and urls_headlines for a URL, add to finalArrayWithURLs only those URLs whose title matches one of the strings in titlesToMatch:
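for url in rss_urls:
    rss_all_urls = url.text
    open_urls = requests.get(rss_all_urls)
    target_urls_soup = BeautifulSoup(open_urls.text, 'html.parser')
    urls_titles = target_urls_soup.title
    urls_headlines = target_urls_soup.h1
    # .title is a Tag (or None), so compare against its text, not the Tag itself
    title_text = urls_titles.get_text() if urls_titles else ''
    if any(item in title_text for item in titlesToMatch):
        # Python lists use append(), not push(); store the URL string
        finalArrayWithURLs.append(rss_all_urls)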
After that, finalArrayWithURLs contains only the URLs whose title matches one of the strings in your titlesToMatch array.
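The question also asks how to keep each URL tied to the tag that matched, for example in a .csv. One possible way, sketched below, is to read the title and description straight from the feed's own <item> elements and store the link next to every match. This is only an illustration under some assumptions: the feeds are plain RSS 2.0 without XML namespaces, matching is a case-insensitive substring test rather than a regular expression, and the names find_matching_items, write_matches_csv, and SAMPLE_FEED are hypothetical and do not come from the question.

```python
import csv
import io
import xml.etree.ElementTree as ET


def find_matching_items(feed_xml, keywords, tags=("title", "description")):
    """Return one dict per (item, tag) pair whose text mentions a keyword."""
    lowered = [k.lower() for k in keywords]
    root = ET.fromstring(feed_xml)
    matches = []
    # Assumes RSS 2.0: every entry is an <item> with <link>, <title>, <description>
    for item in root.iter("item"):
        link = (item.findtext("link") or "").strip()
        for tag in tags:
            text = (item.findtext(tag) or "").strip()
            hits = [k for k in lowered if k in text.lower()]
            if hits:
                # Keep the URL in the same record as the tag that matched
                matches.append({"url": link, "tag": tag,
                                "keywords": " ".join(hits), "text": text})
    return matches


def write_matches_csv(matches, out_file):
    """Write the match records as CSV rows to an open text file object."""
    writer = csv.DictWriter(out_file, fieldnames=["url", "tag", "keywords", "text"])
    writer.writeheader()
    writer.writerows(matches)


# Made-up sample feed so the sketch runs without network access
SAMPLE_FEED = """<rss version="2.0"><channel>
<title>Example feed</title>
<link>https://example.com/</link>
<item><title>Python 3 released</title><link>https://example.com/a</link>
<description>News about the language</description></item>
<item><title>Gardening tips</title><link>https://example.com/b</link>
<description>How to grow tomatoes</description></item>
</channel></rss>"""

matches = find_matching_items(SAMPLE_FEED, ["python", "tomatoes"])
buffer = io.StringIO()
write_matches_csv(matches, buffer)
print(buffer.getvalue())
```

For a real feed you could pass requests.get(feed_url).content instead of SAMPLE_FEED, and an opened file (open('matches.csv', 'w', newline='')) instead of the StringIO buffer. Atom feeds use namespaced <entry> elements, so they would need a different lookup.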