Question

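I have a text file containing the URLs of some RSS feeds. I want to find out which URLs have a title or description (or any other tag) that contains certain strings (or any word from a list).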

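So far I can get the URLs, the titles and the headlines (among other things), but I'm not sure how to continue. I think I would check the tags with a regular expression. If I check a URL's title and find a word match, how do I then get back to that URL? The URL needs to stay connected to its tags, for example in a .csv file. I'm a bit confused here. Maybe someone can point me in the right direction?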

My approach so far:

import requests
from bs4 import BeautifulSoup

rssfeed = open('input.txt')
rss_source = rssfeed.read()
rss_sources = rss_source.split()

i=0
while i<len(rss_sources):
    get_rss = requests.get(rss_sources[i])
    rss_soup = BeautifulSoup(get_rss.text, 'html.parser')
    rss_urls = rss_soup.find_all('link')
    i=i+1

    for url in rss_urls:
        rss_all_urls = url.text
        open_urls = requests.get(rss_all_urls)
        target_urls_soup = BeautifulSoup(open_urls.text, 'html.parser')
        urls_titles = target_urls_soup.title
        urls_headlines = target_urls_soup.h1
        print (rss_all_urls, urls_titles, urls_headlines)

Answer 1

So you want a list of URLs, and a URL should end up in that list only if it meets a condition: its title matches one of the strings in another list.

First you need the two lists:

titlesToMatch = ['title1', 'title2', 'title3']
finalArrayWithURLs = []

Then, once you have rss_all_urls, urls_titles and urls_headlines for a URL, add to finalArrayWithURLs only the URLs whose title matches one of the entries in titlesToMatch:

for url in rss_urls:
    rss_all_urls = url.text
    open_urls = requests.get(rss_all_urls)
    target_urls_soup = BeautifulSoup(open_urls.text, 'html.parser')
    urls_titles = target_urls_soup.title
    urls_headlines = target_urls_soup.h1

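    # .title is a Tag (or None if the page has no <title>), so compare against its text
    title_text = urls_titles.get_text() if urls_titles else ''
    if any(item in title_text for item in titlesToMatch):
        # Python lists have append(), not push(); store the URL string, not the Tag
        finalArrayWithURLs.append(rss_all_urls)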
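
The title check above is a plain containment test, so it is case-sensitive and would also fire on partial words. If the regex idea from the question is preferred, one possible sketch is to compile a single case-insensitive, whole-word pattern from the word list. The helper names build_matcher and matched_words and the sample words are made up for illustration; they are not part of the original code.

```python
import re

def build_matcher(words):
    """Compile one case-insensitive pattern matching any of the words as a whole word."""
    if not words:
        raise ValueError("need at least one word to match")
    # re.escape keeps characters like '+' or '.' in a word from acting as regex syntax
    alternatives = "|".join(re.escape(w) for w in words)
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)

def matched_words(text, matcher):
    """Return the distinct words found in text, lowercased, in order of first appearance."""
    found = []
    for match in matcher.finditer(text or ""):
        word = match.group(0).lower()
        if word not in found:
            found.append(word)
    return found

matcher = build_matcher(["python", "rss feed"])
print(matched_words("Parsing an RSS feed with Python", matcher))
```

In the loop, the condition could then become `if matched_words(title_text, matcher):`, where title_text is assumed to hold the text of the page title.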

After that, finalArrayWithURLs contains only the URLs whose title matches one of the entries in titlesToMatch.
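
The question also asks how to keep a URL connected to its tags, for example in a .csv file. A minimal sketch of one way to do that is to build one row per feed item, so the link, title and description travel together, and to check the feed's own title and description instead of fetching every linked page. This assumes plain RSS 2.0 feeds with un-namespaced item elements (Atom or RDF feeds would need different tag names), uses the standard library parser rather than BeautifulSoup, and the function names and sample feed are invented for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

FIELDS = ["feed", "link", "title", "description", "matched"]

def matching_items(feed_url, feed_xml, keywords):
    """Return one dict per <item> whose title or description contains a keyword."""
    rows = []
    root = ET.fromstring(feed_xml)
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        description = item.findtext("description", default="")
        link = item.findtext("link", default="")
        haystack = (title + " " + description).lower()
        hits = [k for k in keywords if k.lower() in haystack]
        if hits:
            # Keep the URL and the matched tags in the same row
            rows.append({"feed": feed_url, "link": link, "title": title,
                         "description": description, "matched": ";".join(hits)})
    return rows

def rows_to_csv(rows):
    """Serialise the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Hypothetical sample feed so the sketch runs without network access
SAMPLE = """<rss version="2.0"><channel>
<title>Example feed</title>
<item><title>Python 3 released</title><link>http://example.com/a</link><description>News</description></item>
<item><title>Other</title><link>http://example.com/b</link><description>Nothing here</description></item>
</channel></rss>"""

rows = matching_items("http://example.com/rss", SAMPLE, ["python"])
print(rows_to_csv(rows))
```

With real feeds, feed_xml would presumably come from something like `requests.get(feed_url).content` for each URL in input.txt, and the CSV text could be written to a file opened with `newline=''`.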

Get URL, title and description only if the title or description contains %string%
