Displaying all paragraphs of a scraped HTML div in Django

Time: 2019-04-14 08:37:57

Tags: python django beautifulsoup django-templates

I scraped an HTML div from a news site. This is the HTML:

<div class="cn-content">

<figure><img src="https://cimg.co/w/articles-attachments/1/5ca/71a090479e.jpg" sizes="(min-width: 640px) 720px, 100vw" srcset="https://cimg.co/w/articles-attachments/1/5ca/71a090479e.jpg 300w, https://cimg.co/w/articles-attachments/2/5ca/71a090479e.jpg 600w, https://cimg.co/w/articles-attachments/3/5ca/71a090479e.jpg 720w, https://cimg.co/w/articles-attachments/4/5ca/71a090479e.jpg 900w, https://cimg.co/w/articles-attachments/0/5ca/71a090479e.jpg 1337w" alt="OKEx Announced its First Token Sale via IEO 101" class="content-img"><figcaption>Source: iStock/baona</figcaption></figure>
<p>Major cryptocurrency exchange <b>OKEx</b> has announced an initial exchange offering (IEO) for the <b>BLOC</b> token, on their newly-presented OK Jumpstart token sale platform. The sale marks the first such endeavor of the exchange, joining the likes of <a href="https://cryptonews.com/ext/binance/" target="_blank" rel="nofollow noopener">Binance </a>and <a href="https://cryptonews.com/ext/bittrex/" target="_blank" rel="nofollow noopener">Bittrex </a>in the so-called killer app club.</p>
<p>The token in question is BLOC, native to the <b>Blockcloud</b> blockchain, and the sale is set to start at AM 12:00 UTC on April 10th. “Combining the advantages of blockchain and Future Internet technology, it reconstructs the technology layers below where current blockchain networks and Internet applications operate,” explains the project’s website. In short, it is a blockchain-based TCP/IP architecture, where TCP/IP is a suite of communication protocols used to interconnect network devices on the internet. </p>
<p>The token sale uses a subscription + allotment approach. Users will have a timeframe of 30 minutes to subscribe, and allotment will be based on the amount of the exchange’s native <a href="https://cryptonews.com/coins/okb/">OKB tokens</a> they hold over a seven-day period. The minimum threshold for a subscription is 500 OKB tokens (USD 1,145) held for those seven consecutive days, or buying in 3,500 OKB tokens on the last day - but to have their subscription guaranteed, users need to hold at least 2,500 OKB tokens daily or buy 17,500 OKB tokens on the final day before snapshot time.</p>
<p>The snapshots, which will be used to prove the users’ eligibility for participation, will be taken every day at AM 10:00 UTC, starting seven days before the token sale day. Then, users get their individual allotment coefficients based on the sum of OKB holdings in the moment of those snapshots. Users will have their individual subscription amounts in OKB locked up, and receive tokens based on a formula available on the OKEx blog. This formula bases the token allotment on both how many tokens users held during this period, as well as the amount of OKB they locked in as their subscription. </p>
<p>This move lets OKEx join the club of exchanges offering fundraising services. The latest example was Bittrex, where the token sale of <b>VeriBlock</b> tokens took a <a href="https://cryptonews.com/news/bittrex-beats-binance-in-selling-out-tokens-at-lightning-spe-3633.htm">mere 10 seconds</a>, beating even Binance’s speed of 22 seconds for the <b><a href="https://cryptonews.com/coins/fetch-ai/">Fetch.AI</a></b> token. Binance’s co-founder and CEO Changpeng Zhao coined the term “killer app” back in February, when he said in an interview that he views exchange-based fundraising as the next killer app.</p>
        </div>

In my model, I defined a property to clean this HTML so that only the paragraph text is displayed, like this:

@property
def description_clean(self):
    soup = BeautifulSoup(self.description)
    description = soup.find_all('div',attrs={"class":"cn-content"})
    for item in description:
        return item.find('p').text

However, when I use it in the template with {{ post.description_clean }}, only the first paragraph is rendered.

The output is:

  

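Major cryptocurrency exchange OKEx has announced an initial exchange offering (IEO) for the BLOC token, on their newly-presented OK Jumpstart token sale platform. The sale marks the first such endeavor of the exchange, joining the likes of Binance and Bittrex in the so-called killer app club.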

Why are the other paragraphs not displayed, given that I am looping over the results?

3 answers:

Answer 0 (score: 1)

You need:

# soup is the BeautifulSoup object built from the scraped HTML
main_div = soup.find('div', attrs={"class": "cn-content"})
paragraphs = main_div.find_all('p')
texts = []
for p in paragraphs:
    texts.append(p.text)  # save p text
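
As a self-contained illustration of this approach, here is a minimal sketch. The shortened `SAMPLE_HTML` and the function name `extract_paragraphs` are assumptions made for the example and are not part of the original code; the explicit `"html.parser"` argument is also an addition, chosen so the parser does not depend on what is installed.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the scraped markup (assumption: the real input
# has the same div/figure/p structure as in the question).
SAMPLE_HTML = """
<div class="cn-content">
<figure><figcaption>Source: iStock/baona</figcaption></figure>
<p>First <b>paragraph</b>.</p>
<p>Second paragraph.</p>
</div>
"""


def extract_paragraphs(html):
    """Return the text of every <p> inside the first div.cn-content."""
    soup = BeautifulSoup(html, "html.parser")
    main_div = soup.find("div", attrs={"class": "cn-content"})
    if main_div is None:
        # No matching div in the input: return an empty list instead of failing.
        return []
    # find_all returns every matching tag; find would return only the first.
    return [p.get_text() for p in main_div.find_all("p")]


paragraphs = extract_paragraphs(SAMPLE_HTML)
```

The figure caption is skipped because only `p` tags are collected.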

Answer 1 (score: 1)

After getting the div tag, you are not iterating over all of its p tags. Update your code to this:

@property
def description_clean(self):
    soup = BeautifulSoup(self.description)
    description = soup.find_all('div',attrs={"class":"cn-content"})
    p_tags = []  # result list
    for item in description:
        individual_p_tags = []  # preserve each individual "div"
        for p in item.find_all('p'):  # loop over all the "p" tags in each "div"
            individual_p_tags.append(p.text)  # append to a temp list
        p_tags.append("\n".join(individual_p_tags)) # convert the list to a string and append to the result list
    return p_tags  # this is a list of strings
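
The reason the original property stops after one paragraph can be shown without BeautifulSoup: a `return` inside a `for` loop ends the function during the first iteration, so the remaining items are never visited. (In the question, `item.find('p')` additionally returns only the first match.) A minimal plain-Python sketch, with function names made up for illustration:

```python
def first_only(items):
    """Mimics the question's loop: returns during the first iteration."""
    for item in items:
        return item  # the function exits here; later items are never seen
    return None  # reached only when items is empty


def collect_all(items):
    """Mimics the fix: accumulate inside the loop, return after it."""
    collected = []
    for item in items:
        collected.append(item)
    return collected  # runs once, after the loop has finished


sample = ["p1", "p2", "p3"]
first = first_only(sample)
everything = collect_all(sample)
```

Note that the updated property returns a list, so the template would need to loop over it or the list would need to be joined into one string first.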

Answer 2 (score: 0)

You can return a list of paragraphs:

description = [item.text for item in soup.select('div.cn-content p')]

and then

return description
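
A sketch of the selector-based variant, for comparison. The descendant selector `div.cn-content p` matches each paragraph, whereas `div.cn-content` on its own matches the div, so its text also includes the figure caption. The function name and the short `html` sample are assumptions made for the example.

```python
from bs4 import BeautifulSoup


def paragraphs_via_selector(html):
    """Return the stripped text of each <p> nested inside div.cn-content."""
    soup = BeautifulSoup(html, "html.parser")
    # "div.cn-content p" is a descendant selector: every p inside the div.
    return [p.get_text().strip() for p in soup.select("div.cn-content p")]


# Hypothetical, shortened input with the same structure as the question's div.
html = (
    '<div class="cn-content">'
    "<figure><figcaption>Source: iStock/baona</figcaption></figure>"
    "<p>One.</p><p>Two. </p>"
    "</div>"
)

paragraph_texts = paragraphs_via_selector(html)

# For contrast: selecting only the div yields one string for the whole block.
whole_div_text = BeautifulSoup(html, "html.parser").select_one("div.cn-content").get_text()
```

`get_text().strip()` is used rather than `get_text(strip=True)` because the latter strips every text fragment separately and would glue words together around inline tags such as `<b>`.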