Text analysis with textstat raises IndexError: string index out of range

Asked: 2015-11-10 04:47:30

Tags: python web-scraping beautifulsoup indexoutofboundsexception

I am using a package called textstat to run some text analysis on text scraped from the web. In a number of cases I get this error:

IndexError: string index out of range

Code:

# Importing packages
import requests
from bs4 import BeautifulSoup as bfs
from textstat.textstat import textstat

# Fetching the page (the variable holds the HTTP response, not just the URL)
wikipedia_privacy_url = requests.get("https://wikimediafoundation.org/wiki/Privacy_policy")

# Parsing webpage content using Beautiful Soup
wikipedia_privacy_soup = bfs(wikipedia_privacy_url.content, "html.parser")

# Selecting every <div> with class "mw-body" (a tag/class lookup, not an XPath query)
wikipedia_privacy_text = wikipedia_privacy_soup.find_all("div", {"class": "mw-body"})

# Declare an empty string to concatenate the multiple extracted strings
wikipedia_privacy = ''

# Extracting text
for text in wikipedia_privacy_text:
    # Concatenating all the extracted text into one variable
    wikipedia_privacy = wikipedia_privacy + text.text

After that, I try to analyse the extracted text with a method from the textstat package.

textstat.smog_index(wikipedia_privacy)

This should return a float, but instead I get the error:

Traceback (most recent call last):
  File "/home/amin/Desktop/scrapy_project/text_stat.py", line 46, in text_pro
    SMOG_Index = textstat.smog_index(text)
  File "/home/amin/anaconda/lib/python2.7/site-packages/textstat/textstat.py", line 96, in smog_index
    poly_syllab = self.polysyllabcount(text)
  File "/home/amin/anaconda/lib/python2.7/site-packages/textstat/textstat.py", line 89, in polysyllabcount
    wrds = self.syllable_count(word)
  File "/home/amin/anaconda/lib/python2.7/site-packages/textstat/textstat.py", line 29, in syllable_count
    if text[0] in vowels:
IndexError: string index out of range
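
The last frame can be reproduced without textstat. This is a minimal sketch, assuming the word is lowercased and stripped of punctuation before `text[0]` is read; `first_char_is_vowel` and `fails_on` are hypothetical helpers written for illustration, not part of the library:

```python
VOWELS = "aeiouy"


def first_char_is_vowel(word):
    """Hypothetical helper mimicking the failing check `text[0] in vowels`."""
    text = word.lower().strip(".:;?!)(")
    # Indexing an empty string raises IndexError
    return text[0] in VOWELS


def fails_on(word):
    """Return True if the check above raises IndexError for this token."""
    try:
        first_char_is_vowel(word)
        return False
    except IndexError:
        return True


print(fails_on("apple"))  # an ordinary word
print(fails_on("(:"))     # a token made only of punctuation
```

Scraped pages often contain stray punctuation surrounded by whitespace, which would explain why the error shows up only for some inputs.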

Source of textstat.smog_index(text):

def smog_index(self, text):
    if self.sentence_count(text) >= 3:
        poly_syllab = self.polysyllabcount(text)
        # SMOG = 3.129 + round(poly_syllab**.5)
        SMOG = (1.043 * (30*(poly_syllab/self.sentence_count(text)))**.5) + 3.1291
        return round(SMOG, 1)
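
To make the arithmetic in the quoted function easier to check, here is a standalone sketch of the same formula that takes the two counts directly. `smog_from_counts` is a hypothetical name, not a textstat function. It uses explicit float division, on the assumption that both counts are integers, in which case `/` would truncate under the Python 2.7 interpreter shown in the traceback:

```python
import math


def smog_from_counts(polysyllable_count, sentence_count):
    """Hypothetical standalone version of the formula quoted above."""
    if sentence_count < 3:
        # Mirrors the quoted code, which falls through and returns None
        return None
    # float() forces true division on both Python 2 and Python 3
    ratio = float(polysyllable_count) / sentence_count
    return round(1.043 * math.sqrt(30 * ratio) + 3.1291, 1)


print(smog_from_counts(30, 30))
print(smog_from_counts(5, 2))
```

Note that for texts with fewer than three sentences the result is None rather than a float, so callers may want to handle that case.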

1 Answer:

Answer 0 (score: 0)

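It looks like syllable_count does not properly check whether the text is empty, so it raises an error when it tries to index into the string with text[0]. I added a few simple lines to check that the string is valid before it is used.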

Fix:

def syllable_count(self, text):
    count = 0
    vowels = 'aeiouy'
    # Guard against None before calling any string methods
    if text is None:
        return 0
    text = text.lower().strip(".:;?!)(")
    # Only index into the string if something is left after stripping
    if text != "":
        if text[0] in vowels:
            count += 1
        for index in range(1, len(text)):
            if text[index] in vowels and text[index-1] not in vowels:
                count += 1
        if text.endswith('e'):
            count -= 1
        if text.endswith('le'):
            count += 1
        if count == 0:
            count += 1
        count = count - (0.1*count)
    return round(count)
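
If patching the installed package is not an option, another approach is to clean the scraped text before handing it to textstat. The sketch below assumes the library splits the text on whitespace and strips the characters `.:;?!)(` from each token, as the fix above suggests. `merge_punctuation_only_tokens` is a hypothetical helper, not part of textstat. It glues stray punctuation onto the preceding word instead of deleting it, so sentence terminators are preserved:

```python
def merge_punctuation_only_tokens(text, strip_chars=".:;?!)("):
    """Hypothetical pre-cleaning helper; not part of textstat."""
    cleaned = []
    for token in text.split():
        if token.strip(strip_chars):
            # Token still has content after stripping: keep it as-is
            cleaned.append(token)
        elif cleaned:
            # Punctuation-only token: attach it to the previous word
            cleaned[-1] += token
        # Punctuation-only token with no previous word is dropped
    return " ".join(cleaned)


sample = "Privacy ( ) policy .\n\n  We collect : data ."
print(merge_punctuation_only_tokens(sample))
```

In the question's code this would be applied as `wikipedia_privacy = merge_punctuation_only_tokens(wikipedia_privacy)` before calling `textstat.smog_index(wikipedia_privacy)`.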