I'm using a package called text_stat to run some text analysis on text scraped from the web. In several cases I get this error:
IndexError: string index out of range
Code:
# Importing packages
import requests
from bs4 import BeautifulSoup as bfs
from textstat.textstat import textstat
# Declaring URL
wikipedia_privacy_url = requests.get("https://wikimediafoundation.org/wiki/Privacy_policy")
# Parsing webpage content using Beautiful Soup
wikipedia_privacy_soup = bfs(wikipedia_privacy_url.content, "html.parser")
# Extracting the desired xPath
wikipedia_privacy_text = wikipedia_privacy_soup.find_all("div", {"class": "mw-body"})
# Declare an empty string to concatenate the multiple extracted strings
wikipedia_privacy = ''
# Extracting text
for text in wikipedia_privacy_text:
    # Concatenating all the extracted text into one variable
    wikipedia_privacy = wikipedia_privacy + text.text
After that, I try to analyze the extracted text with methods from the text_stat package.
textstat.smog_index(wikipedia_privacy)
This should return a float, but I get this error instead:
Traceback (most recent call last):
  File "/home/amin/Desktop/scrapy_project/text_stat.py", line 46, in text_pro
    SMOG_Index = textstat.smog_index(text)
  File "/home/amin/anaconda/lib/python2.7/site-packages/textstat/textstat.py", line 96, in smog_index
    poly_syllab = self.polysyllabcount(text)
  File "/home/amin/anaconda/lib/python2.7/site-packages/textstat/textstat.py", line 89, in polysyllabcount
    wrds = self.syllable_count(word)
  File "/home/amin/anaconda/lib/python2.7/site-packages/textstat/textstat.py", line 29, in syllable_count
    if text[0] in vowels:
IndexError: string index out of range
The code for textstat.smog_index(text):
def smog_index(self, text):
    if self.sentence_count(text) >= 3:
        poly_syllab = self.polysyllabcount(text)
        # SMOG = 3.129 + round(poly_syllab**.5)
        SMOG = (1.043 * (30*(poly_syllab/self.sentence_count(text)))**.5) + 3.1291
        return round(SMOG, 1)
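Answer 0 (score: 0)
It looks like syllable_count does not check whether the text is empty. After the punctuation is stripped, a token can end up as an empty string, and indexing it with text[0] raises the IndexError. I added a simple check that the string is non-empty before it is indexed.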
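The failure can be reproduced without textstat. The following is a minimal sketch, assuming the scraped text contains a token made up only of characters in the strip set; the helper names `first_char_is_vowel` and `fails_on` are hypothetical and exist only for illustration:

```python
def first_char_is_vowel(word):
    """Mimic the first steps of syllable_count: strip punctuation, then index."""
    vowels = 'aeiouy'
    stripped = word.lower().strip(".:;?!)(")
    # When nothing is left after stripping, stripped[0] raises IndexError.
    return stripped[0] in vowels


def fails_on(word):
    """Return True if the unguarded lookup raises IndexError for this word."""
    try:
        first_char_is_vowel(word)
    except IndexError:
        return True
    return False
```

Calling `fails_on("(!)")` returns True because every character is stripped away, while an ordinary word such as "apple" passes through without an error.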
Fix:
def syllable_count(self, text):
    count = 0
    vowels = 'aeiouy'
    text = text.lower().strip(".:;?!)(")
    # Only count syllables when something is left after stripping punctuation;
    # an empty string would make text[0] raise IndexError.
    if text is not None and text != "":
        if text[0] in vowels:
            count += 1
        for index in range(1, len(text)):
            if text[index] in vowels and text[index-1] not in vowels:
                count += 1
        if text.endswith('e'):
            count -= 1
        if text.endswith('le'):
            count += 1
        if count == 0:
            count += 1
        count = count - (0.1*count)
    return (round(count))
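To try the guard without patching the installed package, the same logic can be written as standalone functions. This is only a sketch under stated assumptions, not textstat's actual implementation: returning 0 for an empty token is my choice, and `count_polysyllables` is a hypothetical helper that splits on whitespace and counts tokens with three or more syllables.

```python
def syllable_count(text):
    """Rough syllable estimate for one word; returns 0 for empty input (assumption)."""
    vowels = 'aeiouy'
    text = text.lower().strip(".:;?!)(")
    if not text:
        # Nothing left after stripping punctuation, so there is nothing to index.
        return 0
    count = 0
    if text[0] in vowels:
        count += 1
    for index in range(1, len(text)):
        # Count each vowel that starts a new vowel group.
        if text[index] in vowels and text[index - 1] not in vowels:
            count += 1
    if text.endswith('e'):
        count -= 1
    if text.endswith('le'):
        count += 1
    if count == 0:
        count += 1
    count = count - (0.1 * count)
    return int(round(count))


def count_polysyllables(text):
    """Hypothetical helper: count whitespace-separated tokens with 3+ syllables."""
    return sum(1 for word in text.split() if syllable_count(word) >= 3)
```

With this version, punctuation-only tokens such as "(" or "." contribute zero syllables instead of raising, so a string like "Privacy policy ( ) matters ." can be processed end to end.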