I have a DataFrame named urlclean that looks like this:
>>> urlclean
    Matches Searching for                           URL List URL Status
14        2   Green Index  http://greenindex.timberland.com/      Works
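Because it is derived from an earlier DataFrame, the index of the first row is 14. I wrote a second piece of code that opens each URL in "URL List" and searches the page text for every occurrence of the phrase under "Searching for" ("Green Index" in this example). The code is as follows: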
for cindex, row in urlclean.iterrows():
    print("starting cleanup")
    sentence=[]
    sentence=urlopen(urlclean.loc[cindex,'URL List']).read()
    print("opening urls")
    soup=[]
    soup=BeautifulSoup(sentence)
    print("Getsoup")
    rsentence=[]
    rsentence=(soup.get_text())
    print("gettect")
    indices = (i for i,word in enumerate(rsentence)
               if word==(urlclean.loc[cindex,'Searching for']))
    print("getting indices")
    neighbors = []
    for ind in indices:
        neighbors.append(rsentence[ind-2:ind]+rsentence[ind:ind+2])
        print("opening rsetence",(rsentence[ind-2:ind]+rsentence[ind:ind+2]))
        Resulting=[]
        print("got Neighbors", neighbors)
        N=len(neighbors)
        for indexx in range(0,N):
            Resulting_TEMP=[]
            Resulting_TEMP=[(' '.join(map(str,neighbors[indexx])))]
            print("resulting temp",Resulting_TEMP)
            urlclean.loc[cindex,'All Phrases']=Resulting_TEMP
            Resulting.append(Resulting_TEMP)
        print("got results", Resulting)
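I added the print() calls to track how far the code gets. It prints my previously derived DataFrame correctly, but for the code above it exits after printing only this: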
starting cleanup
opening urls
Getsoup
gettect
getting indices
>>>
It never enters the for loops over ind and indexx. Am I missing something? I'm new to Python, so apologies if this is a basic question.
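
For reference, the behaviour can be reproduced without the network call. The sketch below rests on an assumption about the cause rather than a confirmed diagnosis: soup.get_text() returns a single string, so enumerate(rsentence) yields one character at a time, and no single character can equal the two-word phrase "Green Index". The generator is therefore empty and the loop body never runs. The sample text and the helper name find_phrase_neighbors are made up for illustration; the helper splits the text into words and compares word windows instead.

```python
def find_phrase_neighbors(text, phrase, width=2):
    """For each occurrence of `phrase`, return the phrase together with
    up to `width` words on either side, joined into one string."""
    words = text.split()
    target = phrase.split()
    n = len(target)
    results = []
    for i in range(len(words) - n + 1):
        # Compare a window of n words against the words of the phrase
        if words[i:i + n] == target:
            start = max(i - width, 0)  # avoid a negative slice start
            results.append(" ".join(words[start:i + n + width]))
    return results


# Hypothetical stand-in for soup.get_text(); not the real page text
rsentence = "Welcome to the Green Index rating system. The Green Index measures impact."
phrase = "Green Index"

# Character-level comparison, as in the loop above: nothing ever matches
char_indices = [i for i, ch in enumerate(rsentence) if ch == phrase]
print(char_indices)  # []

neighbors = find_phrase_neighbors(rsentence, phrase)
print(neighbors)
```

Note that this compares whole words exactly, so an occurrence with punctuation attached (for example "Green Index.") would not match unless the text is cleaned first.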