Question

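Goal: write a screen scraper that checks whether web pages contain certain content.

Approach: there are two config files, one containing a list of URLs and the other containing a list of strings to search for. Open both files and read their contents into two arrays.

Loop through the array of URLs (let's call this loop A).

For each URL, read in the page using urllib and split it into an array by splitting on \n. Loop through the list of strings (loop B).

For each line in the strings, loop through the lines of HTML (loop C) and do a pattern match on each line. If a match is found, log the result in the output file.

Problem: the config files open fine. Loop A works fine. Loops B and C only work on the first pass of loop A. On the second and third passes of loop A, loop B does not run at all.

Forgive all the debugging code I have put in. One odd quirk is a mysterious 'b' that shows up in the output produced by line 52 of the code.

Config file contents: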

urls.txt

http://uk.norton.com
http://us.norton.com
http://ie.norton.com

targetStrings.txt

Norton Online Backup
Norton Ultimate Help Desk

Code:

# Import the modules we need
import urllib.request
import re

# Open the files we need
out = open('out.txt', 'w')
urls=open('urls.txt','r')
targetFile=open('targetStrings.txt','r',encoding=('utf-8'))

# function to take a URL, open the HTML, split it into an array, and return it
def getPage(url):
    return urllib.request.urlopen(url).read().decode().split('\n')

# function to kick out to an output file
def outFile(output):
    out.write(output + '\n')

# Function to test for matches    
def match(string, pageLine):
    if re.search(string.encode('utf-8'),pageLine):
        return True
    else:
        return False


#Loop through the URLs - Loop A
for url in urls:
    url=url.rstrip('\n')
    outFile('\nOpening ' + url) 
#    response=urllib.request.urlopen(url)
#    html=response.read().decode()
    html=getPage(str(url))
    if html !='':
        outFile('Page read successfully')
    else:
        outFile('Problem reading page')

    outFile(url + ' has ' + str(len(html)) + ' lines')

    #Loop through targetStrings - Loop B. This is only happening on the first pass of loop A.
    for line in targetFile:
        outFile('Beginning \'for line in targetFile:\' loop')
        line=line.rstrip('\n') #take out any \n newline characters at the end
        outFile('Looking for ' + line + ' in ' + url)
        foundCount=0

        # Loop through current HTML file - Loop C
        pageLineNumber=0
        for pageLine in html:
            pageLineNumber+=1
            pageLine=pageLine.encode('utf-8')
            outFile('Looking for ' + str(line) + ' in ' + str(pageLineNumber) + ' ' + str(pageLine))
            if match(line, pageLine):
                foundCount+=1
                outFile('FoundCount is ' + str(foundCount))
        outFile('Searched ' + str(pageLineNumber) + ' lines')

        if foundCount==0:
            outFile('Did not find ' + str(line))
        else:
            s=''
            if foundCount>1:
                s='s'
            outFile('Found ' + line + ' ' + str(foundCount) + ' time' + s)
            foundCount=0
out.close()
urls.close()
targetFile.close()

Answer 1

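The issue is not in your nested for loops themselves. With for line in targetFile:, you are reading "targetFile" on every iteration of the outer loop. Once the file has been read completely, the read pointer sits at the end of the file, and iterating over the same file object again yields nothing. You need to either create a new file object or use file_obj.seek(0) to move the read pointer back to the beginning of the file. So you can add targetFile.seek(0) after the for line in targetFile: loop, as the last line of the outer loop.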

for url in urls:
    # outer loop code
    for line in targetFile:
        # inner loop code
        pass
    targetFile.seek(0)  # rewind so the next URL can iterate the file again

out.close()
urls.close()
targetFile.close()
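
The behaviour described above can be reproduced in a few lines. This is only a minimal sketch: it uses an in-memory io.StringIO as a stand-in for targetFile (an assumption, so that it runs without any files on disk), and the names target_file, first_pass, second_pass and third_pass are illustrative. A file opened with open() should behave the same way when iterated.

```python
import io

# Stand-in for open('targetStrings.txt'); io.StringIO is an assumption made
# only so this sketch runs without files on disk.
target_file = io.StringIO("Norton Online Backup\nNorton Ultimate Help Desk\n")

# First iteration consumes the whole file.
first_pass = [line.rstrip("\n") for line in target_file]

# The read pointer is now at the end, so a second loop yields nothing.
second_pass = [line.rstrip("\n") for line in target_file]

# Rewind to the start; the lines are available again.
target_file.seek(0)
third_pass = [line.rstrip("\n") for line in target_file]

print(first_pass)
print(second_pass)
print(third_pass)
```

The empty second pass corresponds to loop B silently doing nothing on the second and third URLs.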

Another, better option suggested by @pvg is to read all the lines into a list

targetLines=open('targetStrings.txt','r',encoding=('utf-8')).readlines()

and then use that list

for line in targetLines:

since that is more efficient than reading the file again and again.
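
A fuller sketch of the list approach, under some stated assumptions: load_lines and count_matches are hypothetical helpers, the pages dict stands in for what getPage(url) would return (so no network access is needed), and a plain substring test with in is swapped in for re.search to keep the example short. The target file is written to a temporary directory so the block is self-contained.

```python
import os
import tempfile

def load_lines(path):
    """Read a text file once and return its non-empty lines without newlines."""
    with open(path, "r", encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle if line.strip()]

def count_matches(target, page_lines):
    """Count the page lines that contain the target string."""
    return sum(1 for page_line in page_lines if target in page_line)

# Write a temporary targetStrings.txt so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "targetStrings.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("Norton Online Backup\nNorton Ultimate Help Desk\n")
    target_lines = load_lines(path)

# Hypothetical page contents standing in for the result of getPage(url).
pages = {
    "http://uk.norton.com": ["<h1>Norton Online Backup</h1>", "<p>Other text</p>"],
    "http://us.norton.com": ["<p>Norton Ultimate Help Desk</p>", "<p>Norton Online Backup</p>"],
}

results = {}
for url, page_lines in pages.items():
    # A list can be iterated as many times as needed, unlike the file object.
    for target in target_lines:
        results[(url, target)] = count_matches(target, page_lines)

for (url, target), count in results.items():
    print(url, target, count)
```

Because target_lines is an ordinary list, the inner loop runs for every URL and no seek(0) call is needed.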

Answer 2

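The problem here is that once you have iterated over the file targetFile the first time, its read pointer is at the end of the file, so when you try to loop over it again you get nothing because you have already reached the end. There are 2 things you can do to solve this:

Move the read pointer back to the start with targetFile.seek(0), either before or after you iterate.
Read all the lines of the file into a variable and iterate over that instead.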

Python nested loop only works on the first pass

2 answers: