I'm new to Python and have no experience with BeautifulSoup or urllib.
My attempts to Frankenstein my own code together from other questions haven't worked, so I'll try to spell out what I'm after with the pseudocode and description below:
import urllib2
from bs4 import BeautifulSoup

for eachurl in "urllist.txt":
    urllib read first (or 2nd or 3rd) url in list
    find.all("<form")
    if number of "<form" > 0:
        result = True
    if number of "<form" == 0:
        result = False
    write result to csv/excel/html
        table col 1 = url in urllist
        table col 2 = result
Basically, I have a txt file with a list of URLs; I want urllib to open each URL in turn and check whether its HTML contains a form tag, then write to a new file with the URL string in the left column and a y or n in the right column, depending on whether finding all form tags returns more than zero results, and of course stop once the URLs in the txt file are exhausted.
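Something like this is roughly what I imagine it should look like (a rough, untested sketch; urllist.txt and results.txt are just placeholder names):

import urllib2
from bs4 import BeautifulSoup

with open('urllist.txt') as urls, open('results.txt', 'w') as out:
    for line in urls:
        url = line.strip()
        if not url:
            continue                                  # skip blank lines
        html = urllib2.urlopen(url).read()            # fetch the page
        soup = BeautifulSoup(html, 'html.parser')
        has_form = len(soup.find_all('form')) > 0     # any <form> tag present?
        out.write('{} {}\n'.format(url, 'y' if has_form else 'n'))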
Answer 0 (score: 0)
Use requests instead of urllib2.
Try this:
import requests
from bs4 import BeautifulSoup

with open('data.txt', 'r') as data:
    for line in data:
        # Download the page and parse the HTML
        res = requests.get(line.strip()).content
        soup = BeautifulSoup(res, 'html.parser')
        # Append the URL plus y/n depending on whether any <form> tag was found
        with open('result.txt', 'a') as result_file:
            if soup.find_all('form'):
                result_file.write('{} y\n'.format(line.strip()))
            else:
                result_file.write('{} n\n'.format(line.strip()))
data.txt
http://stackoverflow.com/questions/34263219/urllib2-and-beautifulsoup-loop-through-urls-and-return-whether-html-contains
http://blank.org/
result.txt
http://stackoverflow.com/questions/34263219/urllib2-and-beautifulsoup-loop-through-urls-and-return-whether-html-contains y
http://blank.org/ n
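If you want csv/excel output instead of a plain txt file, a variation along the same lines (a sketch, not tested; result.csv is just a name I picked, assuming Python 2) is to write the same y/n flags with the csv module so the file opens cleanly in Excel:

import csv
import requests
from bs4 import BeautifulSoup

with open('data.txt', 'r') as data, open('result.csv', 'wb') as csv_file:  # 'wb' for Python 2's csv module
    writer = csv.writer(csv_file)
    writer.writerow(['url', 'has_form'])    # header row
    for line in data:
        url = line.strip()
        if not url:
            continue
        soup = BeautifulSoup(requests.get(url).content, 'html.parser')
        # 'y' if the page contains at least one <form> tag, otherwise 'n'
        writer.writerow([url, 'y' if soup.find_all('form') else 'n'])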