urllib2 and BeautifulSoup - loop through URLs and return whether the HTML contains a "<form" tag

Time: 2015-12-14 09:14:51

Tags: python html beautifulsoup urllib2


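I'm new to Python and have no experience with BeautifulSoup or urllib.

I tried to Frankenstein my own code together from other questions, without success, so I'll describe what I want to achieve with the pseudocode and explanation below: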

import urllib2
from bs4 import BeautifulSoup
for eachurl in "urllist.txt":
    urllib read first (or 2nd or 3rd) url in list
    find.all("<form")
    if number of "<form" > 0:
        result = True
    if number of "<form" == 0:
        result = False

write result to csv/excel/html

table col 1 = url in urllist
table col 2 = result

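Basically, I have a txt file with a list of URLs. I want urllib to open each URL one by one and check whether the HTML contains a form tag. It should then write to a new file, with the URL string in the left column and y or n in the right column, depending on whether the find-all for form tags returned more than 0 results. It should of course stop once the URLs in the txt file are exhausted.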

1 answer:

Answer 0: (score: 0)

Use requests instead of urllib2.

Try this:

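import requests
from bs4 import BeautifulSoup

# Read one URL per line from data.txt and append "<url> y" or "<url> n" to result.txt.
with open('data.txt', 'r') as data, open('result.txt', 'a') as result_file:
    for line in data:
        url = line.strip()
        if not url:
            continue  # skip blank lines, which requests.get would reject
        res = requests.get(url).content
        soup = BeautifulSoup(res, 'html.parser')
        # find_all returns an empty (falsy) result when the page has no <form> tag
        flag = 'y' if soup.find_all('form') else 'n'
        result_file.write('{} {}\n'.format(url, flag))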

data.txt:

http://stackoverflow.com/questions/34263219/urllib2-and-beautifulsoup-loop-through-urls-and-return-whether-html-contains
http://blank.org/

result.txt:

http://stackoverflow.com/questions/34263219/urllib2-and-beautifulsoup-loop-through-urls-and-return-whether-html-contains y
http://blank.org/ n
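
The question asked for a two-column table (URL, result) in CSV form, while the code above writes space-separated lines. Below is a rough sketch of a CSV variant, not a definitive implementation. It assumes Python 3 and, so that it runs without extra installs, swaps BeautifulSoup for the standard library's html.parser. The names FormFinder, has_form, write_report and the fetch parameter are made up for this illustration, and the example.com pages are canned test data rather than real sites.

```python
import csv
import io
from html.parser import HTMLParser


class FormFinder(HTMLParser):
    """Records whether a <form> start tag was seen while parsing."""

    def __init__(self):
        super().__init__()
        self.found_form = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase
        if tag == 'form':
            self.found_form = True


def has_form(html):
    """Return True if the HTML string contains at least one <form> tag."""
    finder = FormFinder()
    finder.feed(html)
    finder.close()
    return finder.found_form


def write_report(urls, fetch, out):
    """Write a header row, then one CSV row per URL: the URL and 'y' or 'n'.

    fetch is a hypothetical callable that takes a URL and returns the
    page HTML as a string; out is any writable text file object.
    """
    writer = csv.writer(out, lineterminator='\n')
    writer.writerow(['url', 'has_form'])
    for raw in urls:
        url = raw.strip()
        if not url:
            continue  # ignore blank lines in the URL list
        writer.writerow([url, 'y' if has_form(fetch(url)) else 'n'])


# Demo with canned pages instead of real network requests
pages = {
    'http://a.example.com/': '<html><form action="/x"></form></html>',
    'http://b.example.com/': '<html><p>no forms here</p></html>',
}
buf = io.StringIO()
write_report(['http://a.example.com/\n', '\n', 'http://b.example.com/\n'], pages.get, buf)
print(buf.getvalue())
```

For real URLs, fetch could be something like lambda u: requests.get(u, timeout=10).text, and out could be a file opened with open('result.csv', 'w', newline=''). If you prefer to keep BeautifulSoup, has_form can be replaced with a check on soup.find_all('form') as in the answer.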