Question

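I'm new to Python and have no experience with BeautifulSoup or urllib.

My attempts to Frankenstein my own code together from other questions haven't worked, so I'll try to detail what I want to achieve with the pseudocode and description below: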

import urllib2
from bs4 import BeautifulSoup
for eachurl in "urllist.txt":
    urllib read first (or 2nd or 3rd) url in list
    find.all("<form")
    if number of "<form" > 0:
        result = True
    if number of "<form" == 0:
        result = False

write result to csv/excel/html

table col 1 = url in urllist
table col 2 = result

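Basically, I have a txt file containing a list of URLs. I want urllib to open each URL one by one and check whether the HTML contains a form tag. The output (written to a new file) should have the URL string in the left column and a y or n in the right column, depending on whether a find-all for form tags returns more than 0 results. The loop should of course stop once the URLs in the txt file are exhausted.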

Answer 1

Use requests instead of urllib2.

Try this:

import requests
from bs4 import BeautifulSoup

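# Append one "<url> y" or "<url> n" line to result.txt for each URL in data.txt.
with open('data.txt', 'r') as data, open('result.txt', 'a') as result_file:
    for line in data:
        url = line.strip()
        if not url:
            continue  # skip blank lines instead of requesting an empty URL
        res = requests.get(url).content
        soup = BeautifulSoup(res, 'html.parser')
        # find_all returns an empty (falsy) list when there is no <form> tag
        if soup.find_all('form'):
            result_file.write('{} y\n'.format(url))
        else:
            result_file.write('{} n\n'.format(url))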

In data.txt:

http://stackoverflow.com/questions/34263219/urllib2-and-beautifulsoup-loop-through-urls-and-return-whether-html-contains
http://blank.org/

In result.txt:

http://stackoverflow.com/questions/34263219/urllib2-and-beautifulsoup-loop-through-urls-and-return-whether-html-contains y
http://blank.org/ n
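
Since the question asks for CSV output with the URL in one column and the result in the other, here is a minimal sketch of a CSV variant of the code above. The names `has_form` and `check_urls`, the `fetch` parameter, the `url`/`has_form` header row, the 10-second timeout, and the `error` marker for failed requests are all assumptions made for illustration, not part of the original answer.

```python
import csv

import requests
from bs4 import BeautifulSoup


def has_form(html):
    """Return True if the HTML contains at least one <form> tag."""
    soup = BeautifulSoup(html, 'html.parser')
    return soup.find('form') is not None


def check_urls(in_path, out_path, fetch=None):
    """Write one CSV row per URL: the URL, then 'y', 'n' or 'error'."""
    if fetch is None:
        # Default fetcher (hypothetical helper); the timeout is arbitrary.
        def fetch(url):
            return requests.get(url, timeout=10).content

    with open(in_path) as urls, open(out_path, 'w', newline='') as out:
        writer = csv.writer(out)
        writer.writerow(['url', 'has_form'])
        for line in urls:
            url = line.strip()
            if not url:
                continue  # skip blank lines
            try:
                result = 'y' if has_form(fetch(url)) else 'n'
            except requests.RequestException:
                # Unreachable or invalid URL: record it and keep going.
                result = 'error'
            writer.writerow([url, result])
```

Calling `check_urls('data.txt', 'result.csv')` would read the same input file as above and produce a file that opens directly in Excel. The optional `fetch` argument is only there so the function can be tried without network access.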

urllib2 and BeautifulSoup - loop through URLs and return whether the html contains a "<form" tag
