Question

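I know this question, or similar ones, has been asked before. But the ones I found did not give me the right answer, so I am asking here.

How can I get the text of an HTML website so that I can compare it with other given values?

Let's say I have this web page: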

<html>
<head>
<title>This is my page</title>

<center>
<div class="mon_title">Some title here</div>
<table class="mon_list" >
<tr class='list'><th class="list" align="center"></th><th class="list" align="center">Set 1</th><th class="list" align="center">Set 2</th><th class="list" align="center">Set 4</th><th class="list" align="center">Set 5</th><th class="list" align="center">Set 6</th><th class="list" align="center">Set 7</th><th class="list" align="center">Set 8</th><th class="list" align="center">Set 9</th><th class="list" align="center">Set 10</th><th class="list" align="center">Set 11</th><th class="list" align="center">Set 12</th></tr>
<tr class='list even'><td class="list" align="center">Value 1</td><td class="list" align="center">Value 2</td><td class="list" align="center">Value 3</td><td class="list" align="center">Value 4</td><td class="list" align="center">Value 5</td><td class="list">Value 6</td><td class="list">Value 7</td><td class="list" align="center">Value 8</td><td class="list" align="center">Value 9</td><td class="list" align="center">Value 10</td><td class="list" align="center">Value 11</td><td class="list" align="center">Value 12</td></tr>
<tr class='list even'><td class="list" align="center">Value 1</td><td class="list" align="center">Value 2</td><td class="list" align="center">Value 3</td><td class="list" align="center">Value 4</td><td class="list" align="center">Value 5</td><td class="list">Value 6</td><td class="list">Value 7</td><td class="list" align="center">Value 8</td><td class="list" align="center">Value 9</td><td class="list" align="center">Value 10</td><td class="list" align="center">Value 11</td><td class="list" align="center">Value 12</td></tr>
</table>

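Sorry for any typos or missing parts; I hope you get the idea. Now, my program should check whether certain values in the table match given values, for example "Is Value 2 somewhere?" and, if it actually is, it should ask "Is Value 5 in the same row?"

Is this possible in general? How much effort would it take to build that program?

All I have so far is this Python code, which downloads the complete HTML of the page: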

import requests

url = 'http://some.random.site.com/you/ad/here'
print (requests.get(url).text)

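It gives me the HTML code seen above. Instead, I want what you get when you press CTRL + A on the website and copy and paste the result into an editor file.

PS: I'm very new to programming, so sorry if there are concepts I don't really get yet. And sorry for my English, I'm German...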

Answer 1

You can use urllib and re to find the values:

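import re
import urllib.request

url = 'http://some.random.site.com/you/ad/here'

# Decode the response bytes (UTF-8 is assumed here); str() on bytes would give the b'...' repr
data = urllib.request.urlopen(url).read().decode('utf-8', errors='replace')

# Raw string so that \d is not treated as a string escape
values = re.findall(r"Value \d+", data)
print(values)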

Output:

['Value 1', 'Value 2', 'Value 3', 'Value 4', 'Value 5', 'Value 6', 'Value 7', 'Value 8', 'Value 9', 'Value 10', 'Value 11', 'Value 12', 'Value 1', 'Value 2', 'Value 3', 'Value 4', 'Value 5', 'Value 6', 'Value 7', 'Value 8', 'Value 9', 'Value 10', 'Value 11', 'Value 12']
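
The flat list above no longer shows which row each value came from, and the "same row" check in the question needs that. Here is a minimal sketch of one way to keep the rows apart, still using only re. The sample string `data` and the helper name `values_by_row` are made up for illustration, and keep in mind that regular expressions on HTML are fragile in general:

```python
import re

# Hypothetical sample standing in for the downloaded page text.
data = (
    "<table class=\"mon_list\">"
    "<tr class='list'><th class=\"list\">Set 1</th><th class=\"list\">Set 2</th></tr>"
    "<tr class='list even'><td class=\"list\">Value 1</td><td class=\"list\">Value 2</td></tr>"
    "<tr class='list even'><td class=\"list\">Value 3</td><td class=\"list\">Value 5</td></tr>"
    "</table>"
)

def values_by_row(html):
    """Return one list of 'Value N' strings per <tr>...</tr> block."""
    # Non-greedy match so each <tr> block is captured on its own.
    row_blocks = re.findall(r"<tr\b.*?</tr>", html, flags=re.DOTALL)
    return [re.findall(r"Value \d+", block) for block in row_blocks]

rows = values_by_row(data)
print(rows)
```

With this sample, the header row yields an empty list because it contains no "Value N" text, followed by one list per data row.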

Answer 2

You can use a parsing library such as Beautiful Soup. Your question has also already been answered here.
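
To give a rough idea of what "getting the text instead of the HTML" can look like, here is a small sketch that uses the standard library's html.parser as a stand-in rather than Beautiful Soup itself. The class name `TextCollector` and the sample `html` string are invented for this example; a real page with scripts or styles would likely need extra filtering:

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collect the text between tags, skipping whitespace-only pieces."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called by the parser for text found between tags.
        text = data.strip()
        if text:
            self.parts.append(text)

# Hypothetical snippet standing in for requests.get(url).text
html = (
    '<div class="mon_title">Some title here</div>'
    '<table class="mon_list">'
    '<tr><td class="list">Value 1</td><td class="list">Value 2</td></tr>'
    '</table>'
)

collector = TextCollector()
collector.feed(html)
collector.close()

page_text = "\n".join(collector.parts)
print(page_text)
```

With Beautiful Soup, the comparable call would be get_text() on the parsed document.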

Answer 3

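import requests
from bs4 import BeautifulSoup

url = 'http://some.random.site.com/you/ad/here'
# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
table = soup.find(class_='mon_list')

listy = []
for tr in table.find_all('tr'):
    # The header row has only <th> cells, so it contributes an empty list
    cols = tr.find_all('td')
    listy.append([elem.get_text() for elem in cols])
print(listy)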

This gives you the values as a nested list:

[[], ['Value 1', 'Value 2', 'Value 3', 'Value 4', 'Value 5', 'Value 6', 'Value 7', 'Value 8', 'Value 9', 'Value 10', 'Value 11', 'Value 12'], ['Value 1', 'Value 2', 'Value 3', 'Value 4', 'Value 5', 'Value 6', 'Value 7', 'Value 8', 'Value 9', 'Value 10', 'Value 11', 'Value 12']]
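
Once the table is a nested list like this, the check from the question ("Is Value 2 somewhere, and is Value 5 in the same row?") could be written in plain Python. This is only a sketch: the helper name `rows_containing` is hypothetical, and the shortened `listy` below is a made-up stand-in for the real output:

```python
# Hypothetical, shortened stand-in for the nested list printed above.
listy = [
    [],
    ['Value 1', 'Value 2', 'Value 3', 'Value 4', 'Value 5'],
    ['Value 1', 'Value 2', 'Value 3'],
]

def rows_containing(rows, wanted):
    """Return the indexes of the rows that contain every string in `wanted`."""
    return [i for i, row in enumerate(rows) if all(w in row for w in wanted)]

# Rows that contain 'Value 2' at all
with_value_2 = rows_containing(listy, ['Value 2'])
# Rows that contain both 'Value 2' and 'Value 5'
with_both = rows_containing(listy, ['Value 2', 'Value 5'])

print(with_value_2)
print(with_both)
```

An empty result means no row matched, so `if with_both:` is enough to answer the yes/no question.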
