I'm trying to automate a Google Form. I extracted the question text with the following code:
import requests
from bs4 import BeautifulSoup
r=requests.get('https://docs.google.com/forms/d/e/1FAIpQLScio8_OkrBe7wtmw8GeUENvLFVUCAV6eyFOLWhfDbPuunG0Yw/viewform')
cont = BeautifulSoup(r.text,"lxml")
vals = cont.find_all('div', {'class':'freebirdFormviewerComponentsQuestionBaseTitle exportItemTitle freebirdCustomFont'})
print(vals[0].text)
Result: "Name"
But I can't extract the entry IDs from the following markup:
<div jsname="06bZLc">
<input type="hidden" name="entry.2005620554" value>
<input type="hidden" name="entry.1045781291" value>
<input type="hidden" name="entry.1065046570" value>
<input type="hidden" name="entry.1166974658" value>
<input type="hidden" name="entry.839337160" value>
</div>
I tried the following code:
v = cont.find('div', {'jsname': 'o6bZLc'})
x = v.find_all('input')
y = v.find_all_next('input',{'type':'hidden'})
print(x)
print(y)
Result:
[]
[<input name="fvv" type="hidden" value="1"/>, <input name="draftResponse" type="hidden" value='[null,null,"-1617617719642916240"]
'/>, <input name="pageHistory" type="hidden" value="0"/>, <input name="fbzx" type="hidden" value="-1617617719642916240"/>]
But I can't get the children of <div jsname="06bZLc">.
Can you help me get those child elements?
Answer 0 (score: 1)
Please check this:
from bs4 import BeautifulSoup
html="""<div jsname="06bZLc">
<input type="hidden" name="entry.2005620554" value>
<input type="hidden" name="entry.1045781291" value>
<input type="hidden" name="entry.1065046570" value>
<input type="hidden" name="entry.1166974658" value>
<input type="hidden" name="entry.839337160" value>
</div>"""
soup = BeautifulSoup(html, "lxml")
inputs = soup.find_all("input")
for i in inputs:
    # each name looks like "entry.<id>"; print the part after the dot
    print(i.attrs['name'].split(".")[1])
Output: 2005620554 1045781291 1065046570 1166974658 839337160
Edit: using the Google Form link you provided
import requests
from bs4 import BeautifulSoup
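r = requests.get('https://docs.google.com/forms/d/e/1FAIpQLScio8_OkrBe7wtmw8GeUENvLFVUCAV6eyFOLWhfDbPuunG0Yw/viewform')
cont = BeautifulSoup(r.text, "lxml")
# search the parsed page, which is stored in `cont`
inputs = cont.find_all("input")
for i in inputs:
    name = i.attrs.get('name', '')
    # skip inputs such as "fvv" whose name is not "entry.<id>"
    if name.startswith("entry."):
        print(name.split(".")[1])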
Output: 2005620554 1045781291 1065046570 1166974658 839337160
Edit: following the second comment
import requests
from bs4 import BeautifulSoup
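r = requests.get('https://docs.google.com/forms/d/e/1FAIpQLScio8_OkrBe7wtmw8GeUENvLFVUCAV6eyFOLWhfDbPuunG0Yw/viewform')
cont = BeautifulSoup(r.text, "lxml")
inputs = cont.find_all("input")
nums = []
for i in inputs:
    # split each name on "." so the numeric part becomes its own item
    nums.extend(i.attrs.get('name', '').split("."))
# keep only the purely numeric pieces and convert them to int
num = [int(i) for i in nums if i.isdigit()]
print(num)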
Output: [2005620554 1045781291 1065046570 1166974658 839337160]
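The snippets above split every input name on "." and keep what follows. As a rough sketch of a stricter filter, the names could be matched against the "entry.<digits>" shape so that the other hidden inputs seen in the question's output (fvv, draftResponse, pageHistory, fbzx) are ignored. The helper name entry_ids and the sample list are made up for illustration; only the name strings come from the question.

```python
import re

# Names copied from the question's markup and printed output.
names = [
    "fvv",
    "draftResponse",
    "pageHistory",
    "fbzx",
    "entry.2005620554",
    "entry.1045781291",
]

# Matches only names of the form "entry.<digits>".
ENTRY_RE = re.compile(r"entry\.(\d+)")


def entry_ids(input_names):
    """Return the numeric IDs of names shaped like 'entry.<digits>' (hypothetical helper)."""
    ids = []
    for name in input_names:
        match = ENTRY_RE.fullmatch(name)
        if match:
            ids.append(int(match.group(1)))
    return ids


print(entry_ids(names))  # [2005620554, 1045781291]
```

With BeautifulSoup the list of names would come from something like [i.get('name', '') for i in cont.find_all('input')], which is not run here so the example stays offline.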
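Answer 1 (score: 1)
These IDs are added dynamically by JavaScript, so they are not in the HTML that requests downloads and BeautifulSoup never sees them. You can try the following example, which reads them from the FB_PUBLIC_LOAD_DATA_ variable embedded in the page: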
import re
import json
import requests
url = 'https://docs.google.com/forms/d/e/1FAIpQLScio8_OkrBe7wtmw8GeUENvLFVUCAV6eyFOLWhfDbPuunG0Yw/viewform'
html_data = requests.get(url).text
# the form definition is embedded in the page as a JavaScript variable holding JSON
data = json.loads(re.search(r'FB_PUBLIC_LOAD_DATA_ = (.*?);', html_data, flags=re.S).group(1))
def get_ids(d):
    # walk the nested structure recursively
    if isinstance(d, dict):
        for k, v in d.items():
            yield from get_ids(v)
    elif isinstance(d, list):
        # a 3-item list whose middle item is None is treated as [id, None, ...]
        if len(d) == 3 and d[1] is None:
            yield d[0]
        else:
            for v in d:
                yield from get_ids(v)
# uncomment this to print all data:
# print(json.dumps(data, indent=4))
for i in get_ids(data):
    print(i)
Prints:
2005620554
1065046570
1166974658
839337160
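One thing to watch in the regex above: the non-greedy (.*?); stops at the first semicolon, so a semicolon inside a question title would cut the JSON short and json.loads would fail. A possible workaround, sketched here on a made-up offline sample, is to let json.JSONDecoder().raw_decode parse one JSON value starting right after the "=" sign. SAMPLE_HTML and load_public_data are hypothetical names, and the sample is a heavily simplified stand-in for the real payload, whose exact layout is not shown in this thread.

```python
import json
import re

# Hypothetical, heavily simplified stand-in for a form page.
# Note the ';' inside the title string.
SAMPLE_HTML = (
    '<script>var FB_PUBLIC_LOAD_DATA_ = '
    '[null, ["Name; first and last", [[2005620554, null, 1]], [[1065046570, null, 0]]]]'
    ';</script>'
)


def load_public_data(html):
    """Parse the JSON value assigned to FB_PUBLIC_LOAD_DATA_ (hypothetical helper)."""
    match = re.search(r"FB_PUBLIC_LOAD_DATA_\s*=\s*", html)
    if match is None:
        raise ValueError("FB_PUBLIC_LOAD_DATA_ not found")
    # raw_decode parses a single JSON value starting at the given index and
    # ignores the text after it, so a ';' inside a string does no harm.
    data, _end = json.JSONDecoder().raw_decode(html, match.end())
    return data


def get_ids(d):
    """Same heuristic as in the answer: a 3-item list whose middle item is None."""
    if isinstance(d, list):
        if len(d) == 3 and d[1] is None:
            yield d[0]
        else:
            for v in d:
                yield from get_ids(v)


print(list(get_ids(load_public_data(SAMPLE_HTML))))  # [2005620554, 1065046570]
```

The index passed to raw_decode has to point at the first character of the JSON value, which is why the pattern also consumes the whitespace after the "=" sign.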