How can I extract "entry.id" values from a Google Form using Beautiful Soup?

Time: 2020-09-03 04:54:35

Tags: python html beautifulsoup google-forms

I am trying to automate a Google Form. I extracted the question text using the following code:

import requests
from bs4 import BeautifulSoup
r=requests.get('https://docs.google.com/forms/d/e/1FAIpQLScio8_OkrBe7wtmw8GeUENvLFVUCAV6eyFOLWhfDbPuunG0Yw/viewform')
cont = BeautifulSoup(r.text,"lxml")
vals = cont.find_all('div', {'class':'freebirdFormviewerComponentsQuestionBaseTitle exportItemTitle freebirdCustomFont'})
print(vals[0].text)

Result: "Name"

But I cannot extract the entry.id values from the following markup:

<div jsname="06bZLc">
    <input type="hidden" name="entry.2005620554" value>
    <input type="hidden" name="entry.1045781291" value>
    <input type="hidden" name="entry.1065046570" value>
    <input type="hidden" name="entry.1166974658" value>
    <input type="hidden" name="entry.839337160" value>
</div>

I tried the following code:

v = cont.find('div', {'jsname': 'o6bZLc'})
x = v.find_all('input')
y = v.find_all_next('input',{'type':'hidden'})
print(x)
print(y)

Result:

[]
[<input name="fvv" type="hidden" value="1"/>, <input name="draftResponse" type="hidden" value='[null,null,"-1617617719642916240"]
'/>, <input name="pageHistory" type="hidden" value="0"/>, <input name="fbzx" type="hidden" value="-1617617719642916240"/>]

But I cannot get the child elements of <div jsname="06bZLc">. Can you help me get those children?

URL:form link

2 answers:

Answer 0 (score: 1)

Please check this:

from bs4 import BeautifulSoup 
html="""<div jsname="06bZLc">
    <input type="hidden" name="entry.2005620554" value>
    <input type="hidden" name="entry.1045781291" value>
    <input type="hidden" name="entry.1065046570" value>
    <input type="hidden" name="entry.1166974658" value>
    <input type="hidden" name="entry.839337160" value>
</div>"""
soup = BeautifulSoup(html,"lxml")
divs = soup.find_all("input")
for i in divs:
    print ((i.attrs['name']).split(".")[1])

Output: 2005620554 1045781291 1065046570 1166974658 839337160
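
The snippet above prints the ID of every input it finds, which only works when every input has a name of the form entry.<digits>. As a minimal sketch of a stricter variant, the code below keeps only inputs whose name matches that pattern. The helper name extract_entry_ids and the extra "fvv" input in the sample are illustrative assumptions, and the built-in "html.parser" is used instead of "lxml" so that no extra parser is required. It can only find inputs that are actually present in the HTML it is given.

```python
import re
from bs4 import BeautifulSoup

# Sample markup modelled on the question, plus one unrelated hidden input.
SAMPLE_HTML = '''<div jsname="06bZLc">
    <input type="hidden" name="entry.2005620554" value>
    <input type="hidden" name="entry.1045781291" value>
    <input type="hidden" name="fvv" value="1">
</div>'''


def extract_entry_ids(html):
    """Return the numeric part of every input named entry.<digits> (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # A compiled regex as the attribute value filters on the name attribute.
    inputs = soup.find_all("input", attrs={"name": re.compile(r"^entry\.\d+$")})
    return [tag["name"].split(".", 1)[1] for tag in inputs]


entry_ids = extract_entry_ids(SAMPLE_HTML)
print(entry_ids)
```

Filtering on the name pattern avoids an IndexError on inputs such as "fvv", whose name contains no dot.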

Edit: using the Google Form link you provided

import requests
from bs4 import BeautifulSoup
r=requests.get('https://docs.google.com/forms/d/e/1FAIpQLScio8_OkrBe7wtmw8GeUENvLFVUCAV6eyFOLWhfDbPuunG0Yw/viewform')
cont = BeautifulSoup(r.text,"lxml")
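# Use `cont` (the parsed page); `soup` was never defined in this snippet.
divs = cont.find_all("input")
for i in divs:
    name = i.attrs.get('name', '')
    # Skip inputs such as "fvv" whose name has no "entry." prefix.
    if name.startswith('entry.'):
        print(name.split(".")[1])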

Output: 2005620554 1045781291 1065046570 1166974658 839337160

Edit: based on the second comment

import requests
from bs4 import BeautifulSoup
r=requests.get('https://docs.google.com/forms/d/e/1FAIpQLScio8_OkrBe7wtmw8GeUENvLFVUCAV6eyFOLWhfDbPuunG0Yw/viewform')
cont = BeautifulSoup(r.text,"lxml")
divs = cont.find_all("input")  # `cont` is the parsed page; `soup` was undefined here
nums=[]
for i in divs:
    nums.extend((i.attrs['name']).split("."))
num=[int(i) for i in nums if i.isdigit()]

Output: [2005620554, 1045781291, 1065046570, 1166974658, 839337160]

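Answer 1 (score: 1)

These IDs are added dynamically via JavaScript, so BeautifulSoup does not see them in the downloaded HTML. You can try the following example, which reads them from the data embedded in the page instead: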

import re
import json
import requests


url = 'https://docs.google.com/forms/d/e/1FAIpQLScio8_OkrBe7wtmw8GeUENvLFVUCAV6eyFOLWhfDbPuunG0Yw/viewform'
html_data = requests.get(url).text

data = json.loads( re.search(r'FB_PUBLIC_LOAD_DATA_ = (.*?);', html_data, flags=re.S).group(1) )

def get_ids(d):
    if isinstance(d, dict):
        for k, v in d.items():
            yield from get_ids(v)
    elif isinstance(d, list):
        if len(d) == 3 and d[1] is None:
            yield d[0]
        else:
            for v in d:
                yield from get_ids(v)

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

for i in get_ids(data):
    print(i)

Prints:

2005620554
1065046570
1166974658
839337160
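
The regular expression above stops at the first semicolon, so it would cut the JSON short if any question text contained one. As a hedged sketch of an alternative, the code below finds the start of the assignment and lets json.JSONDecoder.raw_decode read exactly one JSON value from that position. The page fragment, its nesting, and the helper names load_embedded_json and find_ids are made up for illustration. The real layout of FB_PUBLIC_LOAD_DATA_ is not documented here and may differ.

```python
import json
import re

# Made-up page fragment; note the semicolon inside a question title.
SAMPLE_PAGE = '''<script>var FB_PUBLIC_LOAD_DATA_ = [null, ["Name; first and last", [[2005620554, null, 0]]], ["Email", [[1065046570, null, 1]]]];</script>'''


def load_embedded_json(page, var_name="FB_PUBLIC_LOAD_DATA_"):
    """Parse the JSON value assigned to var_name in the page (hypothetical helper)."""
    match = re.search(re.escape(var_name) + r"\s*=\s*", page)
    if match is None:
        raise ValueError(f"{var_name} not found in page")
    # raw_decode parses one JSON value starting at the given index
    # and ignores whatever follows it.
    data, _end = json.JSONDecoder().raw_decode(page, match.end())
    return data


def find_ids(node):
    """Same heuristic as get_ids: a 3-item list whose middle item is None."""
    if isinstance(node, dict):
        for value in node.values():
            yield from find_ids(value)
    elif isinstance(node, list):
        if len(node) == 3 and node[1] is None:
            yield node[0]
        else:
            for item in node:
                yield from find_ids(item)


ids = list(find_ids(load_embedded_json(SAMPLE_PAGE)))
print(ids)
```

The three-item heuristic is carried over unchanged from the answer, so it can still miss or over-match entries if the payload structure changes.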