I have a Document object:
Document secDoc = Jsoup.connect(a.attr("abs:href")).timeout(30*1000).get();
String txt = secDoc.text();
When I debug the code above and inspect the value of secDoc, I get the normal page source, which contains this element:
For questions about your order, including anything shipping or billing related, please email <script type="text/javascript">write_email('oatmealsupport','gmail.com')</script>.
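If you view the web page in a browser, you see the following line instead: For questions about your order, including anything shipping or billing related, please email oatmealsupport@gmail.com. We only do email support at this time.
Interestingly, this script generates the email address on the page. When I use Inspect Element, I get: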
<p>
For questions about your order, including anything shipping or billing related, please email <a href="mailto:oatmealsupport@gmail.com">oatmealsupport@gmail.com</a><script type="text/javascript">write_email('oatmealsupport','gmail.com')</script>.
We only do email support at this time.<br><br>
Hours of operation: <strong>Monday-Friday 8am - 6pm PT.</strong>
<br>
<strong>Shipping Times</strong>:
We strive to fulfill the orders within 3-5 working days. When we are really busy we may take a day or two longer.
We ship orders Monday - Friday, so if your order is placed Friday evening we may not be able to process it until the following Monday.
If we are behind, it may be a few days before we respond. The Oatmeal is an extremely small operation so please be patient.
<br>
<a href="http://shop.theoatmeal.com/pages/shipping">More Shipping Info</a><br><br>
Questions about shirt sizes? <a href="http://shop.theoatmeal.com/pages/shipping#shirts">Shirt Sizing Info</a>
</p>
So the anchor <a href="mailto:oatmealsupport@gmail.com">oatmealsupport@gmail.com</a>
is generated by the script.
Is there any way I can get this anchor using Jsoup (or any other method)?
Answer 0 (score: 1)
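For this particular site, the user and domain parts of the address are both in the script tag. So select the script tag, get its contents, parse them with a regular expression, and concatenate the user and domain with an @ in between. Your selector could be just script:contains(write_email), assuming write_email is not used elsewhere on the page. This only works because the address is exposed in the page source, even though it is split into two parts.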
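The regex step described above could look roughly like the following. This is a minimal sketch and not a definitive implementation: the class name `WriteEmailParser` and the method `extractEmail` are hypothetical, and it assumes the call always has the form `write_email('user','domain')` with single-quoted arguments, as in the question. It uses only the Java standard library and takes the script contents as a plain string. With Jsoup, that string would probably come from the selected script element; as far as I know, Jsoup stores script contents as data rather than text, so `script:containsData(write_email)` together with `Element.data()` may be needed instead of `:contains` and `text()`.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WriteEmailParser {
    // Matches write_email('user','domain') and captures the two quoted arguments.
    // Assumption: both arguments are single-quoted and contain no quotes themselves.
    private static final Pattern WRITE_EMAIL = Pattern.compile(
            "write_email\\(\\s*'([^']*)'\\s*,\\s*'([^']*)'\\s*\\)");

    // Returns user@domain if the script text contains a write_email call,
    // or an empty Optional if there is no match.
    static Optional<String> extractEmail(String scriptText) {
        Matcher m = WRITE_EMAIL.matcher(scriptText);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(m.group(1) + "@" + m.group(2));
    }

    public static void main(String[] args) {
        // In real use this string would be the contents of the selected script tag.
        String scriptText = "write_email('oatmealsupport','gmail.com')";
        System.out.println(extractEmail(scriptText).orElse("(no match)"));
        // prints "oatmealsupport@gmail.com"
    }
}
```

Returning an `Optional` makes the "no write_email call found" case explicit, which is useful if the same code is run against pages that do not use this script.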
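In general, Jsoup is not a JavaScript engine. If you want to see the same page that a person using a web browser sees, you could try a browser automation tool such as Selenium.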