JSoup collapses the body into one line

Date: 2018-02-07 21:29:20

Tags: java scala jsoup

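Is there any way to preserve line breaks in a plain-text page using JSoup? I am trying to pull robots.txt, but instead of coming back line by line, the whole body tag is pulled as a single line.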

var response = Jsoup.connect("http://www.facebook.com/robots.txt").userAgent(userAgent).followRedirects(true).execute()
println(response.parse().body().text())

I get the text back on a single line, like this:

# Notice: Crawling Facebook is prohibited unless you have express written # permission. See: http://www.facebook.com/apps/site_scraping_tos_terms.php User-agent: Applebot Disallow: /ajax/ Disallow: /album.php Disallow: /checkpoint/ Disallow: /contact_importer/ Disallow: /feeds/ Disallow: /file_download.php Disallow: /hashtag/ Disallow: /l.php Disallow: /live/ Disallow: /moments_app/ Disallow: /p.php Disallow: /photo.php Disallow: /photos.php Disallow: /sharer/ User-agent: baiduspider Disallow: /ajax/ Disallow: /album.php Disallow: /checkpoint/ Disallow: /contact_importer/ Disallow: /feeds/ Disallow: /file_download.php Disallow: /hashtag/ Disallow: /l.php Disallow: /live/ Disallow: /moments_app/ Disallow: /p.php Disallow: /photo.php Disallow: /photos.php Disallow: /sharer/ User-agent: Bingbot Disallow: /ajax/ Disallow: /album.php Disallow: /checkpoint/ Disallow: /contact_importer/ Disallow: /feeds/ Disallow: /file_download.php Disallow: /hashtag/ Disallow: /l.php Disallow: /live/ Disallow: /moments_app/ Disallow: /p.php Disallow: /photo.php Disallow: /photos.php Disallow: /sharer/ User-agent: Googlebot Disallow: /ajax/ Disallow: /album.php Disallow: /checkpoint/ Disallow: /contact_importer/ Disallow: /feeds/ Disallow: /file_download.php Disallow: /hashtag/ Disallow: /l.php Disallow: /live/ Disallow: /moments_app/ Disallow: /p.php Disallow: /photo.php Disallow: /photos.php Disallow: /sharer/ User-agent: ia_archiver Disallow: / Disallow: /ajax/ Disallow: /album.php Disallow: /checkpoint/ Disallow: /contact_importer/ Disallow: /feeds/ Disallow: /file_download.php Disallow: /hashtag/ Disallow: /l.php Disallow: /live/ Disallow: /moments_app/ Disallow: /p.php Disallow: /photo.php Disallow: /photos.php Disallow: /sharer/ User-agent: msnbot Disallow: /ajax/ Disallow: /album.php Disallow: /checkpoint/ Disallow: /contact_importer/ Disallow: /feeds/ Disallow: /file_download.php Disallow: /hashtag/ Disallow: /l.php Disallow: /live/ Disallow: /moments_app/ Disallow: /p.php 
Disallow: /photo.php Disallow: /photos.php Disallow: /sharer/ User-agent: Naverbot Disallow: /ajax/ Disallow: /album.php Disallow: /checkpoint/ Disallow: /contact_importer/ Disallow: /feeds/ Disallow: /file_download.php Disallow: /hashtag/ Disallow: /l.php Disallow: /live/ Disallow: /moments_app/ Disallow: /p.php Disallow: /photo.php Disallow: /photos.php Disallow: /sharer/ User-agent: seznambot Disallow: /ajax/ Disallow: /album.php Disallow: /checkpoint/ Disallow: /contact_importer/ Disallow: /feeds/ Disallow: /file_download.php Disallow: /hashtag/ Disallow: /l.php Disallow: /live/ Disallow: /moments_app/ Disallow: /p.php Disallow: /photo.php Disallow: /photos.php Disallow: /sharer/ User-agent: Slurp Disallow: /ajax/ Disallow: /album.php Disallow: /checkpoint/ Disallow: /contact_importer/ Disallow: /feeds/ Disallow: /file_download.php Disallow: /hashtag/ Disallow: /l.php Disallow: /live/ Disallow: /moments_app/ Disallow: /p.php Disallow: /photo.php Disallow: /photos.php Disallow: /sharer/ User-agent: teoma Disallow: /ajax/ Disallow: /album.php Disallow: /checkpoint/ Disallow: /contact_importer/ Disallow: /feeds/ Disallow: /file_download.php Disallow: /hashtag/ Disallow: /l.php Disallow: /live/ Disallow: /moments_app/ Disallow: /p.php Disallow: /photo.php Disallow: /photos.php Disallow: /sharer/ User-agent: Twitterbot Disallow: /ajax/ Disallow: /album.php Disallow: /checkpoint/ Disallow: /contact_importer/ Disallow: /feeds/ Disallow: /file_download.php Disallow: /hashtag/ Disallow: /l.php Disallow: /live/ Disallow: /moments_app/ Disallow: /p.php Disallow: /photo.php Disallow: /photos.php Disallow: /sharer/ User-agent: Yandex Disallow: /ajax/ Disallow: /album.php Disallow: /checkpoint/ Disallow: /contact_importer/ Disallow: /feeds/ Disallow: /file_download.php Disallow: /hashtag/ Disallow: /l.php Disallow: /live/ Disallow: /moments_app/ Disallow: /p.php Disallow: /photo.php Disallow: /photos.php Disallow: /sharer/ User-agent: Yeti Disallow: /ajax/ Disallow: 
/album.php Disallow: /checkpoint/ Disallow: /contact_importer/ Disallow: /feeds/ Disallow: /file_download.php Disallow: /hashtag/ Disallow: /l.php Disallow: /live/ Disallow: /moments_app/ Disallow: /p.php Disallow: /photo.php Disallow: /photos.php Disallow: /sharer/ User-agent: Applebot Allow: /ajax/pagelet/generic.php/PagePostsSectionPagelet Allow: /safetycheck/ User-agent: baiduspider Allow: /ajax/pagelet/generic.php/PagePostsSectionPagelet Allow: /safetycheck/ User-agent: Bingbot Allow: /ajax/pagelet/generic.php/PagePostsSectionPagelet Allow: /safetycheck/ User-agent: Googlebot Allow: /ajax/pagelet/generic.php/PagePostsSectionPagelet Allow: /safetycheck/ User-agent: ia_archiver Allow: /about/privacy Allow: /ajax/pagelet/generic.php/PagePostsSectionPagelet Allow: /full_data_use_policy Allow: /legal/terms Allow: /policy.php Allow: /safetycheck/ User-agent: msnbot Allow: /ajax/pagelet/generic.php/PagePostsSectionPagelet Allow: /safetycheck/ User-agent: Naverbot Allow: /ajax/pagelet/generic.php/PagePostsSectionPagelet Allow: /safetycheck/ User-agent: seznambot Allow: /ajax/pagelet/generic.php/PagePostsSectionPagelet Allow: /safetycheck/ User-agent: Slurp Allow: /ajax/pagelet/generic.php/PagePostsSectionPagelet Allow: /safetycheck/ User-agent: teoma Allow: /ajax/pagelet/generic.php/PagePostsSectionPagelet Allow: /safetycheck/ User-agent: Twitterbot Allow: /ajax/pagelet/generic.php/PagePostsSectionPagelet Allow: /safetycheck/ User-agent: Yandex Allow: /ajax/pagelet/generic.php/PagePostsSectionPagelet Allow: /safetycheck/ User-agent: Yeti Allow: /ajax/pagelet/generic.php/PagePostsSectionPagelet Allow: /safetycheck/ User-agent: * Disallow: / ]

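I would like to parse the file line by line (as it appears when viewing the file in a browser) and run regular expressions on it. Any help would be appreciated.

Thanks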

2 answers:

Answer 0: (score: 1)

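How about doing this a different way and pulling the file down instead? robots.txt is clearly a text file, so we can fetch it as-is rather than trying to scrape it as HTML.

This still uses Jsoup, just slightly differently than before.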

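// Needs: org.jsoup.Jsoup, org.jsoup.Connection, java.io.FileOutputStream
Connection.Response robotsText = Jsoup.connect("https://www.facebook.com/robots.txt").execute();
// try-with-resources closes the stream even if write() throws
try (FileOutputStream fileOutputStream = new FileOutputStream("robots.txt")) {
    // bodyAsBytes() returns the raw response body, so line breaks are preserved
    fileOutputStream.write(robotsText.bodyAsBytes());
}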
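
Once the raw text is in hand (from the saved file, or as a String via `Connection.Response.body()`), the line-by-line regex parsing asked about in the question could look roughly like the sketch below. It uses only the Java standard library; the class name `RobotsLines`, the method `disallowedPaths`, and the sample input are made up for illustration, and the pattern only picks out `Disallow:` rules with a non-empty path.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RobotsLines {
    // Hypothetical pattern: captures the path of a "Disallow:" rule.
    private static final Pattern DISALLOW =
            Pattern.compile("^Disallow:\\s*(\\S+)", Pattern.CASE_INSENSITIVE);

    // Splits the raw robots.txt text into lines and collects Disallow paths.
    static List<String> disallowedPaths(String robotsTxt) {
        List<String> paths = new ArrayList<>();
        // \R matches any line terminator (\n, \r\n, \r), available since Java 8
        for (String line : robotsTxt.split("\\R")) {
            Matcher m = DISALLOW.matcher(line.trim());
            if (m.find()) {
                paths.add(m.group(1));
            }
        }
        return paths;
    }

    public static void main(String[] args) {
        // Stand-in for the text fetched from the server (assumed sample data)
        String sample = "# Notice\nUser-agent: Applebot\nDisallow: /ajax/\nDisallow: /album.php\n";
        for (String path : disallowedPaths(sample)) {
            System.out.println(path);
        }
    }
}
```

The point is to work on the unparsed response text: `body().text()` normalizes whitespace, so by the time you have that string the line breaks are already gone.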

Answer 1: (score: 1)

The unflattened text is available through the TextNode in Jsoup. E.g.:

Document doc = Jsoup.connect("http://www.facebook.com/robots.txt").get();
// getWholeText() returns the node's text without whitespace normalization
String robots = doc.body().textNodes().get(0).getWholeText();