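I am trying to scrape data from a zone-h.org page. First, I got past the page's captcha error by adding cookies to my script. Then I scraped the table with BeautifulSoup and stored it. However, one of the columns contains no plain text; its information appears only inside quotes ("...").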
from bs4 import BeautifulSoup
import requests
import pandas as pd
url = "http://www.zone-h.org/archive/filter=1/published=0/domain=twitter/fulltext=1/page=1?"
cookie = {'PHPSESSID': 'XXXXXXXXXXX','ZHE':'XXXXXXXXXXXX'}
response = requests.post(url, cookies=cookie)
print(response)
data = response.text
soup = BeautifulSoup(data,'lxml')
table = soup.find_all('table')[0]
df = pd.read_html(str(table))[0]
df_domain = pd.DataFrame(df)
df_domain.head()
How can I get the data from the L (location) column? The source of that column is:
<td><img src="/images/cflags/png/us.png" alt="United States" title="United States"></td>
How would you suggest getting the data from the title attribute (United States)?
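Answer 0 (score: 1)
To get the data from that column, you have to iterate over the table row by row and read the value from the title= attribute of the <img> tag: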
from bs4 import BeautifulSoup
from textwrap import shorten
import requests

url = "http://www.zone-h.org/archive/filter=1/published=0/domain=twitter/fulltext=1/page=1?"
cookie = {'PHPSESSID': 'XXX','ZHE':'XXX'}
response = requests.post(url, cookies=cookie)
data = response.text
soup = BeautifulSoup(data,'lxml')

rows = []
# Skip the last two <tr> elements, which are not data rows
for tr in soup.select('tr')[:-2]:
    row = []
    for td in tr.select('td'):
        if td.text.strip():
            # The cell has plain text: use it
            row.append(td.text.strip())
        else:
            # No text: fall back to the title of an <img>, if there is one
            img = td.select_one('img[title]')
            if img:
                row.append(img['title'])
            else:
                row.append('')
    rows.append(row)

# Print the header row, then the data rows, in 20-character columns
print(''.join('{: <20}'.format(d) for d in rows[0]))
for row in rows[1:]:
    print(''.join('{: <20}'.format(shorten(d, 20)) for d in row))
The final table ends up in the variable rows, which you can load into pandas.
The code above prints it to the screen like this:
Time        Notifier              H  M  R  L              Domain              OS       View
2019/02/11  RxR                                           [...]               Linux    mirror
2019/01/24  Al1ne3737             H        United States  twitterlike.com.br  Linux    mirror
2019/01/23  Psycho Crew           H        Cyprus         [...]               Unknown  mirror
2018/08/11  Iran is winner trump        R  United States  [...]               Unknown  mirror
2018/05/08  İllegalHackerz                 Turkey         [...]               MacOSX   mirror
2018/01/04  BIGM4N                      R  United States  [...]               Unknown  mirror
2017/09/27  Mr.str3_at                     United States  [...]               Linux    mirror
2017/08/02  SkullZ                      R  United States  [...]               Unknown  mirror
2017/07/21  Dex hacker            H        United States  cdn- [...]          Linux    mirror
2017/07/21  Dex hacker            H        United States  [...]               Linux    mirror
2017/07/08  KusterAttacker        H        France         [...]               Linux    mirror
2017/05/22  GeNErAL                        United States  [...]               Linux    mirror
2017/02/11  SA3D HaCk3D                    United States  [...]               Linux    mirror
2017/02/07  SA3D HaCk3D                    United States  [...]               Linux    mirror
2017/02/06  SA3D HaCk3D                    Netherlands    [...]               Linux    mirror
2017/02/06  Imam                           Netherlands    [...]               Linux    mirror
2017/02/06  BALA SNIPER                    United States  [...]               Linux    mirror
2016/11/12  jrb                   H        Indonesia      twitter.co.mz       Linux    mirror
2016/09/22  ..:<h1>:..            H        United States  twitter.net         Linux    mirror
2016/07/03  ByNemesis             H        Turkey         twitterdukkan.com   Linux    mirror
2016/07/01  Hmei7                 H     R  United States  twitter.com         Unknown  mirror
2016/06/08  WoKaBoYa              H        United States  [...]               Linux    mirror
2016/01/02  Akıncılar                      Turkey         [...]               Linux    mirror
2015/12/27  ScAmA                 H        United States  www.twitter-ar.com  Linux    mirror
2015/09/24  Hacked By [...]       H        United States  ikilltwitter.com    Win 2003 mirror
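Loading rows into pandas, as mentioned above, could look roughly like the sketch below. It assumes the first entry of rows is the header and that every data row has the same number of cells as the header; the small rows list here is a hypothetical stand-in built from two lines of the printed output, not the real scrape result.

```python
import pandas as pd

# Hypothetical stand-in for the scraped `rows` list: one header row
# followed by data rows (values copied from the printed output).
rows = [
    ['Time', 'Notifier', 'H', 'M', 'R', 'L', 'Domain', 'OS', 'View'],
    ['2019/01/24', 'Al1ne3737', 'H', '', '', 'United States',
     'twitterlike.com.br', 'Linux', 'mirror'],
    ['2016/07/01', 'Hmei7', 'H', '', 'R', 'United States',
     'twitter.com', 'Unknown', 'mirror'],
]

# Split off the header and use it as the column names.
header, *body = rows
df = pd.DataFrame(body, columns=header)

print(df[['Time', 'Notifier', 'L']])
```

If the real page yields rows of differing lengths, the DataFrame constructor will raise an error, so it may be worth checking the row lengths first.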
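Answer 1 (score: 0)
If the values in the "L" column of the DataFrame are empty, or cannot be used to retrieve the Location values you need, I would parse the whole table with BeautifulSoup, iterate over the <tr> and <td> elements, and build the DataFrame from scratch.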
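One way this idea might be sketched is shown below. The HTML fragment is a hypothetical, cut-down stand-in for the real table (the domain example.invalid is made up), cell_value is an illustrative helper name, and html.parser is used only so the sketch has no lxml dependency.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical fragment shaped like the table in the question.
html = """
<table>
  <tr><td>Time</td><td>Notifier</td><td>L</td><td>Domain</td></tr>
  <tr><td>2019/01/24</td><td>Al1ne3737</td>
      <td><img src="/images/cflags/png/us.png" alt="United States" title="United States"></td>
      <td>twitterlike.com.br</td></tr>
  <tr><td>2019/02/11</td><td>RxR</td><td></td><td>example.invalid</td></tr>
</table>
"""

def cell_value(td):
    """Return the cell's text, or the title of its <img> if it has no text."""
    text = td.get_text(strip=True)
    if text:
        return text
    img = td.find('img', title=True)  # first <img> that has a title attribute
    return img['title'] if img else ''

soup = BeautifulSoup(html, 'html.parser')

# One list per <tr>, one value per <td>.
records = [[cell_value(td) for td in tr.find_all('td')]
           for tr in soup.find_all('tr')]

# First record is the header; the rest are data rows.
df = pd.DataFrame(records[1:], columns=records[0])
print(df)
```

Cells with neither text nor a titled image end up as empty strings, so the L column stays aligned with the other columns even when a row has no flag.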