I am trying to add information scraped from a website to new columns in a dataset. I have a dataset that looks like this:
COL1 COL2 COL3
... ... bbc.co.uk
and I would like to end up with a dataset containing the new columns:
COL1 COL2 COL3 Website Address Last Analysis Blacklist Status \
... ... bbc.co.uk
IP Address Server Location City Region
The new columns come from this website: https://www.urlvoid.com/scan/bbc.co.uk. I need to fill each column with the relevant information.
For example:
COL1 COL2 COL3 Website Address Last Analysis Blacklist Status \
... ... bbc.co.uk Bbc.co.uk 9 days ago 0/35
Domain Registration IP Address Server Location City Region
1996-08-01 | 24 years ago 151.101.64.81 (US) United States Unknown Unknown
Unfortunately, I am having some problems creating the new columns and filling them with the information scraped from the website. I will probably need to check more websites, not only bbc.co.uk. Please see the code I used below. I am sure there is a better (less messy) way to do this. I would really appreciate any help with this. Thanks.
EDIT:
As shown in the example above, to the existing dataset of three columns (col1, col2 and col3) I should add the fields coming from the scraping (Website Address, Last Analysis, Blacklist Status, ...). For each URL, I should then have the information related to it (e.g. bbc.co.uk in the example).
COL1 COL2 COL3 Website Address Last Analysis Blacklist Status \
... ... bbc.co.uk Bbc.co.uk 9 days ago 0/35
... ... stackoverflow.com
... ... ...
IP Address Server Location City Region
COL1 COL2 COL3 Website Address Last Analysis Blacklist Status \
... ... bbc.co.uk Bbc.co.uk 9 days ago 0/35
... ... stackoverflow.com Stackoverflow.com 7 days ago 0/35
Domain Registration IP Address Server Location ...
1996-08-01 | 24 years ago 151.101.64.81 (US) United States ...
2003-12-26 | 17 years ago ...
(The formatting is poor, but I think it is enough to give you an idea of the expected output.)
Updated code:
import requests
from bs4 import BeautifulSoup

urls = ['bbc.co.uk', 'stackoverflow.com', ...]
for x in urls:
    print(x)
    r = requests.get('https://www.urlvoid.com/scan/' + x)
    soup = BeautifulSoup(r.content, 'lxml')
    # first matching table holds the scan report
    tab = soup.select("table.table.table-custom.table-striped")
    dat = tab[0].select('tr')
    for d in dat:
        row = d.select('td')
        original_dataset[row[0].text] = row[1].text
Unfortunately, I am doing something wrong, as it only copies the information of the first URL checked on the website (i.e. bbc.co.uk) into every row under the new columns.
Answer 0 (score: 0)
You can use a much simpler approach to get the data: pandas' read_html method. Here is my shot at it:
import pandas as pd

# read_html returns every table on the page; the first one is the scan report
df = pd.read_html("https://www.urlvoid.com/scan/bbc.co.uk/")[0]
df_transpose = df.T
Now you have the transposed data you need. You can drop any columns you do not want. After that, all you have to do is combine it with your existing dataset. Assuming you can load your dataset as a pandas dataframe, you can simply use the concat function (axis=1 concatenates as columns):
pd.concat([df_transpose, existing_dataset], axis=1)
For more on merging/concatenating, see the pandas documentation: http://pandas.pydata.org/pandas-docs/stable/merging.html
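If you need the same approach for several URLs, one option is to loop over the domains, stack the transposed tables, and concatenate once at the end. Below is a minimal sketch, not a tested implementation: it assumes your dataset is already loaded as a DataFrame named existing_dataset with the domains in COL3 (both names taken from the question), and that the first table on each page parses as two unlabeled columns (label, value), as the df.T step above suggests.

import pandas as pd

scraped_rows = []
for domain in existing_dataset['COL3']:
    # read_html returns every table on the page; the first one is assumed
    # to be the scan report, a two-column (label, value) table
    report = pd.read_html('https://www.urlvoid.com/scan/' + domain + '/')[0]
    # turn the labels into column names and the values into a single row
    scraped_rows.append(report.set_index(0).T.reset_index(drop=True))

new_columns = pd.concat(scraped_rows, ignore_index=True)
result = pd.concat([existing_dataset.reset_index(drop=True), new_columns], axis=1)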
Answer 1 (score: 0)
Let me know if this is what you are looking for:
import pandas as pd

cols = ['Col1', 'Col2']
rows = ['something', 'something else']
my_df = pd.DataFrame(rows, index=cols).transpose()
my_df
Pick up your existing code from this line:
dat = tab[0].select('tr')
and add:
for d in dat:
    row = d.select('td')
    my_df[row[0].text] = row[1].text
my_df
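(Since my_df has only one row here, each assignment fills a single cell. In your original code the same assignment broadcasts one scalar across every row of the multi-row original_dataset, and each pass through the URL loop overwrites the same columns, which is why all rows ended up showing one site's values.)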
Output (sorry about the formatting):
Col1 Col2 Website Address Last Analysis Blacklist Status Domain Registration Domain Information IP Address Reverse DNS ASN Server Location Latitude\Longitude City Region
0 something something else Bbc.com 11 days ago | Rescan 0/35 1989-07-15 | 31 years ago WHOIS Lookup | DNS Records | Ping 151.101.192.81 Find Websites | IPVoid | ... Unknown AS54113 FASTLY (US) United States 37.751 / -97.822 Google Map Unknown Unknown
EDIT:
To handle multiple URLs, try the following:
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs

urls = ['bbc.com', 'stackoverflow.com']

# fetch each scan page first
ares = []
for u in urls:
    url = 'https://www.urlvoid.com/scan/' + u
    r = requests.get(url)
    ares.append(r)

# parse each response into one list of values, collecting the headers once
rows = []
cols = []
for ar in ares:
    soup = bs(ar.content, 'lxml')
    tab = soup.select("table.table.table-custom.table-striped")
    dat = tab[0].select('tr')
    line = []
    for d in dat:
        row = d.select('td')
        line.append(row[1].text)
        new_header = row[0].text
        if new_header not in cols:
            cols.append(new_header)
    rows.append(line)

my_df = pd.DataFrame(rows, columns=cols)
my_df
Output:
Website Address Last Analysis Blacklist Status Domain Registration Domain Information IP Address Reverse DNS ASN Server Location Latitude\Longitude City Region
0 Bbc.com 12 days ago | Rescan 0/35 1989-07-15 | 31 years ago WHOIS Lookup | DNS Records | Ping 151.101.192.81 Find Websites | IPVoid | ... Unknown AS54113 FASTLY (US) United States 37.751 / -97.822 Google Map Unknown Unknown
1 Stackoverflow.com 5 minutes ago | Rescan 0/35 2003-12-26 | 17 years ago WHOIS Lookup | DNS Records | Ping 151.101.1.69 Find Websites | IPVoid | Whois Unknown AS54113 FASTLY (US) United States 37.751 / -97.822 Google Map Unknown Unknown
Note that this does not include your two existing columns (since I do not know what they are), so you will have to append them to the dataframe yourself.
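If your existing columns live in a dataframe whose rows are in the same order as urls, a side-by-side concat finishes the job. A minimal sketch, reusing the original_dataset name from the question's code (it is not defined in this answer):

# attach the scraped columns next to the existing ones; both frames must
# share the same row order (one row per URL)
final_df = pd.concat(
    [original_dataset.reset_index(drop=True), my_df.reset_index(drop=True)],
    axis=1,
)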