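Looking at the site, I would not expect to see an error, because every native-language (Yoruba) entry has a meaning and a translation, and there are 220 native-language (Yoruba) entries.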
from bs4 import BeautifulSoup
import requests
import pandas as pd
import re
res = requests.get('http://yoruba.unl.edu/yoruba.php-text=1a&view=0&uni=0&l=1.htm')
soup = BeautifulSoup(res.content,'html.parser')
edu = {'Yoruba':[],'Translation':[],'Meaning':[]}
# first loop
for br in soup.select('p > br:nth-of-type(1)'):
    text = br.previous_sibling.strip()
    edu['Yoruba'].append(text)
# second loop
for br in soup.select('p > br:nth-of-type(2)'):
    text = br.previous_sibling
    if isinstance(text, str):
        edu['Translation'].append(text.strip())
# third loop
for br in soup.select('p > br:nth-of-type(3)'):
    text = br.previous_sibling
    if isinstance(text, str):
        edu['Meaning'].append(re.sub(r'[\(\)]','',str(text.strip())))
df7 = pd.DataFrame(edu)
Error
ValueError: arrays must all be same length
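pandas raises this error when the lists in the dict do not all have the same number of items. In the code above, the second and third loops skip any `br` whose previous sibling is not a string, which is one way a list can come up short. A quick check is to look at the length of each list before building the DataFrame. Here is a minimal sketch; the sample `edu` values and the `column_lengths` helper are made up for illustration and stand in for the scraped data:

```python
# Hypothetical stand-in for the scraped dict; the real one is filled by the loops above.
edu = {
    'Yoruba': ['a', 'b', 'c'],
    'Translation': ['x', 'y', 'z'],
    'Meaning': ['m1', 'm2'],  # one entry short, as if a loop skipped a value
}

def column_lengths(columns):
    """Return the number of items collected under each key."""
    return {key: len(values) for key, values in columns.items()}

lengths = column_lengths(edu)
print(lengths)

# More than one distinct length means the columns are not aligned.
mismatched = len(set(lengths.values())) > 1
print(mismatched)
```

If `mismatched` is true, the key with the smallest count shows which loop dropped entries.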
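Answer 0 (score: 0)
Since the lists under the three keys have different lengths, I think the best way to solve this is to pad the shorter lists to the length of the longest one (220 in this case). To do that, add the following right before creating the DataFrame: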
length = max(len(edu['Meaning']),len(edu['Translation']),len(edu['Yoruba'])) # in case you don't know it, find the length of the longest list
for k in edu:
    for i in range(length-len(edu[k])):
        edu[k].append("NA") # this is where the padding happens; you can replace "NA" with anything else, obviously
df7 = pd.DataFrame.from_dict(edu) # since edu is a dictionary, I would use this method
df7
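One caveat with padding at the end: if the missing value belongs to an entry in the middle of the page, every later value in that column moves up one row and no longer lines up with its Yoruba text. A possible alternative, assuming each entry can be collected as a single record, is to build the columns row by row and fill gaps where they occur. The `records` data and the `to_columns` helper below are hypothetical and only illustrate the idea; they are not taken from the site:

```python
# Hypothetical records: each tuple is (yoruba, translation, meaning); None marks a missing field.
records = [
    ('a', 'x', 'm1'),
    ('b', 'y', None),   # meaning missing for the second entry
    ('c', 'z', 'm3'),
]

def to_columns(rows, keys=('Yoruba', 'Translation', 'Meaning'), fill='NA'):
    """Turn row records into equal-length columns, filling gaps in place."""
    columns = {key: [] for key in keys}
    for row in rows:
        for key, value in zip(keys, row):
            columns[key].append(fill if value is None else value)
    return columns

edu_aligned = to_columns(records)
print(edu_aligned['Meaning'])  # ['m1', 'NA', 'm3']
```

Because every list ends up with one item per record, the resulting dict can be passed to `pd.DataFrame` without any further padding.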
Let me know if that works.