Question

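Looking at the site, I don't think I should be getting an error, because every Yoruba entry has a translation and a meaning, and there are 220 Yoruba entries.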

from bs4 import BeautifulSoup
import requests
import pandas as pd
import re

res = requests.get('http://yoruba.unl.edu/yoruba.php-text=1a&view=0&uni=0&l=1.htm')
soup = BeautifulSoup(res.content,'html.parser')

edu = {'Yoruba':[],'Translation':[],'Meaning':[]}
# first loop: text before the first <br> in each <p> (the Yoruba entry)
for br in soup.select('p > br:nth-of-type(1)'):
    text = br.previous_sibling.strip()
    edu['Yoruba'].append(text)
# second loop: text before the second <br> (the translation)
for br in soup.select('p > br:nth-of-type(2)'):
    text = br.previous_sibling
    if isinstance(text, str):
        edu['Translation'].append(text.strip())
# third loop: text before the third <br> (the meaning, with parentheses removed)
for br in soup.select('p > br:nth-of-type(3)'):
    text = br.previous_sibling
    if isinstance(text, str):
        edu['Meaning'].append(re.sub(r'[\(\)]','',str(text.strip())))

df7 = pd.DataFrame(edu)

Error

ValueError: arrays must all be same length
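
A quick way to confirm what the error is complaining about is to compare the list lengths before building the DataFrame. The sketch below uses a small made-up `edu` dict (the values are placeholders, not data from the site) in which one entry has no meaning:

```python
# Hypothetical stand-in for the scraped dict: the second entry has no Meaning
edu = {
    "Yoruba": ["proverb 1", "proverb 2", "proverb 3"],
    "Translation": ["translation 1", "translation 2", "translation 3"],
    "Meaning": ["meaning 1", "meaning 3"],
}

# Length of each column; pd.DataFrame(edu) needs these to be equal
lengths = {key: len(values) for key, values in edu.items()}
print(lengths)

# True when at least one column is shorter than the others
mismatched = len(set(lengths.values())) > 1
print("columns differ in length:", mismatched)
```

If the lengths differ on the real data, at least one of the loops skipped an entry or found no matching `<br>`.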

Answer 1

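Since the three keys have lists of different lengths, I think the best way to solve this is to pad the shorter lists to the length of the longest one (220 in this case). To do that, add the following lines before creating the DataFrame: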

# Find the length of the longest list, in case you don't know it in advance
length = max(len(edu['Meaning']), len(edu['Translation']), len(edu['Yoruba']))
for k in edu:
    for i in range(length - len(edu[k])):
        edu[k].append("NA")  # this is the padding; you can replace "NA" with anything else

df7 = pd.DataFrame.from_dict(edu)  # since edu is a dictionary, I would use this method
df7
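
One thing to keep in mind with padding at the end: if a value is missing from the middle of the page, everything after it shifts up by one row. A minimal sketch of that effect, using a hypothetical helper `pad_columns` and made-up placeholder values (not data from the site):

```python
def pad_columns(columns, fill="NA"):
    """Return a new dict with every list padded at the end to the longest length."""
    longest = max((len(values) for values in columns.values()), default=0)
    return {
        key: list(values) + [fill] * (longest - len(values))
        for key, values in columns.items()
    }

# Hypothetical data: the second entry has no Meaning
edu = {
    "Yoruba": ["proverb 1", "proverb 2", "proverb 3"],
    "Translation": ["translation 1", "translation 2", "translation 3"],
    "Meaning": ["meaning 1", "meaning 3"],
}

padded = pad_columns(edu)
for row in zip(padded["Yoruba"], padded["Translation"], padded["Meaning"]):
    print(row)
```

Here "proverb 2" ends up next to "meaning 3", so the DataFrame builds without an error but the rows no longer line up.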

Let me know if it works.
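
If the rows need to stay aligned, an alternative to padding is to build one row per `<p>` instead of filling three separate lists. This is only a sketch: `SAMPLE_HTML` and `paragraph_rows` are made up for illustration, and it assumes each entry is a `<p>` whose text pieces are separated by `<br>` tags, which I have not checked against the real page.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical markup; the real page may be structured differently
SAMPLE_HTML = """
<p>Proverb one<br>Translation one<br>(Meaning one)<br></p>
<p>Proverb two<br>Translation two<br></p>
"""

def paragraph_rows(html):
    """Build one dict per <p>, using None for any missing field."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for p in soup.find_all("p"):
        # Keep only the non-empty text nodes that are direct children of <p>
        parts = [str(c).strip() for c in p.children if isinstance(c, NavigableString)]
        parts = [text for text in parts if text]
        # Pad within the row, so a missing field cannot shift later rows
        parts += [None] * (3 - len(parts))
        yoruba, translation, meaning = parts[:3]
        if meaning is not None:
            meaning = meaning.strip("()")  # drop the surrounding parentheses
        rows.append({"Yoruba": yoruba, "Translation": translation, "Meaning": meaning})
    return rows

rows = paragraph_rows(SAMPLE_HTML)
for row in rows:
    print(row)
```

The resulting list of dicts can be passed straight to `pd.DataFrame(rows)`, and a missing field shows up as an empty cell in its own row.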
