What is the best way to extract this data?

Time: 2020-01-29 16:49:20

Tags: python web-scraping

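Looking at the site, I did not expect an error, because every local-language (Yoruba) entry has a translation and a meaning, and there are 220 Yoruba entries in total.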

from bs4 import BeautifulSoup
import requests
import pandas as pd
import re

res = requests.get('http://yoruba.unl.edu/yoruba.php-text=1a&view=0&uni=0&l=1.htm')
soup = BeautifulSoup(res.content,'html.parser')

edu = {'Yoruba':[],'Translation':[],'Meaning':[]}
# First loop: text before the first <br> in each <p> (the Yoruba phrase)
for br in soup.select('p > br:nth-of-type(1)'):
    text = br.previous_sibling.strip()
    edu['Yoruba'].append(text)
# Second loop: text before the second <br> (the translation); non-text siblings are skipped
for br in soup.select('p > br:nth-of-type(2)'):
    text = br.previous_sibling
    if isinstance(text, str):
        edu['Translation'].append(text.strip())
# Third loop: text before the third <br> (the meaning), with parentheses removed
for br in soup.select('p > br:nth-of-type(3)'):
    text = br.previous_sibling
    if isinstance(text, str):
        edu['Meaning'].append(re.sub(r'[\(\)]','',str(text.strip())))

df7 = pd.DataFrame(edu)

Error

ValueError: arrays must all be same length
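pandas raises this error when the lists in the dictionary do not all have the same number of items. In the code above, the second and third loops skip any entry whose preceding sibling is not a string, so those lists can end up shorter than the first. A minimal sketch of how to check this before building the DataFrame; the list contents here are made-up placeholders, not data from the site:

```python
# Made-up placeholder lists standing in for what the three loops might collect.
edu = {
    'Yoruba': ['y1', 'y2', 'y3'],
    'Translation': ['t1', 't3'],       # one entry was skipped
    'Meaning': ['m1', 'm2', 'm3'],
}

# Count the items collected for each column.
lengths = {key: len(values) for key, values in edu.items()}
print(lengths)

# True when at least one column has a different length from the others.
mismatched = len(set(lengths.values())) > 1
print(mismatched)
```

If the counts differ, pd.DataFrame(edu) will fail with the error shown above.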

1 Answer:

Answer 0 (score: 0)

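Since the lists under the three keys have different lengths, I think the best way to solve this is to pad the shorter lists to the length of the longest one (220 in this case). To do that, add the following right before creating the data frame: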

# In case you don't know it, find the length of the longest list
length = max(len(values) for values in edu.values())
for k in edu:
    # This is where the padding happens; you can replace "NA" with anything else
    edu[k].extend(["NA"] * (length - len(edu[k])))

df7 = pd.DataFrame.from_dict(edu)  # edu is a dictionary, so from_dict fits here
df7

Let me know if this works.
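One caveat: padding at the end makes the lengths match, but every value collected after a skipped entry moves up a row, so translations and meanings may no longer line up with their Yoruba phrase. A possible alternative is to build one row per <p> and leave missing pieces as None. The sketch below assumes each entry is a <p> whose pieces are separated by direct-child <br> tags; the SAMPLE_HTML markup, the COLUMNS tuple, and the parse_entries helper are hypothetical stand-ins, not taken from the real page:

```python
import re
from bs4 import BeautifulSoup, NavigableString

# Hypothetical markup with placeholder text; the real page may be structured differently.
SAMPLE_HTML = """
<p>yoruba-1<br>translation-1<br>(meaning-1)<br></p>
<p>yoruba-2<br><br>(meaning-2)<br></p>
<p>yoruba-3<br>translation-3<br></p>
"""

COLUMNS = ('Yoruba', 'Translation', 'Meaning')

def parse_entries(html):
    """Return one dict per <p>, with None for any piece that is missing."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for p in soup.find_all('p'):
        row = dict.fromkeys(COLUMNS)  # every column starts as None
        breaks = p.find_all('br', recursive=False)
        # The n-th <br> marks the end of the n-th column's text.
        for column, br in zip(COLUMNS, breaks):
            node = br.previous_sibling
            if isinstance(node, NavigableString) and node.strip():
                row[column] = node.strip()
        if row['Meaning']:
            # Drop the surrounding parentheses, as in the question's code.
            row['Meaning'] = re.sub(r'[()]', '', row['Meaning'])
        rows.append(row)
    return rows

rows = parse_entries(SAMPLE_HTML)
for row in rows:
    print(row)
```

Passing rows to pd.DataFrame(rows) would then give one row per entry, with the gaps staying in the row where they occurred.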