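Scraping information from a challenging website with no helpful HTML structure

Time: 2016-05-27 04:54:03

Tags: python regex web-scraping beautifulsoup

I need to scrape some information from a very challenging website.

Here is an example: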

<div class="overview">
        <span class="course_titles">Courses:</span>
        <a href="/schools/courses/173/" class="course_name">Math101</a> (Math; Monday; Room 10);
        <a href="/schools/student/1388/" class="coursestudent_name">Mark</a> 17,
        <a href="/schools/student/1401/" class="coursestudent_name">Alex</a> 18, ),

        <a href="/schools/courses/2693/" class="course_name">English101</a> (English; Thursdays; Room 12);
        <a href="/schools/student/1403/" class="coursestudent_name">Sarah</a> 16,
        <a href="/schools/student/1411/" class="coursestudent_name">Nancy</a> 17,
        <a href="/schools/student/1390/" class="coursestudent_name">Casey</a> 17 ),
</div>

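Each course has its own students, and each student's age is given right after their name (the stray characters are already in the page).

I need to get each course together with its own students, plus their ages.

Unfortunately, apart from the all-encompassing div class, there is no inherent hierarchy. I tried scraping by "course_name" with BeautifulSoup and then adding all items that have the "coursestudent_name" class, but that way I end up adding every student on the page to every course.

I wish I could change the website, but I can't. Does anyone know how to get each course's information with the correct students?

Thanks!

3 answers:

Answer 0: (score: 1)

You can use BeautifulSoup and then a small regular expression to get the student ages, which are not inside any HTML tag.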

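import re
from bs4 import BeautifulSoup

# html is the page source as a string (the <div class="overview"> markup from the question)
soup = BeautifulSoup(html, "html.parser")
allA = soup.find("div", {"class": "overview"}).find_all("a")

classInfo = {}
currentClass = None
for item in allA:
    if item['class'] == ['course_name']:
        # A course link starts a new group
        classInfo[item.text] = []
        currentClass = item.text
    else:
        # A student link: the age is the number right after its closing </a>
        age = int(re.search(re.escape(item.text) + r"</a> (\d+)", html).group(1))
        classInfo[currentClass].append((item.text, age))

print(classInfo)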

Output (key order may vary with the Python version):

{'English101': [('Sarah', 16), ('Nancy', 17), ('Casey', 17)], 'Math101': [('Mark', 17), ('Alex', 18)]}
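
One caveat with searching the whole page for a name followed by a number: if two students share a name, the search always finds the first one. A minimal sketch of one way around that, assuming the page is shaped like the sample in the question, is to read the age from the text node that directly follows each link. The function name parse_courses and the two-Alex sample page are made up for this illustration.

```python
import re
from bs4 import BeautifulSoup


def parse_courses(html):
    """Return {course: [(student, age), ...]}, reading each age from the text after its link."""
    soup = BeautifulSoup(html, "html.parser")
    courses = {}
    current = None
    for link in soup.find_all("a"):
        classes = link.get("class") or []
        if "course_name" in classes:
            # A course link starts a new group
            current = link.get_text(strip=True)
            courses[current] = []
        elif "coursestudent_name" in classes and current is not None:
            # next_sibling is the text node that follows </a>, e.g. " 17, "
            match = re.match(r"\s*(\d+)", str(link.next_sibling or ""))
            age = int(match.group(1)) if match else None
            courses[current].append((link.get_text(strip=True), age))
    return courses


# Hypothetical page in which two different students share a first name.
sample = '''<div class="overview">
<a href="/schools/courses/1/" class="course_name">Math101</a> (Math);
<a href="/schools/student/1/" class="coursestudent_name">Alex</a> 18,
<a href="/schools/courses/2/" class="course_name">English101</a> (English);
<a href="/schools/student/2/" class="coursestudent_name">Alex</a> 16 ),
</div>'''

print(parse_courses(sample))
```

Because the age is taken from the link's own sibling, each Alex keeps their own age; a search over the full page source would return 18 for both.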

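Answer 1: (score: 0)

If you can edit your question, please tell us exactly what you are looking for. In the meantime, here is a basic example of how to get the data from this page.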

from bs4 import BeautifulSoup
import re

html = '''<div class="overview">
        <span class="course_titles">Courses:</span> 
        <a href="/schools/courses/173/" class="course_name">Math101</a> (Math; Monday; Room 10);
        <a href="/schools/student/1388/" class="coursestudent_name">Mark</a> 17, 
        <a href="/schools/student/1401/" class="coursestudent_name">Alex</a> 18, ), 

        <a href="/schools/courses/2693/" class="course_name">English101</a> (English; Thursdays; Room 12); 
        <a href="/schools/student/1403/" class="coursestudent_name">Sarah</a> 16, 
        <a href="/schools/student/1411/" class="coursestudent_name">Nancy</a> 17, 
        <a href="/schools/student/1390/" class="coursestudent_name">Casey</a> 17 ), 
</div>'''

soup = BeautifulSoup(html, "html.parser")

all_links = soup.find_all('a')

dict_courseinfo = {}
dict_key = ''
stu_lst = []

for n, link in enumerate(all_links):
    if link.get('class')[0] == 'course_name':
        if n > 0:
            dict_courseinfo[dict_key] = stu_lst
            stu_lst = []
        dict_key = str(link.text)
    else:
        age = int(re.search(re.escape(link.text) + r"</a> (\d+)", html).group(1))
        stu_lst.append((str(link.text), age))

dict_courseinfo[dict_key] = stu_lst

print(dict_courseinfo)

This will output:

{'Math101': [('Mark', 17), ('Alex', 18)], 'English101': [('Sarah', 16), ('Nancy', 17), ('Casey', 17)]}
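
The idea behind this loop is to walk the links in document order and remember the most recent course. As a rough sketch of the same idea without third-party packages, assuming the page looks like the sample in this answer, the standard library's html.parser can do that walk. CourseParser and its attribute names are invented for this illustration.

```python
import re
from html.parser import HTMLParser


class CourseParser(HTMLParser):
    """Collect {course: [(student, age), ...]} while walking the page in document order."""

    def __init__(self):
        super().__init__()
        self.courses = {}
        self.current_course = None   # text of the last course_name link seen
        self.link_kind = None        # class of the <a> we are currently inside
        self.pending_student = None  # student whose age text has not arrived yet

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.link_kind = dict(attrs).get("class")

    def handle_endtag(self, tag):
        if tag == "a":
            self.link_kind = None

    def handle_data(self, data):
        if self.link_kind == "course_name":
            self.current_course = data.strip()
            self.courses[self.current_course] = []
            self.pending_student = None
        elif self.link_kind == "coursestudent_name":
            self.pending_student = data.strip()
        elif self.pending_student and self.current_course is not None:
            # Text right after a student link, e.g. " 17, "
            match = re.match(r"\s*(\d+)", data)
            if match:
                self.courses[self.current_course].append(
                    (self.pending_student, int(match.group(1))))
            self.pending_student = None


page = '''<div class="overview">
        <span class="course_titles">Courses:</span>
        <a href="/schools/courses/173/" class="course_name">Math101</a> (Math; Monday; Room 10);
        <a href="/schools/student/1388/" class="coursestudent_name">Mark</a> 17,
        <a href="/schools/student/1401/" class="coursestudent_name">Alex</a> 18, ),

        <a href="/schools/courses/2693/" class="course_name">English101</a> (English; Thursdays; Room 12);
        <a href="/schools/student/1403/" class="coursestudent_name">Sarah</a> 16,
        <a href="/schools/student/1411/" class="coursestudent_name">Nancy</a> 17,
        <a href="/schools/student/1390/" class="coursestudent_name">Casey</a> 17 ),
</div>'''

parser = CourseParser()
parser.feed(page)
parser.close()
print(parser.courses)
```

The parser keeps only three pieces of state, so students are attached to whichever course link came last before them, which is the grouping the question asks for.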

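Answer 2: (score: 0)

You don't need a regular expression. You can simply parse the anchor tags to get the names and call next_sibling to get the text that follows each one, then split and strip that text to get the age. Finding the course_name anchor that precedes each coursestudent_name anchor also gives you the associated course: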

h = """<div class="overview">
        <span class="course_titles">Courses:</span>
        <a href="/schools/courses/173/" class="course_name">Math101</a> (Math; Monday; Room 10);
        <a href="/schools/student/1388/" class="coursestudent_name">Mark</a> 17,
        <a href="/schools/student/1401/" class="coursestudent_name">Alex</a> 18, ),

        <a href="/schools/courses/2693/" class="course_name">English101</a> (English; Thursdays; Room 12);
        <a href="/schools/student/1403/" class="coursestudent_name">Sarah</a> 16,
        <a href="/schools/student/1411/" class="coursestudent_name">Nancy</a> 17,
        <a href="/schools/student/1390/" class="coursestudent_name">Casey</a> 17 ),
</div>"""

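from bs4 import BeautifulSoup
soup = BeautifulSoup(h, "html.parser")

# For each student link: [preceding course name, student name, age from the text after </a>]
data = [[a.find_previous("a", "course_name").text,
         a.text,
         a.next_sibling.split()[0].strip(",")]
        for a in soup.select("div.overview a.coursestudent_name")]
print(data)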

Output (as printed by Python 2; Python 3 shows the same strings without the u prefix):

[[u'Math101', u'Mark', u'17'], [u'Math101', u'Alex', u'18'], [u'English101', u'Sarah', u'16'], [u'English101', u'Nancy', u'17'], [u'English101', u'Casey', u'17']]
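
This result is a flat list of [course, student, age] rows with the ages still as strings. If a per-course mapping is wanted, as the question asks, the rows can be grouped afterwards. A small sketch, where group_by_course is a name chosen for this example and rows copies the output above:

```python
from collections import defaultdict


def group_by_course(rows):
    """Turn [course, student, age] rows into {course: [(student, int(age)), ...]}."""
    grouped = defaultdict(list)
    for course, student, age in rows:
        grouped[course].append((student, int(age)))
    return dict(grouped)


# Rows in the shape produced by the list comprehension above.
rows = [
    ["Math101", "Mark", "17"],
    ["Math101", "Alex", "18"],
    ["English101", "Sarah", "16"],
    ["English101", "Nancy", "17"],
    ["English101", "Casey", "17"],
]

print(group_by_course(rows))
```

Converting the age with int() here keeps the scraping step simple and makes the final structure match the dictionaries shown in the other answers.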