Python HTML parsing: removing excess HTML from get request output

Time: 2018-06-04 16:52:17

Tags: python html parsing web-scraping

I want to write a simple Python script to automate pulling .mov files from an IP camera's SD card. This camera model supports HTTP requests that return HTML containing the .mov file info. My Python script so far:

from bs4 import BeautifulSoup
import requests
page = requests.get("http://192.168.1.99/form/getStorageFileList?type=3")
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())

OUTPUT:

NAME2041=Record_continiously/2018-06-02/8/MP_2018-06-03_00-33-15_60.mov

I want to return only the .mov file name, which means removing:

"NAME2041=Record_continiously/2018-06-02/8/"

I'm new to HTML parsing with Python, so I'm a bit confused about how it works.

Is the returned HTML considered a string? If so, I understand that it is immutable and that I will have to create a new string instead of "stripping away" part of the existing one.

I have tried:

page.replace("NAME2041=Record_continiously/2018-06-02/8/","")

which raises an AttributeError. Is anyone aware of a method that could accomplish this?

Here is a sample of the HTML I am working with...

<html>
<head></head>
<body>
000 Success NUM=2039 NAME0=Record_Continuously/2018-06-04/10/MP_2018-06-04_12-17-38_60.mov SIZE0=15736218 
NAME1=Record_Continuously/2018-06-04/10/MP_2018-06-04_12-16-37_60.mov SIZE1=15683077
NAME2=Record_Continuously/2018-06-04/10/MP_2018-06-04_12-15-36_60.mov SIZE2=15676882
NAME3=Record_Continuously/2018-06-04/10/MP_2018-06-04_12-14-35_60.mov SIZE3=15731539 
</body>
</html>

2 answers:

Answer 0 (score: 0)

Use str.split with negative indexing.

Example:

page = "NAME2041=Record_continiously/2018-06-02/8/MP_2018-06-03_00-33-15_60.mov"
print(page.split("/")[-1])

Output:

MP_2018-06-03_00-33-15_60.mov
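
The same idea can be extended to the whole listing rather than a single entry. The following is only a minimal sketch, assuming the response body looks like the sample in the question (whitespace-separated NAMEn=path and SIZEn=bytes pairs); sample_body and extract_mov_names are hypothetical names introduced here for illustration.

```python
import re
from posixpath import basename

# Hypothetical stand-in for the text of the camera's response body,
# copied from the sample in the question.
sample_body = """000 Success NUM=2039 NAME0=Record_Continuously/2018-06-04/10/MP_2018-06-04_12-17-38_60.mov SIZE0=15736218
NAME1=Record_Continuously/2018-06-04/10/MP_2018-06-04_12-16-37_60.mov SIZE1=15683077
"""


def extract_mov_names(text):
    """Return the bare .mov file name of every NAMEn=... entry in text."""
    # Capture the path that follows each NAMEn= key, up to the next whitespace.
    paths = re.findall(r"NAME\d+=(\S+\.mov)", text)
    # Keep only the last path component, like split("/")[-1] does.
    return [basename(path) for path in paths]


file_names = extract_mov_names(sample_body)
print(file_names)
```

With BeautifulSoup, the text to pass in would be something like soup.get_text(); the regular expression assumes paths never contain spaces.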

Answer 1 (score: 0)

Since you asked for an explanation of your code, here it is:

# import statements
from bs4 import BeautifulSoup  
import requests

page = requests.get("http://192.168.1.99/form/getStorageFileList?type=3")  # returns a Response object
soup = BeautifulSoup(page.content, 'html.parser')  # parses the response body with the html.parser parser

page.content returns the response body as bytes (page.text returns the same body decoded to a str). page itself is a Response object, not a string, which is why page.replace(...) raises an AttributeError.

You pass this content (page.content) to the BeautifulSoup class, which is initialized with two arguments: the content to parse and the name of the parser, here html.parser.

soup is a BeautifulSoup object.

.prettify() is a method that returns the parsed content as a nicely indented string.

Slicing the string at fixed positions can give wrong results when the length of the content varies, so it is better to split the content as suggested by @Rakesh; that is the best approach in your case.
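
To make the type question concrete, here is a small sketch that runs without the camera. It assumes raw_body stands in for page.content (bytes) and body_text for page.text (str); both names, and the shortened one-entry body, are made up for illustration.

```python
# Hypothetical stand-in for page.content: the raw response body as bytes.
raw_body = (
    b"<html><body>NAME0=Record_Continuously/2018-06-04/10/"
    b"MP_2018-06-04_12-17-38_60.mov SIZE0=15736218</body></html>"
)

# Hypothetical stand-in for page.text: the same body decoded to a str.
body_text = raw_body.decode("utf-8")

prefix = "NAME0=Record_Continuously/2018-06-04/10/"

# str is immutable: replace() returns a new string and leaves body_text as it was.
cleaned = body_text.replace(prefix, "")

print(type(raw_body).__name__, type(body_text).__name__)
print(prefix in body_text)
print(prefix in cleaned)
```

So the replace call from the question should work once it is applied to a str such as page.text and its return value is kept, although a hard-coded prefix only matches one entry, which is why splitting is the more general option.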