I am trying to extract the text headings listed in this Army field manual. I first converted it to an HTML file using Adobe Acrobat:
http://usacac.army.mil/sites/default/files/misc/doctrine/CDG/cdg_resources/manuals/fm/fm7_15.pdf
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
url = 'C:/Users/.../fm7_15.html'
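with open(url, "r") as ur:
    html = ur.read()
# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(html, "html.parser")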
headers_30 = soup.find_all("p", attrs={"class": "s30"})
headers_33 = soup.find_all("p", attrs={"class": "s33"})
headers_20 = soup.find_all("p", attrs={"class": "s20"})
df30 = pd.DataFrame(headers_30,columns=["column"])
df30.to_csv('headers_30.csv', index=False)
df33 = pd.DataFrame(headers_33,columns=["column"])
df33.to_csv('headers_33.csv', index=False)
df20 = pd.DataFrame(headers_20,columns=["column"])
df20.to_csv('headers_20.csv', index=False)
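Three classes make up the different headings (s30, s33, s20). I managed to save them as CSV, but the problem is that the output also includes all the associated HTML tags. What is the best way to extract only the heading text?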
Answer 0 (score: 2)
You can use a list comprehension to extract the text from each element:
headers_30 = [i.text for i in soup.find_all("p", {"class":"s30"})]
headers_33 = [i.text for i in soup.find_all("p", {"class":"s33"})]
headers_20 = [i.text for i in soup.find_all("p", {"class":"s20"})]
instead of:
headers_30 = soup.find_all("p", attrs={"class":"s30"})
headers_33 = soup.find_all("p", attrs={"class":"s33"})
headers_20 = soup.find_all("p", attrs={"class":"s20"})
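As a follow-up, here is a minimal sketch of the same idea that collects all three classes in one pass and keeps the headings in document order. It assumes the exported HTML looks roughly like the inline sample; `SAMPLE_HTML`, `HEADER_CLASSES`, and `extract_headers` are hypothetical names made up for illustration, not part of the original code. It uses `get_text(" ", strip=True)` rather than `.text` so that leading and trailing whitespace is trimmed and text from nested tags is joined with a single space.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the Acrobat-exported HTML.
SAMPLE_HTML = """
<html><body>
<p class="s30">Chapter 1 <b>Introduction</b></p>
<p class="s33">  Section I  </p>
<p class="s20">ART 1.0</p>
<p class="s30">Chapter 2</p>
<p class="other">Body text</p>
</body></html>
"""

# The three heading classes named in the question.
HEADER_CLASSES = ["s30", "s33", "s20"]


def extract_headers(html, classes=HEADER_CLASSES):
    """Return one dict per heading <p>, in document order, with tags stripped."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # Passing a list to class_ matches elements carrying any of the listed classes.
    for p in soup.find_all("p", class_=list(classes)):
        # Record which of the heading classes this element matched.
        css_class = next(c for c in p.get("class", []) if c in classes)
        # get_text drops the markup; strip=True trims each text fragment.
        rows.append({"class": css_class, "text": p.get_text(" ", strip=True)})
    return rows


rows = extract_headers(SAMPLE_HTML)
for row in rows:
    print(row["class"], row["text"])
```

If a CSV is still wanted, the resulting list of dicts can be passed to `pd.DataFrame(rows)` and written with `to_csv(..., index=False)`, which should give one `class` column and one `text` column instead of three separate files.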