Get table data using only a class, not all of the table data

Date: 2020-04-16 03:18:49

Tags: python web-scraping beautifulsoup

I am trying to scrape a website and loop through it to get only the state names, rather than every class in the table data. Is there a way to exclude a td class while I iterate over all of the table data?


3 answers:

Answer 0 (score: 0)

You have to use the strip() function, like this:

import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.theguardian.com/world/ng-interactive/2020/apr/13/coronavirus-map-us-latest-covid-19-cases-state-by-state")
soup = BeautifulSoup(page.content, 'html.parser')
state_table = soup.find(id='co-table-container')
item_cases = state_table.find(class_='co-table')
data = []

for tr in item_cases.find_all("tr"):
    _class = tr.get("class")
    myData = {}

    # skip the table body header row
    if _class is not None and "thead" == _class[0]:
        continue

    # scrape the cell details
    for td in tr.find_all("td"):
        myData[td['data-stat']] = td.text.strip()

        """ add an if condition like this if you want to exclude anything specific

            # inspect the table row columns to find the td['data-stat'] values
            if td['data-stat'] == "name":
                myData['name'] = td.text.strip()

        """

    data.append(myData)

print(data)
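Since the question is specifically about excluding a td class, here is a minimal sketch of the same loop that simply skips any cell carrying a given class. The class name "co-cases" is only a placeholder for illustration; inspect the page to find the actual class you want to drop.

import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.theguardian.com/world/ng-interactive/2020/apr/13/coronavirus-map-us-latest-covid-19-cases-state-by-state")
soup = BeautifulSoup(page.content, 'html.parser')

table = soup.find(class_='co-table')
rows = []
for tr in table.find_all("tr"):
    row = []
    for td in tr.find_all("td"):
        # td.get("class") returns a list of classes (or None);
        # "co-cases" is a hypothetical class name -- replace it with the one you see when inspecting the page
        if td.get("class") and "co-cases" in td.get("class"):
            continue
        row.append(td.text.strip())
    if row:
        rows.append(row)

print(rows)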

Answer 1 (score: 0)

This works:

import requests
from bs4 import BeautifulSoup

response = requests.get('https://www.theguardian.com/world/ng-interactive/2020/apr/13/coronavirus-map-us-latest-covid-19-cases-state-by-state')

soup = BeautifulSoup(response.text, 'html.parser')

# get all rows from the table with class: co-table
rows = soup.find('table', {'class': 'co-table'}).findAll('tr')

# loop through each row to get the state name
# note that we skip the first row; we don't want the heading
for i in range(1, len(rows)):
    print(rows[i].td.text)
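An equivalent approach, assuming the same co-table class, is to let a CSS selector pick only the first cell of each row; BeautifulSoup's select() supports :first-child via soupsieve. This is a sketch rather than something verified against the live page, and if the heading row is built from <td> cells instead of <th>, it would still need to be filtered out as in the loops above.

import requests
from bs4 import BeautifulSoup

response = requests.get('https://www.theguardian.com/world/ng-interactive/2020/apr/13/coronavirus-map-us-latest-covid-19-cases-state-by-state')
soup = BeautifulSoup(response.text, 'html.parser')

# select only the first <td> of every row in the co-table;
# rows whose cells are <th> elements are not matched by this selector
for cell in soup.select('table.co-table tr td:first-child'):
    print(cell.text.strip())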

Answer 2 (score: 0)

You can just use the pandas function read_html(). Under the hood it uses BeautifulSoup to parse the <table> tags in the HTML.

import pandas as pd

url = 'https://www.theguardian.com/world/ng-interactive/2020/apr/13/coronavirus-map-us-latest-covid-19-cases-state-by-state'
df = pd.read_html(url)[0].dropna(axis=1)

print (list(df['State/territory']))

Output:

['New York', 'New Jersey', 'Massachusetts', 'Michigan', 'California', 'Pennsylvania', 'Illinois', 'Florida', 'Louisiana', 'Texas', 'Georgia', 'Connecticut', 'Washington', 'Maryland', 'Indiana', 'Colorado', 'Ohio', 'Virginia', 'Tennessee', 'North Carolina', 'Missouri', 'Alabama', 'Arizona', 'Wisconsin', 'South Carolina', 'Rhode Island', 'Mississippi', 'Nevada', 'Utah', 'Kentucky', 'Oklahoma', 'District of Columbia', 'Delaware', 'Iowa', 'Minnesota', 'Oregon', 'Arkansas', 'Kansas', 'Idaho', 'New Mexico', 'South Dakota', 'New Hampshire', 'Puerto Rico', 'Nebraska', 'Maine', 'Vermont', 'West Virginia', 'Hawaii', 'Montana', 'North Dakota', 'Alaska', 'Wyoming', 'Guam', 'Virgin Islands']
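As a side note, read_html returns a list of DataFrames, one per table found on the page. If the page ever carried more than one table, its match argument (a string or regex that the table's text must contain) narrows the result instead of relying on the [0] index alone; a small sketch, assuming the same 'State/territory' column header as above:

import pandas as pd

url = 'https://www.theguardian.com/world/ng-interactive/2020/apr/13/coronavirus-map-us-latest-covid-19-cases-state-by-state'

# keep only tables whose text contains 'State/territory', then take the first match
df = pd.read_html(url, match='State/territory')[0].dropna(axis=1)
print(df['State/territory'].tolist())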