Extracting part of an href from HTML

Date: 2018-04-03 03:24:08

Tags: python pandas

I am looking for a way to get part of an href into a pandas DataFrame.

</tbody>
  <tr class="rgRow" id="LeaderBoard1_dg1_ctl00__0">
      <td class="grid_line_regular" align="right">1</td>
      <td class="grid_line_regular">
          <a href="statss.aspx?playerid=11205&amp;position=OF">Adam Eaton</a>
      </td>
      <td class="grid_line_regular">
          <a href="leaders.aspx?pos=all&amp;stats=bat&amp;lg=all&amp;qual=0&amp;type=8&amp;season=2018&amp;month=0&amp;season1=2018&amp;ind=0&amp;team=24&amp;rost=0&amp;age=0">Nationals</a>
      </td>

Can anyone help me extract JUST the numeric characters that follow "playerid"? I managed to scrape the data from the site, but without each player's corresponding ID it is worthless. Thanks in advance.

3 answers:

Answer 0: (score: 1)

You need an HTML parser to read the data table and a URL parser to extract the parameters from the href:

Beautiful Soup doc

Python3 urlparse doc
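
A minimal sketch of that idea, assuming the markup shown in the question is already available in a string named html (the tag names and the playerid query parameter are taken from the snippet above):

from urllib.parse import urlparse, parse_qs
from bs4 import BeautifulSoup

# html holds the table markup shown in the question
soup = BeautifulSoup(html, 'html.parser')

player_ids = []
for a in soup.find_all('a', href=True):
    # Parse each link's query string and keep only links that carry a playerid
    query = parse_qs(urlparse(a['href']).query)
    if 'playerid' in query:
        player_ids.append(query['playerid'][0])

print(player_ids)  # e.g. ['11205'] for the snippet above

Using parse_qs avoids any assumptions about where playerid appears in the query string or how many digits the ID has.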

Answer 1: (score: 1)

Here is an easy way to solve the problem (tested and working):

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

# Put the URL of your site here
url = "https://example.dev"
html = urlopen(url)
bs4 = BeautifulSoup(html, 'html.parser')
# Find the first <a> tag whose href attribute contains the string 'playerid='
a = bs4.find('a', href=re.compile('(playerid=)'))
# Get the link from the href attribute
link = a.attrs['href']
# Extract the ID: the text between the first '=' and the following '&'
player_id = link.split('=')[1].split('&')[0]
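
Since the question mentions pandas, the same idea extends to collecting every player link on the page into a DataFrame. This is a sketch that reuses the bs4 and re objects from the snippet above and assumes pandas is installed:

import pandas as pd

# Find every <a> tag whose href contains 'playerid=', not only the first one
rows = []
for a in bs4.find_all('a', href=re.compile('playerid=')):
    pid = a.attrs['href'].split('=')[1].split('&')[0]
    rows.append({'player': a.get_text(strip=True), 'playerid': pid})

df = pd.DataFrame(rows)
print(df)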

Contact me if you need more help!

Answer 2: (score: 0)

There is surely a more efficient solution, but this should give you a basic idea of how to approach the problem.

import requests
from bs4 import BeautifulSoup

url = "https://.."

source_code = requests.get(url).text
soup = BeautifulSoup(source_code, 'lxml')

# Walk every player cell and pull the ID digits out of the href
td_content = soup.find_all('td', {'class': 'grid_line_regular'})
playerids = []
for td in td_content:
    a = td.find('a')
    if a is None:  # cells such as the rank column contain no link
        continue
    link = a['href'].strip()
    if link.startswith('statss'):
        # 'statss.aspx?playerid=' is 21 characters, so the ID starts at index 21
        playerids.append(link[21:26])
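
The fixed slice link[21:26] assumes every ID is exactly five digits. A small regex helper makes the extraction independent of the ID length; this is a sketch with a hypothetical helper name, shown here against the href from the question:

import re

def extract_playerid(href):
    # Capture the digits that follow 'playerid=', however many there are
    match = re.search(r'playerid=(\d+)', href)
    return match.group(1) if match else None

print(extract_playerid('statss.aspx?playerid=11205&position=OF'))  # 11205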