我正在寻找一种方法将部分href变成熊猫数据表。
</tbody>
<tr class="rgRow" id="LeaderBoard1_dg1_ctl00__0">
<td class="grid_line_regular" align="right">1</td>
<td class="grid_line_regular">
<a href="statss.aspx?playerid=11205&position=OF">Adam Eaton</a>
</td>
<td class="grid_line_regular">
<a href="leaders.aspx?pos=all&stats=bat&lg=all&qual=0&type=8&season=2018&month=0&season1=2018&ind=0&team=24&rost=0&age=0">Nationals</a>
</td>
任何人都可以帮我提取JUST&#34; playerid&#34;之后的数字字符。我设法从网站上搜集了数据,但如果没有玩家的相应ID,它就毫无价值。提前谢谢。
答案 0 :(得分:1)
您需要一个HTML解析器来读取数据表和一个URL解析器来提取$ beeline -u jdbc:hive2://<hostip>:10000/tpch_sf100_orc -n rxxxds
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/apache-hive-2.3.0-bin/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/apache-tez-0.9.0-bin/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop-2.9.0/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Connecting to jdbc:hive2://<hostip>:10000/tpch_sf100_orc
18/04/03 13:41:58 [main]: ERROR jdbc.HiveConnection: Error opening session
org.apache.thrift.TApplicationException: Required field 'client_protocol' is unset! Struct:TOpenSessionReq(client_protocol:null, configuration:{set:hiveconf:hive.server2.thrift.resultset.default.fetch.size=1000, use:database=tpch_sf100_orc})
at org.apache.thrift.TApplicationException.read(TApplicationException.java:111) ~[hive-exec-2.3.0.jar:2.3.0]
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:79) ~[hive-exec-2.3.0.jar:2.3.0]
at org.apache.hive.service.rpc.thrift.TCLIService$Client.recv_OpenSession(TCLIService.java:168) ~[hive-exec-2.3.0.jar:2.3.0]
at org.apache.hive.service.rpc.thrift.TCLIService$Client.OpenSession(TCLIService.java:155) ~[hive-exec-2.3.0.jar:2.3.0]
at org.apache.hive.jdbc.HiveConnection.openSession(HiveConnection.java:680) [hive-jdbc-2.3.0.jar:2.3.0]
at org.apache.hive.jdbc.HiveConnection.<init>(HiveConnection.java:200) [hive-jdbc-2.3.0.jar:2.3.0]
at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:107) [hive-jdbc-2.3.0.jar:2.3.0]
at java.sql.DriverManager.getConnection(DriverManager.java:664) [?:1.8.0_112]
at java.sql.DriverManager.getConnection(DriverManager.java:208) [?:1.8.0_112]
at org.apache.hive.beeline.DatabaseConnection.connect(DatabaseConnection.java:145) [hive-beeline-2.3.0.jar:2.3.0]
at org.apache.hive.beeline.DatabaseConnection.getConnection(DatabaseConnection.java:209) [hive-beeline-2.3.0.jar:2.3.0]
at org.apache.hive.beeline.Commands.connect(Commands.java:1641) [hive-beeline-2.3.0.jar:2.3.0]
at org.apache.hive.beeline.Commands.connect(Commands.java:1536) [hive-beeline-2.3.0.jar:2.3.0]
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_112]
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_112]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_112]
at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_112]
at org.apache.hive.beeline.ReflectiveCommandHandler.execute(ReflectiveCommandHandler.java:56) [hive-beeline-2.3.0.jar:2.3.0]
at org.apache.hive.beeline.BeeLine.execCommandWithPrefix(BeeLine.java:1274) [hive-beeline-2.3.0.jar:2.3.0]
at org.apache.hive.beeline.BeeLine.dispatch(BeeLine.java:1313) [hive-beeline-2.3.0.jar:2.3.0]
at org.apache.hive.beeline.BeeLine.connectUsingArgs(BeeLine.java:867) [hive-beeline-2.3.0.jar:2.3.0]
at org.apache.hive.beeline.BeeLine.initArgs(BeeLine.java:776) [hive-beeline-2.3.0.jar:2.3.0]
at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:1010) [hive-beeline-2.3.0.jar:2.3.0]
at org.apache.hive.beeline.BeeLine.mainWithInputRedirection(BeeLine.java:519) [hive-beeline-2.3.0.jar:2.3.0]
at org.apache.hive.beeline.BeeLine.main(BeeLine.java:501) [hive-beeline-2.3.0.jar:2.3.0]
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_112]
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_112]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_112]
at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_112]
at org.apache.hadoop.util.RunJar.run(RunJar.java:239) [hadoop-common-2.9.0.jar:?]
at org.apache.hadoop.util.RunJar.main(RunJar.java:153) [hadoop-common-2.9.0.jar:?]
18/04/03 13:41:58 [main]: WARN jdbc.HiveConnection: Failed to connect to <hostip>:10000
Error: Could not open client transport with JDBC Uri: jdbc:hive2://<hostip>:10000/tpch_sf100_orc: Could not establish connection to jdbc:hive2://<hostip>:10000/tpch_sf100_orc: Required field 'client_protocol' is unset! Struct:TOpenSessionReq(client_protocol:null, configuration:{set:hiveconf:hive.server2.thrift.resultset.default.fetch.size=1000, use:database=tpch_sf100_orc}) (state=08S01,code=0)
Beeline version 2.3.0 by Apache Hive
的参数:
href
答案 1 :(得分:1)
在这里,您可以轻松解决问题(测试和工作):
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
# Put the url of your site here
url = "https://example.dev"
html = urlopen(url)
bs4 = BeautifulSoup(html, 'html.parser')
# In this line you find the first <a> tag that contain the 'playerid=' string in the href attribute
a = bs4.find('a', href=re.compile('(playerid=)'))
# In this line you get the link in the href attribute
link = a.attrs['href']
# In this line you operate on the link to get the ID
player_id = link.split('=')[1].split('&')[0]
如果您需要更多帮助,请与我联系!
答案 2 :(得分:0)
肯定有一个更有效的解决方案,但这应该为您提供有关如何解决此问题的基本想法。
import requests
from bs4 import BeautifulSoup
url = "https://.."
source_code = requests.get(url).text
soup = BeautifulSoup(source_code, 'lxml')
td_content = soup.find_all('td', {'class': 'grid_line_regular'})
playerids = []
for i in range(len(td_content)):
link = td_content[i].find('a')['href'].strip()
if(link and link[:6] == 'statss'):
playerids.append(link[21:26])