I have the following draft scraper:
from lxml import html
import requests
import sys
requestedURL = sys.argv[1]
page = requests.get(requestedURL)
tree = html.fromstring(page.content)
passage = ''
for tr in tree.cssselect("div [class='passage-content passage-class-0']"):
    for each in tr:
        for e in each:
            for x in e:
                if x.text_content() == 'Footnotes:' or x.text_content() == 'Cross references:':
                    passage += '\n'
                    passage = passage.lstrip('\n')
                    sys.stdout.write(passage)
                    sys.exit(0)
                if not x.text_content()[0].isdigit():
                    passage += '\n\n'+x.text_content()+'\n\n'
                else:
                    passage += x.text_content()
                passage = passage.replace('\n\n\n', '\n\n')
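When I run it, I do get the output I want, but I also get two unwanted side effects: the shell prints a background job notice, and the final line does not appear until I press Enter.
Example: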
python bg_scrape.py https://www.biblegateway.com/passage/?search=John+3%3A1&version=ESV
[1] 48648
John 3:1
New International Version (NIV)
Jesus Teaches Nicodemus
3 Now there was a Pharisee, a man named Nicodemus who was a member of the Jewish ruling council.
// this line doesn't show up until I hit enter
[1]+ Done python bg_scrape.py https://www.biblegateway.com/passage/?search=John+3%3A1
It is worth noting that this only started happening once I passed requestedURL in through sys.argv instead of using a static string in the code.
Answer 0 (score: 1)
It is probably the "&" in the command-line argument. Try wrapping the argument in double quotes:
python bg_scrape.py "https://www.biblegateway.com/passage/?search=John+3%3A1&version=ESV"
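Basically, your shell is actually running two things:
python bg_scrape.py https://www.biblegateway.com/passage/?search=John+3%3A1
as a background process, and
version=ESV
which assigns a shell variable. The "[1] 48648" line is the job number and process ID of that background job. When you press Enter, the shell just gives you a status update on any background jobs that have finished (in this case, the one you just started).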
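To make this concrete, here is a minimal sketch of what the script sees in each case. It is an illustration only: the helper name query_params and the variables FULL_URL, truncated_url and safe_command are made up for this example, and the truncation is simulated by cutting the URL at the first "&", which is where the shell ends the first command.

```python
import shlex
from urllib.parse import urlsplit, parse_qs

# The URL from the question.
FULL_URL = "https://www.biblegateway.com/passage/?search=John+3%3A1&version=ESV"


def query_params(url):
    """Return the query-string parameters of url as a dict of lists.

    Hypothetical helper for this example, not part of bg_scrape.py.
    """
    return parse_qs(urlsplit(url).query)


# Unquoted: the shell ends the command at "&", so sys.argv[1] stops
# just before "&version=ESV". Simulate that by cutting at the first "&".
truncated_url = FULL_URL.split("&", 1)[0]
print(query_params(truncated_url))  # {'search': ['John 3:1']}

# Quoted: the whole URL arrives intact, including the version parameter.
print(query_params(FULL_URL))  # {'search': ['John 3:1'], 'version': ['ESV']}

# shlex.quote wraps the URL in single quotes, which also keeps "&" literal,
# so it is handy when building the command line from another program.
safe_command = "python bg_scrape.py " + shlex.quote(FULL_URL)
print(safe_command)
```

The missing version parameter in the unquoted case would also explain why the sample output shows NIV even though ESV was requested: the site presumably falls back to a default translation when no version is given.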