Question

I'm using Beautiful Soup in Python to scrape corn commodity price data. Here is my code so far, which only covers fetching the data:

import requests
from bs4 import BeautifulSoup

url = "http://online.wsj.com/mdc/public/page/2_3020-cashprices-20170320.html"
r = requests.get(url)
soup = BeautifulSoup(r.content, "lxml")
soup.title
f = open('corny.txt', 'w')
# Commodity name: element at index 51 among those with class "text"
commodity = soup.find_all(attrs={"class": "text"})
print(commodity[51])
commo = commodity[51].string
print(commo)
# Corn, No. 2 yellow. Cent. Ill. bu-BP,U (success!!)
f.write(commo)
# Date: element at index 16 among the <span> tags
corndate = soup.find_all("span")
print(corndate[16])
cdate = corndate[16].string
print(cdate)
f.write(cdate)
# Price: element at index 46 among the <b> tags
price = soup.find_all("b")
print(price[46])
pricey = price[46].string
print(pricey)
f.write(pricey)
f.close()

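The problem is that I need to do this for every day from 2005 to the present, but the order of the tags changes, so I can't keep using the same code. For example, on one day the 51st attrs={"class":"text"} element is Corn, but on another day a week later it is something like cotton. I need to write code that outputs only the date and the price (the Wednesday price) for corn (Corn, No. 2 yellow. Cent. Ill. bu-BP,U) to a text file.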

Also, the URL structure seems more complicated than I can make sense of.

Answer 1

Extract the table rows by their tr tag, then search for the row whose text contains the string "Corn, No. 2 yellow". Then take the price from that row.

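import requests
from bs4 import BeautifulSoup

url = "http://online.wsj.com/mdc/public/page/2_3020-cashprices-20170320.html"
r = requests.get(url)
soup = BeautifulSoup(r.content, "lxml")

# The page date is read from the <span> at index 16, as in the question
corndate = soup.find_all("span")
cdate = corndate[16].string
print(cdate)

corn_name = ""
corn_price = ""

# Scan every table row and stop at the first one that mentions corn
corn_info = soup.find_all("tr")
for corn in corn_info:
    text = corn.get_text()
    if "Corn, No. 2 yellow" in text:
        # Collapse blank lines so that each cell ends up on its own line
        text = text.replace("\n\n", "\n", 10)
        text = text.strip("\n")
        all_text = text.split("\n")
        corn_name = all_text[0]
        corn_price = all_text[3]
        break

# Append so that repeated runs accumulate one record per day
with open("Corn Info.txt", "a") as file:
    file.write(corn_name + "\n")
    file.write(cdate + "\n")  # write the date string, not the list of <span> tags
    file.write(corn_price + "\n")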
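
To run this for many days, the same row search can be wrapped in a function and paired with a URL generator. The following is only a sketch under stated assumptions: the names URL_TEMPLATE, cash_price_urls and find_row_cells are made up for illustration, and the date pattern is inferred from the single URL in the question (the trailing 20170320 read as YYYYMMDD), so it may not hold for every year back to 2005. Weekends are skipped on the assumption that no prices are published then. The parser is "html.parser" rather than "lxml" only to avoid an extra dependency.

```python
from datetime import date, timedelta

from bs4 import BeautifulSoup

# Assumption: every daily page follows the pattern of the one URL in the question.
URL_TEMPLATE = "http://online.wsj.com/mdc/public/page/2_3020-cashprices-{:%Y%m%d}.html"


def cash_price_urls(start, end):
    """Yield (day, url) for each weekday from start to end, inclusive."""
    day = start
    while day <= end:
        if day.weekday() < 5:  # Monday=0 .. Friday=4; weekends are skipped
            yield day, URL_TEMPLATE.format(day)
        day += timedelta(days=1)


def find_row_cells(html, label):
    """Return the cell texts of the first innermost row containing label, or None."""
    soup = BeautifulSoup(html, "html.parser")
    for row in soup.find_all("tr"):
        if row.find("table") is not None:
            continue  # layout row wrapping a nested table; look at its inner rows instead
        cells = [cell.get_text(strip=True) for cell in row.find_all("td")]
        if any(label in cell for cell in cells):
            return cells
    return None
```

For each (day, url) pair, fetch the page with requests.get(url), pass r.content to find_row_cells with the label "Corn, No. 2 yellow", and skip the day if the result is None (for example a holiday or a missing page). The first cell is the commodity name; which of the remaining cells holds the price you want should be checked against the live page, since the column order is not confirmed here.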
