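I am trying to parse commodity pricing data from http://agmarknet.nic.in/ and store it in my database.
I get each row of data in the form Ambala Cantt. 1.2 Bitter Gourd 1200 2000 1500, then I break it up with split() and store it in the DB. But some names contain spaces, so split() also breaks those names apart and gives:
['Ambala', 'Cantt.', '1.2', 'Bitter', 'Gourd', '1200', '2000', '1500']
But I want it to look like: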
['Ambala Cantt.', '1.2', 'Bitter Gourd', '1200', '2000', '1500']
I iterate over the data in a for-each loop and split each row. To solve this problem I tried a regular expression:
([c.strip() for c in re.match(r"""
(?P<market>[^0-9]+)
(?P<arrivals>[^ ]+)
(?P<variety>[^0-9]+)
(?P<min>[0-9]+)
\ (?P<max>[0-9]+)
\ (?P<modal>[0-9]+)""",
example,
re.VERBOSE
).groups()])
The code block above works fine if I write example = "Ambala Cantt. 1.2 Bitter Gourd 1200 2000 1500". But if I put it inside a for-each loop, for example for x in y:
([c.strip() for c in re.match(r"""
(?P<market>[^0-9]+)
(?P<arrivals>[^ ]+)
(?P<variety>[^0-9]+)
(?P<min>[0-9]+)
\ (?P<max>[0-9]+)
\ (?P<modal>[0-9]+)""",
example,
re.VERBOSE
).groups()])
I get an attribute error at the re.VERBOSE line: AttributeError: 'NoneType' object has no attribute ... My code looks like this:
params = urllib.urlencode({'cmm': 'Bitter gourd', 'mkt': '', 'search': ''})
headers = {'Cookie': 'ASPSESSIONIDCCRBQBBS=KKLPJPKCHLACHBKKJONGLPHE; ASP.NET_SessionId=kvxhkhqmjnauyz55ult4hx55; ASPSESSIONIDAASBRBAS=IEJPJLHDEKFKAMOENFOAPNIM','Origin': 'http://agmarknet.nic.in', 'Accept-Encoding': 'gzip, deflate', 'Accept-Language': 'en-GB,en-US;q=0.8,en;q=0.6','Upgrade-Insecure-Requests': '1','User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.93 Safari/537.36', 'Content-Type': 'application/x-www-form-urlencoded','Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8', 'Cache-Control': 'max-age=0','Referer': 'http://agmarknet.nic.in/mark2_new.asp','Connection': 'keep-alive'}
conn = httplib.HTTPConnection("agmarknet.nic.in")
conn.request("POST", "/SearchCmmMkt.asp", params, headers)
response = conn.getresponse()
data = response.read()
soup = bs(data, "html.parser")
#print dir(soup)
z = []
y = []
w = []
x1 = []
test = []
trs = soup.findAll("tr")
for tr in trs:
    c = unicodedata.normalize('NFKD', tr.text)
    y.append(str(c))
for x in y:
    #data1 = "Ambala 1.2 Onion 1200 2000 1500"
    x1 = ([c.strip() for c in re.match(r"""
        (?P<market>[^0-9]+)
        (?P<arrivals>[^ ]+)
        (?P<variety>[^0-9]+)
        (?P<min>[0-9]+)
        \ (?P<max>[0-9]+)
        \ (?P<modal>[0-9]+)""",
        x,
        re.VERBOSE
    ).groups()])
    print x1
Can anyone help me get my data in the form ['Ambala Cantt.', '1.2', 'Bitter Gourd', '1200', '2000', '1500'] instead of ['Ambala', 'Cantt.', '1.2', 'Bitter', 'Gourd', '1200', '2000', '1500']?
Answer 0 (score: 1)
Use the shlex module:
import re
import shlex
l = "Ambala Cantt. 1.2 Bitter Gourd 1200 2000 1500"
# first put quotes around word pairs
l = re.sub(r'([A-Z]\w+\s+\w+)', r'"\1"', l)
# then split with shlex, it will not split inside the quoted strings
l = shlex.split(l)
# l == ['Ambala Cantt.', '1.2', 'Bitter Gourd', '1200', '2000', '1500']
You can also run it as a one-liner:
result = shlex.split(re.sub(r'([A-Z]\w+\s+\w+)',r'"\1"',"Ambala Cantt. 1.2 Bitter Gourd 1200 2000 1500"))
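As for the AttributeError in the question: re.match returns None when a string does not match the pattern, so calling .groups() on it fails. A likely cause (an assumption, since the scraped rows are not shown) is that some <tr> rows, such as headers or blank rows, do not have the market/arrivals/variety/prices shape. Below is a minimal sketch that reuses the question's own pattern and simply skips rows that do not match. The names ROW_RE and parse_rows and the sample rows are hypothetical, made up for illustration.

```python
import re

# Same pattern as in the question, compiled once and reused for every row.
ROW_RE = re.compile(r"""
    (?P<market>[^0-9]+)      # market name, may contain spaces
    (?P<arrivals>[^ ]+)      # arrivals figure
    (?P<variety>[^0-9]+)     # variety name, may contain spaces
    (?P<min>[0-9]+)
    \ (?P<max>[0-9]+)
    \ (?P<modal>[0-9]+)""", re.VERBOSE)


def parse_rows(rows):
    """Return one list of stripped fields per row that matches ROW_RE."""
    parsed = []
    for row in rows:
        match = ROW_RE.match(row.strip())
        if match is None:
            # re.match returned None: the row does not fit the pattern,
            # so skip it instead of calling .groups() on None.
            continue
        parsed.append([field.strip() for field in match.groups()])
    return parsed


# Hypothetical sample rows standing in for the scraped <tr> text.
sample = [
    "Market Arrivals Variety Minimum Maximum Modal",
    "Ambala Cantt. 1.2 Bitter Gourd 1200 2000 1500",
    "",
]
result = parse_rows(sample)
```

Here result holds a single parsed row, because the header and the empty string are skipped. This may also be a safer fit than the quoting trick when a market or variety name is a single word: for a row like "Ambala 1.2 Onion 1200 2000 1500", the [A-Z]\w+\s+\w+ pattern would quote "Ambala 1" together. If the real cell text is separated by tabs or newlines instead of single spaces, the pattern would need adjusting.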