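I want to extract information from a .cdp file (a Content Downloader project file for parsing, which can be opened in Notepad). The file looks like this: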
....
...
<CD_PARSING_RB_9>0</CD_PARSING_RB_9>
<CD_PARSING_RB_F9_3>0</CD_PARSING_RB_F9_3>
<CD_PARSING_LB_1>http://www.prospect.chisites.net/opportunities/?pageno=1
http://www.prospect.chisites.net/opportunities/?pageno=2
http://www.prospect.chisites.net/opportunities/?pageno=3
http://www.prospect.chisites.net/opportunities/?pageno=4</CD_PARSING_LB_1>
<CD_PARSING_EDIT_26>0</CD_PARSING_EDIT_26>
<CD_PARSING_EDIT_27><a href=/jobs/</CD_PARSING_EDIT_27>
<CD_PARSING_EDIT_28>0</CD_PARSING_EDIT_28>
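I want to extract the links using Python. I found a solution, but it only partly works: it removes just the <CD_PARSING_LB_1> tag, whereas it should remove everything except the links between the opening and closing tags. A solution based on re.search would also be fine, but for some reason that did not work for me either.
Code: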
import string
import codecs
import re
import glob
outfile = open('newout.txt', 'w+')
try:
    for file in glob.glob("*.cdp"):
        print(file)
        infile = open(file, 'r')
        step1 = re.sub('.*<CD_PARSING_LB_1>', '',infile.read(), re.DOTALL)
        step2 = re.sub('</CD_PARSING_LB_1>.*','', step1, re.DOTALL)
        outfile.write(str(step1))
except Exception as ex:
    print ex
    raw_input()
Please help me separate out these links in any way you can. Thanks.
Full file example:
Content Downloader X1 (11.9940) project file (parsing)
<F68_CB_5>0</F68_CB_5>
<F68_CB_8>0</F68_CB_8>
<F34_CB_4>0</F34_CB_4>
<F70_CB_4>0</F70_CB_4>
<F34_CB_5>0</F34_CB_5>
<F34_SE_1>0</F34_SE_1>
<F82_SE_2>0</F82_SE_2>
<F69_SE_1>1</F69_SE_1>
<F1_CMBO_8>0</F1_CMBO_8>
<F105_MEMO_1></F105_MEMO_1>
<F9_RBN_01>2</F9_RBN_01>
<F96_RB_01>1</F96_RB_01>
<F1_RBN_15>1</F1_RBN_15>
<F1_N120>1</F1_N120>
<F64_CB_01>0</F64_CB_01>
<F64_RB_01>1</F64_RB_01>
<F70_CB_03>0</F70_CB_03>
<CD_PARSING_COMBO_5>0</CD_PARSING_COMBO_5>
<F64_CB_02>0</F64_CB_02>
<F60_CB_02>0</F60_CB_02>
<F64_RE_1></F64_RE_1>
<F95_M_1></F95_M_1>
<F1_COMBO_6>0</F1_COMBO_6>
<F40_CHCKBX_555>0</F40_CHCKBX_555>
<F09_CB_01>0</F09_CB_01>
<F48_CB_02>0</F48_CB_02>
<F68_CB_01>0</F68_CB_01>
<F68_CB_02>0</F68_CB_02>
<F68_CB_03>0</F68_CB_03>
<F57_CB_41>0</F57_CB_41>
<F57_CB_43>0</F57_CB_43>
<F57_CB_45>0</F57_CB_45>
<F57_CB_47>0</F57_CB_47>
<F57_CB_49>0</F57_CB_49>
<F57_CB_51>0</F57_CB_51>
<F57_CB_53>0</F57_CB_53>
<F57_CB_55>0</F57_CB_55>
<F57_CB_57>0</F57_CB_57>
<F57_CB_59>0</F57_CB_59>
<F57_CB_61>0</F57_CB_61>
<F57_CB_63>0</F57_CB_63>
<F57_CB_65>0</F57_CB_65>
<F57_CB_67>0</F57_CB_67>
<F57_CB_69>0</F57_CB_69>
<F57_CB_71>0</F57_CB_71>
<F57_CB_73>0</F57_CB_73>
<F57_CB_75>0</F57_CB_75>
<F57_CB_77>0</F57_CB_77>
<F57_CB_79>0</F57_CB_79>
<F57_CB_42>0</F57_CB_42>
<F57_CB_44>0</F57_CB_44>
<F57_CB_46>0</F57_CB_46>
<F57_CB_48>0</F57_CB_48>
<F57_CB_50>0</F57_CB_50>
<F57_CB_52>0</F57_CB_52>
<CD_PARSING_EDIT_93>0</CD_PARSING_EDIT_93>
<CD_PARSING_EDIT_94></CD_PARSING_EDIT_94>
<CD_PARSING_EDIT_57_12></CD_PARSING_EDIT_57_12>
<CD_PARSING_EDIT_57_13></CD_PARSING_EDIT_57_13>
<CD_PARSING_EDIT_57_14></CD_PARSING_EDIT_57_14>
<CD_PARSING_EDIT_57_15></CD_PARSING_EDIT_57_15>
<CD_PARSING_EDIT_57_16></CD_PARSING_EDIT_57_16>
<CD_PARSING_EDIT_57_17></CD_PARSING_EDIT_57_17>
<CD_PARSING_EDIT_57_18></CD_PARSING_EDIT_57_18>
<CD_PARSING_RICH_50_1>[VALUE]</CD_PARSING_RICH_50_1>
<CD_PARSING_EDIT_F9_13>3</CD_PARSING_EDIT_F9_13>
<CD_PARSING_EDIT_F9_18>http://sitename.com</CD_PARSING_EDIT_F9_18>
<CD_PARSING_EDIT_F24_2>1</CD_PARSING_EDIT_F24_2>
<CD_PARSING_EDIT_F48_1></CD_PARSING_EDIT_F48_1>
<CD_PARSING_EDIT_F48_2>10</CD_PARSING_EDIT_F48_2>
<CD_PARSING_EDIT_F48_5>0</CD_PARSING_EDIT_F48_5>
<CD_PARSING_EDIT_F48_3>0</CD_PARSING_EDIT_F48_3>
<CD_PARSING_EDIT_F56_1></CD_PARSING_EDIT_F56_1>
<CD_PARSING_EDIT_F56_2>-</CD_PARSING_EDIT_F56_2>
<CD_PARSING_EDIT_F34_1></CD_PARSING_EDIT_F34_1>
<CD_PARSING_EDIT_F34_3>http://</CD_PARSING_EDIT_F34_3>
<CD_PARSING_EDIT_F40_2>Mozilla/5.0 (Windows; U; Windows NT 6.1; ru; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13 sputnik 2.1.0.18 YB/4.3.0</CD_PARSING_EDIT_F40_2>
<CD_PARSING_EDIT_F46_1></CD_PARSING_EDIT_F46_1>
<CD_PARSING_M49_1> class="entry"
id="news-id-
id="article-text"
</CD_PARSING_M49_1>
<CD_PARSING_M48_1></CD_PARSING_M48_1>
<F90_M_1></F90_M_1>
<CD_PARSING_M48_3></CD_PARSING_M48_3>
<CD_PARSING_SYN_F46_1><CD_CYCLE_GRAN_ALL!></CD_PARSING_SYN_F46_1>
<CD_PARSING_RICH_F9_1></CD_PARSING_RICH_F9_1>
<CD_PARSING_RICH_F9_2></CD_PARSING_RICH_F9_2>
<CD_PARSING_R24_1>0</CD_PARSING_R24_1>
<F1_COMBOBOX_9>0</F1_COMBOBOX_9>
<F1_COMBOBOX_10>2</F1_COMBOBOX_10>
<CD_PARSING_RB_9>0</CD_PARSING_RB_9>
<CD_PARSING_RB_F9_3>0</CD_PARSING_RB_F9_3>
<CD_PARSING_LB_1>http://www.latestvacancies.com/wates/</CD_PARSING_LB_1>
<CD_PARSING_EDIT_26>0</CD_PARSING_EDIT_26>
<CD_PARSING_EDIT_27>Jobs/Advert/</CD_PARSING_EDIT_27>
<CD_PARSING_EDIT_28>0</CD_PARSING_EDIT_28>
<CD_PARSING_EDIT_29>?</CD_PARSING_EDIT_29>
<CD_PARSING_COMBOBOX_1>csv</CD_PARSING_COMBOBOX_1>
<CD_PARSING_RE61_1></CD_PARSING_RE61_1>
<CD_PARSING_CHECK_61_1>1</CD_PARSING_CHECK_61_1>
<CD_PARSING_RB60_1>1</CD_PARSING_RB60_1>
<CD_PARSING_SE60_1>1</CD_PARSING_SE60_1>
Answer 0 (score: 0)
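Try this.
Use the with statement to read and write files.
file is a built-in class, so use a name such as ifile instead.
The regular expression for a link is http:[^<]*.
import string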
import codecs
import re
import glob
# The with statements close both files automatically.
with open('newout.txt', 'w+') as outfile:
    try:
        for ifile in glob.glob("*.cdp"):
            print (ifile)
            with open(ifile, 'r') as infile:
                for line in infile:
                    # Collect everything from "http:" up to the next "<" on this line.
                    step1 = re.findall(r'(http:[^<]+)', line)
                    if len(step1) > 0:
                        outfile.write("%s\n" % step1[0].strip())
    except Exception as ex:
        print (ex)
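As a quick check of what the http:[^<]+ pattern picks up, here is a minimal sketch (Python 3) run against a few lines copied from the sample files in the question. The names SAMPLE_LINES, URL_PATTERN and urls_in_lines are made up for this illustration. Note that the pattern is not tied to the CD_PARSING_LB_1 tag, so it also returns http values stored in other fields.

```python
import re

# A few lines taken from the sample files in the question.
SAMPLE_LINES = [
    "<CD_PARSING_EDIT_F9_18>http://sitename.com</CD_PARSING_EDIT_F9_18>",
    "<CD_PARSING_EDIT_F34_3>http://</CD_PARSING_EDIT_F34_3>",
    "<CD_PARSING_LB_1>http://www.prospect.chisites.net/opportunities/?pageno=1",
    "http://www.prospect.chisites.net/opportunities/?pageno=2</CD_PARSING_LB_1>",
    "<CD_PARSING_EDIT_26>0</CD_PARSING_EDIT_26>",
]

# Everything from "http:" up to (but not including) the next "<".
URL_PATTERN = re.compile(r"http:[^<]+")


def urls_in_lines(lines):
    """Return every match of URL_PATTERN, scanning one line at a time."""
    found = []
    for line in lines:
        found.extend(match.strip() for match in URL_PATTERN.findall(line))
    return found


matches = urls_in_lines(SAMPLE_LINES)
for url in matches:
    print(url)
```

If only the links inside CD_PARSING_LB_1 are wanted, the extra matches (http://sitename.com and the bare http://) would have to be filtered out, or the search restricted to the text between the two tags.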
Answer 1 (score: 0)
import re
import glob
outfile = open('newout.txt', 'w+')
try:
    for file in glob.glob("*.cdp"):
        print(file)
        infile = open(file, 'r')
        # Compile with re.DOTALL so that '.' also matches newlines.
        step1 = re.sub(re.compile('.*[<]CD_PARSING_LB_1[>]', re.DOTALL), '',infile.read())
        step2 = re.sub(re.compile('[<]/CD_PARSING_LB_1[>].*', re.DOTALL),'', step1)
        outfile.write(str(step2))
except Exception as ex:
    print ex
    raw_input()
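Try this.
The fourth argument of re.sub is count, not flags.
I also think it might be easier to use result = re.search('[<]tag1[>](.*)[<]/tag1[>]', text, re.DOTALL), where text holds the file contents, and get the links from result.group(1).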
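To make both points concrete, here is a minimal sketch (Python 3) on a shortened copy of the sample from the question. The helper name extract_links and the SAMPLE string are invented for this illustration. Passing re.DOTALL as the fourth positional argument of re.sub puts it in the count slot (re.DOTALL is the integer 16), so the flag never takes effect; the sketch reproduces that with count=16 and compares it with flags=re.DOTALL.

```python
import re

# Shortened copy of the sample from the question.
SAMPLE = """<CD_PARSING_RB_F9_3>0</CD_PARSING_RB_F9_3>
<CD_PARSING_LB_1>http://www.prospect.chisites.net/opportunities/?pageno=1
http://www.prospect.chisites.net/opportunities/?pageno=2</CD_PARSING_LB_1>
<CD_PARSING_EDIT_26>0</CD_PARSING_EDIT_26>
"""


def extract_links(text, tag="CD_PARSING_LB_1"):
    """Return the non-empty lines found between <tag> and </tag>."""
    pattern = "<{0}>(.*?)</{0}>".format(re.escape(tag))
    # re.DOTALL lets '.' cross line breaks, since the links span several lines.
    match = re.search(pattern, text, flags=re.DOTALL)
    if match is None:
        return []
    return [line.strip() for line in match.group(1).splitlines() if line.strip()]


# What the question's call effectively does: re.DOTALL lands in count.
wrong = re.sub(".*<CD_PARSING_LB_1>", "", SAMPLE, count=int(re.DOTALL))
# Passing the flag by keyword applies it as intended.
right = re.sub(".*<CD_PARSING_LB_1>", "", SAMPLE, flags=re.DOTALL)

links = extract_links(SAMPLE)
print(links)
```

With count=16 the '.' still stops at line breaks, so only the opening tag itself is removed, which matches the behaviour described in the question. The re.search version avoids the two-step substitution altogether.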
Answer 2 (score: 0)
Use this regular expression pattern.
String Pattern = "(http:.*=\d{1,7})";
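For comparison with the Python answers, here is a rough Python 3 equivalent, assuming the intended pattern is (http:.*=\d{1,7}), i.e. a URL that ends in "=" followed by one to seven digits. The names PAGED_URL and text are made up for this sketch, and the sample combines lines from both files in the question.

```python
import re

# Assumed reading of the pattern: "http:" ... "=" followed by 1-7 digits.
PAGED_URL = re.compile(r"(http:.*=\d{1,7})")

text = (
    "<CD_PARSING_LB_1>http://www.prospect.chisites.net/opportunities/?pageno=1\n"
    "http://www.prospect.chisites.net/opportunities/?pageno=2</CD_PARSING_LB_1>\n"
    "<CD_PARSING_LB_1>http://www.latestvacancies.com/wates/</CD_PARSING_LB_1>\n"
)

# Without re.DOTALL, '.' stops at line breaks, so each URL is matched separately.
paged = PAGED_URL.findall(text)
print(paged)
```

This only finds links that carry a numeric parameter such as ?pageno=1. A link like http://www.latestvacancies.com/wates/ from the full file example has no "=" followed by digits, so it is skipped.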