Python list: keep elements containing specific text and remove the rest

Time: 2018-08-03 13:07:07

Tags: python python-2.7 web-scraping

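I am building a LinkedIn scraper to collect basic company information from LinkedIn.

I have a text file containing a list of companies. I read it in, then run a Google search for each company and take the first result link (searching for linkedin.com + company name).

I store all the links in a list. The problem is that some companies use a different language, so I end up with LinkedIn URLs for personal profiles as well as some non-LinkedIn links.

My list looks like this: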

['https://www.linkedin.com/company/transatl-ntica-viajes-y-turismo',
 'https://co.linkedin.com/in/jose-anibal-lerma-moreno-2b389aa3',
 'https://in.linkedin.com/company/indocol---industrial-de-dotaciones-colombianas',
 'https://www.linkedin.com/in/javier-torres-camargo-b983443a',
 'https://in.linkedin.com/company/sas',
 'https://in.linkedin.com/company/ti-tecnologia-informatica-s-a-s',
 'https://www.linkedin.com/company/henkel_2',
 'https://in.linkedin.com/company/sas',
 'https://www.linkedin.com/company/quimica-vulcano-s-a',
 'https://in.linkedin.com/company/sas',
 'https://www.linkedin.com/company/ismocol-de-colombia-s-a-',
 'https://in.linkedin.com/company/sas',
 'https://www.facebook.com/IMCTCajica/',....

As you can see, the company links are mixed in with all sorts of other links. I want to extract/keep only the links that contain:

"linkedin.com/company"

Any approach that does this, or a better one that yields the largest number of matching links, is welcome.

2 answers:

Answer 0: (score: 2)

Use a list comprehension to filter out the unneeded elements.
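
A minimal sketch of that idea, assuming the links are held in a list called `links` (a hypothetical name, not from the original post) and using a shortened sample of the list from the question:

```python
# Hypothetical variable names; a shortened sample of the list from the question.
links = [
    "https://www.linkedin.com/company/transatl-ntica-viajes-y-turismo",
    "https://co.linkedin.com/in/jose-anibal-lerma-moreno-2b389aa3",
    "https://in.linkedin.com/company/sas",
    "https://www.facebook.com/IMCTCajica/",
]

# Keep only the entries that contain the company-page substring.
company_links = [url for url in links if "linkedin.com/company" in url]

for url in company_links:
    print(url)
```

The comprehension builds a new list and leaves `links` untouched, so the unfiltered results remain available if needed.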

Answer 1: (score: 2)

You can also use the filter function:

inList = ['https://www.linkedin.com/company/transatl-ntica-viajes-y-turismo',
 'https://co.linkedin.com/in/jose-anibal-lerma-moreno-2b389aa3',
 'https://in.linkedin.com/company/indocol---industrial-de-dotaciones-colombianas',
 'https://www.linkedin.com/in/javier-torres-camargo-b983443a',
 'https://in.linkedin.com/company/sas',
 'https://in.linkedin.com/company/ti-tecnologia-informatica-s-a-s',
 'https://www.linkedin.com/company/henkel_2',
 'https://in.linkedin.com/company/sas',
 'https://www.linkedin.com/company/quimica-vulcano-s-a',
 'https://in.linkedin.com/company/sas',
 'https://www.linkedin.com/company/ismocol-de-colombia-s-a-',
 'https://in.linkedin.com/company/sas',
 'https://www.facebook.com/IMCTCajica/']

link = "linkedin.com/company"
outList = list(filter(lambda elem: link in elem, inList))
for i in outList:
    print(i)

Output:

https://www.linkedin.com/company/transatl-ntica-viajes-y-turismo
https://in.linkedin.com/company/indocol---industrial-de-dotaciones-colombianas
https://in.linkedin.com/company/sas
https://in.linkedin.com/company/ti-tecnologia-informatica-s-a-s
https://www.linkedin.com/company/henkel_2
https://in.linkedin.com/company/sas
https://www.linkedin.com/company/quimica-vulcano-s-a
https://in.linkedin.com/company/sas
https://www.linkedin.com/company/ismocol-de-colombia-s-a-
https://in.linkedin.com/company/sas
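
A plain substring test would also accept a URL that merely mentions "linkedin.com/company" somewhere in its query string. If that matters, one possible stricter variant is to parse each URL and check the host and path separately. This is only a sketch: the helper name `is_company_link` is made up for illustration, and the import shown is the Python 3 one.

```python
# Python 3 import; on Python 2.7 use `from urlparse import urlparse` instead.
from urllib.parse import urlparse


def is_company_link(url):
    """Return True if url is on a linkedin.com host and its path starts with /company/."""
    parts = urlparse(url)
    host = parts.netloc.lower()
    # Accept linkedin.com itself and country subdomains such as in.linkedin.com.
    on_linkedin = host == "linkedin.com" or host.endswith(".linkedin.com")
    return on_linkedin and parts.path.startswith("/company/")


# Shortened sample of the list from the question.
sample = [
    "https://www.linkedin.com/company/henkel_2",
    "https://co.linkedin.com/in/jose-anibal-lerma-moreno-2b389aa3",
    "https://in.linkedin.com/company/sas",
    "https://www.facebook.com/IMCTCajica/",
]

company_links = [u for u in sample if is_company_link(u)]
print(company_links)
```

For the sample data in the question both approaches keep the same links; the parsed version only differs on URLs where the text appears outside the host and path.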