Question

I want to extract the file links from a script tag.

How can I do this in Python 2.7?

The structure is:

<more script tags>
<script>
var settings=
    {
    primary: 'user-o',
    opt: window.userfiles,
    files:  [
                    {
                        //title: "PDF File",
                        image: 'http://url.com/num-001/cover.jpg',
                        sources:
                            [
                                {
                                    'label': '',
                                    'file': 'http://url.com/user/0054552/file-1.pdf',
                                    'type': 'user-o'
                                },
                                {
                                    'label': '',
                                    'file': 'http://url.com/user/0054552/file-2.pdf'
                                }

                            ],

                        other:
                            [
                                {
                                    file: 'http://url.com/user/0054552/other-file-0.pdf',
                                    kind: 'other-files'
                                }
                            ]
                    }

            ]
    };
</script>
<more script tags>

I need all the file links:

... url.com//user/0054552/file-1.pdf
... url.com//user/0054552/file-2.pdf
... url.com//user/0054552/other-file-0.pdf

I hope you can help.

Thanks!

Answer 1

Since you already have the script, you only need to treat it as text and process it.

This code will:

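- split the block of script text at every \n
- for each resulting string, search for the substring .pdf
- if the string contains .pdf, split it at ': ' and replace the ' and , characters
- append the url to the list named files
- done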

Code:

s = '''<more script tags>
<script>
var settings=
    {
    primary: 'user-o',
    opt: window.userfiles,
    files:  [
                    {
                        //title: "PDF File",
                        image: 'http://url.com/num-001/cover.jpg',
                        sources:
                            [
                                {
                                    'label': '',
                                    'file': 'http://url.com/user/0054552/file-1.pdf',
                                    'type': 'user-o'
                                },
                                {
                                    'label': '',
                                    'file': 'http://url.com/user/0054552/file-2.pdf'
                                }

                            ],

                        other:
                            [
                                {
                                    file: 'http://url.com/user/0054552/other-file-0.pdf',
                                    kind: 'other-files'
                                }
                            ]
                    }

            ]
    };
</script>
<more script tags>'''

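files = []
# Split the script text into individual lines.
data = s.split('\n')
for d in data:
    # Only lines that contain a PDF link are of interest.
    if '.pdf' in d:
        # Take the value after ": " and strip the quotes and the trailing comma.
        url = d.split(": ")[1].replace("'", "").replace(",", "")
        print(url)
        files.append(url)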

Output:

http://url.com/user/0054552/file-1.pdf
http://url.com/user/0054552/file-2.pdf
http://url.com/user/0054552/other-file-0.pdf
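
The line-by-line approach above depends on each link sitting on its own line with exactly ': ' before it. As a possible alternative, here is a minimal sketch that uses a regular expression to collect every value assigned to a `file` key, whether the key is quoted or not. The `html` sample, the `FILE_PATTERN` name, and the `extract_file_links` helper are assumptions made for illustration; they are not part of the original answer. It should run on both Python 2.7 and Python 3.

```python
import re

# Hypothetical sample input, condensed from the structure in the question.
# In practice this would be the text of the page (or of the script tag).
html = """
<script>
var settings = {
    primary: 'user-o',
    opt: window.userfiles,
    files: [{
        image: 'http://url.com/num-001/cover.jpg',
        sources: [
            {'label': '', 'file': 'http://url.com/user/0054552/file-1.pdf', 'type': 'user-o'},
            {'label': '', 'file': 'http://url.com/user/0054552/file-2.pdf'}
        ],
        other: [
            {file: 'http://url.com/user/0054552/other-file-0.pdf', kind: 'other-files'}
        ]
    }]
};
</script>
"""

# Match a `file` key (optionally quoted), a colon, then a quoted value.
# The \b keeps names such as `userfiles` from matching, and requiring the
# colon right after `file` skips the plural `files` key.
FILE_PATTERN = re.compile(r"""['"]?\bfile['"]?\s*:\s*['"]([^'"]+)['"]""")


def extract_file_links(text):
    """Return every value assigned to a `file` key, in document order."""
    return FILE_PATTERN.findall(text)


files = extract_file_links(html)
for url in files:
    print(url)
```

With the sample above, `files` holds the three PDF links in the order they appear. Because the pattern keys on the name `file` rather than on the `.pdf` extension, it would also pick up non-PDF files stored under that key, which may or may not be what you want.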
