Optimising script code for GNU parallel

Date: 2018-05-29 12:49:45

Tags: bash curl gnu-parallel

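I have a script that successfully queries an API, but it is slow: fetching all the resources takes about 16 hours. I looked into how to optimise it and thought that GNU parallel (installed via Homebrew on macOS, version 20180522) would do the trick. But even with 90 jobs (the API endpoint allows at most 100 connections), my script is no faster, and I am not sure why.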

I call my script like this:

bash script.sh | parallel -j90

The script is as follows:

#!/bin/bash

# This script downloads the list of French MPs who contributed to a specific amendment.
# The script is initialised with a file containing a list of API URLs, each pointing to a resource describing an amendment


# The main function loops over 3 actions:
# 1. assign to $sign the API url that points to the list of amendment authors
# 2. run the functions auteur and cosignataires and save them in their respective variables
# 3. merge the variable contents and append them as a new line into a csv file 
main(){
local file="${1}"
local line
local sign
local auteur_clean
local cosign_clean

while read line
    do
        sign="${line}/signataires"
        auteur_clean=$(auteur $sign)
        cosign_clean=$(cosignataires $sign)
        echo "${auteur_clean}","${cosign_clean}" >> signataires_15.csv
done < "${file}"
}

# The auteur function takes the $sign variable as an input and 
# 1. filters the json returned by the API to get only the author's ID
# 2.use the ID stored in $auteur to query the full author resource and capture the key info, which is then assigned to $auteur_nom
#  3. echo a cleaned version of the info stored in $auteur_nom
auteur(){
local url="${1}"
local auteur
local auteur_nom

auteur=$(curl -s "${url}" | jq '.signataires[] | select(.relation=="auteur") | .id') \
&& auteur_nom=$(curl -s "https://www.parlapi.fr/rest/an/acteurs_amendements/${auteur}" \
| jq -r --arg url "https://www.parlapi.fr/rest/an/acteurs_amendements/${auteur}" '$url, .amendement.id, .acteur.id, (.acteur.prenom + " " + .acteur.nom)') \
&& echo "${auteur_nom}" | tr '\n' ',' | sed 's/,$//'
}

# The cosignataires function takes the $sign variable as an input and 
# 1. filter the json returned by the API to produce a space separated list of co-authors
# 2. iterates over list of coauthors to get their name and surname, and assign the resulting list to $cosign_nom
# 3. echo a semi-colon separated list of the co-author names
cosignataires(){
local url="${1}"
local cosign
local cosign_nom
local i

cosign=$(curl -s "${url}" | jq '.signataires[] | select(.relation=="cosignataire") | .id' | tr '\n' ' ') \
&& cosign_nom=$(for i in ${cosign}; do curl -s "https://www.parlapi.fr/rest/an/acteurs_amendements/${i}" | jq -r '(.acteur.prenom + " " + .acteur.nom)'; done) \
&& echo "${cosign_nom}" | tr '\n' ';' | sed 's/;$//'
}

main "url_amendements_15.txt"

The contents of url_amendements_15.txt are as follows:

https://www.parlapi.fr/rest/an/amendements/AMANR5L15SEA717460BTC0174P0D1N7
https://www.parlapi.fr/rest/an/amendements/AMANR5L15PO59051B0490P0D1N90
https://www.parlapi.fr/rest/an/amendements/AMANR5L15PO59051B0490P0D1N134
https://www.parlapi.fr/rest/an/amendements/AMANR5L15PO59051B0490P0D1N187
https://www.parlapi.fr/rest/an/amendements/AMANR5L15PO59051B0490P0D1N161

1 answer:

Answer 0: (score: 3)

Your script loops over the list of URLs and queries them sequentially. You need to break it up so that each API query is done separately; that way parallel will have commands it can execute in parallel.

Change the script so that it takes a single URL. Get rid of the main while loop.

main() {
    local url=$1
    local sign
    local auteur_clean
    local cosign_clean

    sign=$url/signataires
    auteur_clean=$(auteur "$sign")
    cosign_clean=$(cosignataires "$sign")
    echo "$auteur_clean,$cosign_clean" >> signataires_15.csv
}

Then pass url_amendements_15.txt to parallel, giving it the list of URLs that it can process in parallel.
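As a minimal sketch of that wiring (one possible invocation, not necessarily the exact command intended above), the example below stubs out the network calls so that it runs offline. The names worker, urls and out are hypothetical. In the real script the stub line would be replaced by the auteur and cosignataires calls, and the file would end with main "$1". The xargs branch is only a fallback for machines without GNU parallel.

```shell
#!/bin/bash
# Sketch: a per-URL worker script driven by GNU parallel.
# worker, urls and out are illustrative names, not from the original script.

worker=$(mktemp "${TMPDIR:-/tmp}/worker.XXXXXX")
urls=$(mktemp "${TMPDIR:-/tmp}/urls.XXXXXX")
out=$(mktemp "${TMPDIR:-/tmp}/out.XXXXXX")

# Stand-in for script.sh once the while loop is removed: it handles exactly
# one URL (its first argument) and prints one CSV row to stdout.
cat > "$worker" <<'EOF'
main() {
    local url=$1
    local sign="${url}/signataires"
    # Stub: the real script would call auteur "$sign" and cosignataires "$sign".
    printf '%s,%s\n' "${url##*/}" "$sign"
}
main "$1"
EOF

# Two of the URLs from the question, one per line.
cat > "$urls" <<'EOF'
https://www.parlapi.fr/rest/an/amendements/AMANR5L15PO59051B0490P0D1N90
https://www.parlapi.fr/rest/an/amendements/AMANR5L15PO59051B0490P0D1N134
EOF

if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
    # One job per input line, at most 90 at once; -k keeps input order.
    parallel -k -j90 bash "$worker" {} :::: "$urls" > "$out"
else
    # Assumption: fall back to xargs -P where GNU parallel is not installed.
    xargs -P 4 -n 1 bash "$worker" < "$urls" > "$out"
fi

cat "$out"
```

Against the real files this would reduce to something like parallel -j90 bash script.sh {} :::: url_amendements_15.txt > signataires_15.csv, assuming script.sh ends with main "$1" and prints its row instead of appending to the CSV itself. GNU parallel groups each job's output by default, so rows from different jobs should not interleave in the collected file.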