Parallel download of rpm packages

Zypper downloads packages serially, and with about a thousand packages to update every week, it gets quite boring. Also, while zypper can download packages one by one in advance, it can’t be called concurrently. I found the libzypp-bindings project, but it is discontinued. So I set out to improve the situation.

Goals:

- Download all repositories in parallel (most often different servers);
- Download up to MAX_PROC (=6) packages from each repository in parallel;
- Save packages where zypper picks them up during a system update: /var/cache/zypp/packages;
- Alternatively, download to $HOME/.cache/zypp/packages;
- Avoid external dependencies, unless necessary.

Outline:

1. Find the list of packages to update;
2. Find the list of repositories;
3. For each repository:
   1. Keep up to $MAX_PROC curl processes downloading packages.
4. Copy files to the default package cache.

Results & Open Issues:

1. Great throughput: 2,152 kB/s vs 783 kB/s;
2. zypper list-updates doesn’t give new required/recommended packages, so they are not present in the cache with my routine. Any tip on how to get them as well?
3. I’m not sure if wait -n returns to the same background function that dispatched a download request, or if any of them can capture the “wait”, or if all of them resume from a single process exit. This may lead to an unbalanced MAX_PROC per repository, especially if a single, different process captures the wait. Does someone know which primitive the wait is modeled after (mutex/semaphore, etc.) or how it works when there’s more than one wait? (A standalone sketch follows this list.)
4. I’m also not sure whether I should use the local cache or the system cache, although that’s a minor issue.
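To convince myself about issue 3, here is a minimal standalone sketch (not part of the script below); it assumes bash ≥ 4.3, where wait -n waits for any background job of the *current* shell, so each backgrounded subshell only ever reaps its own children:

#!/bin/bash
# Each worker runs in its own backgrounded subshell; `wait -n` inside it can
# only reap jobs started by that same subshell, so the two queues stay at
# most 2 jobs deep independently of each other.
worker() {
    local tag=$1 running=0
    for i in 1 2 3 4 5 6; do
        if (( running >= 2 )); then
            wait -n           # blocks until one of *this* subshell's jobs exits
            ((running--))
        fi
        { sleep $(( RANDOM % 3 + 1 )); echo "$tag: job $i done"; } &
        ((running++))
    done
    wait                      # drain the rest before the subshell exits
}

worker A &
worker B &
wait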


#!/bin/bash

MAX_PROC=6

function repos_to_update () {
    zypper list-updates | grep '^v ' | awk -F '|' '{ print $2 }' | sort --unique | tr -d ' '
}

function packages_from_repo () {
    local repo=$1

    zypper list-updates | grep " | $repo " | awk -F '|' '{ print $6, "#", $3, "-", $5, ".", $6, ".rpm" }' | tr -d ' '
}

function repo_uri () {
    local repo=$1

    zypper repos --uri | grep " | $repo " | awk -F '|' '{ print $7 }' | tr -d ' '
}

function repo_alias () {
    local repo=$1

    zypper repos | grep " | $repo " | awk -F '|' '{ print $2 }' | tr -d ' '
}

function download_package () {
    local alias=$1
    local uri=$2
    local line=$3
    IFS='#' read -r arch package_name <<< "$line"

    local package_uri="$uri/$arch/$package_name"
    local local_dir="$HOME/.cache/zypp/packages/$alias/$arch"
    local local_path="$local_dir/$package_name"
    local padded_alias
    printf -v padded_alias '%-30s' "$alias"
    printf 'Repository: %s Package: %s\n' "$padded_alias" "$package_name"
    if [ ! -f "$local_path" ]; then
        mkdir -p "$local_dir"
        curl --silent --fail -L -o "$local_path" "$package_uri"
    fi
}

function download_repo () {
    local repo=$1

    local uri=$(repo_uri "$repo")
    local alias=$(repo_alias "$repo")
    local pkgs=$(packages_from_repo "$repo")
    local max_proc=$MAX_PROC
    while IFS= read -r line; do
        if [ $max_proc -eq 0 ]; then
            wait -n
            ((max_proc++))
        fi
        download_package "$alias" "$uri" "$line" &
        ((max_proc--))
    done <<< "$pkgs"
    wait  # drain this repository's remaining downloads before the subshell exits
}

function download_all () {
    local repos=$(repos_to_update)
    while IFS= read -r line; do
        download_repo "$line" &
    done <<< "$repos"
    wait
}

download_all
#sudo cp -r ~/.cache/zypp/packages/* /var/cache/zypp/packages/


You can do:

zypper dup --download-only

Do that ahead of time, while you are using your computer for other things.

Then, when you are ready to actually update, the packages have already been downloaded. So the update goes a lot faster.

Yep, I have a workflow like that as well, which reduces the usefulness of this script. I think I can let it download in the background while I read the review of the week.
This was close to the bottom of my priority list… but I guess I’m bad at sticking to priorities :shame:
It would be more useful for a big install, which this script doesn’t handle though.

Btw, I tested with a single repo and the system locked up! I believe it’s the way subdirectories are created.

SO EVERYONE, PLEASE DON’T RUN THE SCRIPT!!

I was selecting the single repo with the --repo parameter, which results in a different output format, so instead of 1 process it was spawning almost 1000 processes. Fun!

This is not based on tracing, only on my hopefully accurate reading of your code, but I think I see each process downloading from an assigned repository…
So your parallelism is entirely determined by however many repositories are configured (maximum 6) and how evenly the load is distributed across those repositories.
If so, maybe that can be improved upon, since you’d ideally want to “keep the pipe filled” as much as possible regardless of how many packages come from any one repo.
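For instance, a single global queue could look roughly like this (an untested sketch; it reuses the helpers already in your script — repos_to_update, repo_uri, repo_alias, packages_from_repo, MAX_PROC — while build_url_list and its tab-separated output format are my own invention, and it assumes xargs with -P):

# Hypothetical sketch: one flat download queue shared by all repositories.
# build_url_list is an invented helper that emits "url<TAB>destination" lines
# using the functions already defined in the script above.
build_url_list() {
    while IFS= read -r repo; do
        local uri alias
        uri=$(repo_uri "$repo")
        alias=$(repo_alias "$repo")
        while IFS='#' read -r arch pkg; do
            printf '%s/%s/%s\t%s/.cache/zypp/packages/%s/%s/%s\n' \
                "$uri" "$arch" "$pkg" "$HOME" "$alias" "$arch" "$pkg"
        done < <(packages_from_repo "$repo")
    done < <(repos_to_update)
}

# xargs keeps exactly MAX_PROC curl processes busy no matter which repo the
# next package comes from.
build_url_list | xargs -P "$MAX_PROC" -n 2 \
    sh -c 'mkdir -p "$(dirname "$1")" && curl --silent --fail -L -o "$1" "$0"'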

A definite suggestion if you’re going to download files in bulk: I don’t know whether repos still support ftp, but downloading multiple files within the same ftp session can greatly increase efficiency by not recreating a session for each package. That is why http/https is most often used when downloading many tiny individual files from multiple sources (e.g. web pages like news sites, which aggregate data from multiple sources) rather than files of varying sizes from the same server (like a repository). http is OK for current, normal zypper use, because each file is often installed before the next is downloaded, but it may be inefficient for what you’re doing. It might be that some mirrors support ftp while others don’t (I haven’t checked).

Although you’re writing your code entirely within a bash script,
I think that nowadays someone might consider deploying this in systemd unit files, spawning “instantiated units” dynamically as needed.
The man pages describe instantiated units really poorly; if you want to look into this, I’d recommend searching for working examples instead.
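Roughly, a template unit looks something like this (a made-up sketch; the unit name, paths and the helper script are placeholders, not something to copy as-is):

# /etc/systemd/system/pkg-download@.service   (hypothetical template unit)
[Unit]
Description=Parallel package download for repository %i

[Service]
Type=oneshot
# %i expands to whatever follows the "@" in the instance name,
# e.g. `systemctl start pkg-download@repo-oss.service`
ExecStart=/usr/local/bin/download_repo.sh %i

Starting pkg-download@repo-oss.service and pkg-download@repo-update.service would then spawn one instance per repository.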
If you do this and create a generic capability separate from your specific use, you might even end up creating a really useful building block for various zypper functions: updates, upgrades, refreshes, installations, or even replacing large images the way package downloads are handled now, which would be more consistent with how other apps work.
You could become famous for creating something that might last years maybe decades into the future… :slight_smile:

An interesting thing you’re doing,
TSU

There are two levels of parallelism: one process for each repository (unbounded), with each process running a maximum of 6 parallel downloads. The queues are independent of each other.

Good catch! That could shave a minute or two off a slow server. There’s at least one mirror that also supports HTTP/2, and keep-alive for HTTP/1 servers. I’ll look into that.
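Something along these lines, I suppose (just a sketch; REPO_URI and the file names are placeholders):

# Sketch only: one curl process, several files from the same host, so the
# TCP/TLS session is reused between transfers instead of being rebuilt
# for every package.
curl --silent --fail --location --remote-name-all \
    "$REPO_URI/x86_64/first-package.rpm" \
    "$REPO_URI/x86_64/second-package.rpm" \
    "$REPO_URI/x86_64/third-package.rpm"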

Sounds great! Except I don’t have any idea what you’re talking about! Just kidding =p
Re-reading the comment, you mean breaking the monolith, plugging escape mechanisms… interesting…

Thanks for the suggestions

Hi,

Imo if you’re doing

grep ... | awk ...

or

awk ... | grep ...

it can be done with just awk.

In your example

zypper repos --uri | grep " | $repo " | awk -F '|' '{ print $7 }' | tr -d ' '

Can be written as

zypper repos --uri  | awk -v name="$1" '$3 == name{print $NF}'

Where “$1” in the variable assignment is a shell variable / positional parameter and has nothing to do with awk.

Also

zypper repos | grep " | $repo " | awk -F '|' '{ print $2 }' | tr -d ' ' 

Can be written as

zypper repos --uri | awk -v name="$1" '$3 == name{print $3}'

Also have a look at https://www.gnu.org/software/parallel/ for downloading packages, since the topic of the post is parallel download.
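E.g. something along these lines (a sketch; urls.txt is a placeholder file with one package URL per line):

# GNU parallel keeps 6 downloads running and queues the rest.
parallel --jobs 6 curl --silent --fail -L -O {} :::: urls.txt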

Yeah, I knew awk could do this sort of thing, I just never thought it would be this easy! I need to find some time to teach myself awk/sed.

I learned about it from another attempt at this same issue, but I’d rather avoid dependencies for this script. Alas, I don’t think it would support reusing connections (to-do).

Another thing to consider: even if all packages are cached, it can still happen that new ones are added and repos need refreshing. What I’ve run into is that cached packages did not match the checksums provided by the repo refresh, and would generate errors.

That said, I can see the appeal of parallel downloading up to the maximum bandwidth.

And that said, I’m in NL with a very stable 300/30 Mbit internet connection and apparently a couple of good mirrors.

Actually, I never worried about download speed. I have been doing frequent updates in a konsole (graphical mode) for 4 years. Running “zypper dup” never tore down the machines. Users shy of doing this can do the update in 2 steps:

  • Run “zypper dup --download-only” in graphical mode in the background. You can still use the machine without risk and it remains fully functional.
  • Switch to a virtual console and run “zypper dup”. Done this way, download speed is not an issue (the two commands are spelled out below).
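That is, in order:

zypper dup --download-only    # step 1: runs in the background while the machine stays usable
zypper dup                    # step 2: later, from a virtual console; installs from the cache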

Well, no need for mirrors, as openSUSE resides round the corner:

erlangen:~ # ping -c 1 pontifex.opensuse.org
PING pontifex.opensuse.org(pontifex.opensuse.org (2620:113:80c0:8::13)) 56 data bytes
64 bytes from pontifex.opensuse.org (2620:113:80c0:8::13): icmp_seq=1 ttl=59 time=16.0 ms

--- pontifex.opensuse.org ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 16.002/16.002/16.002/0.000 ms
erlangen:~ # 

I am paying for 25/5 Mbit only, but updating works like a charm. Nice to hear mirroring works well 20,000 km away.

Good catch! I have disabled auto-refreshing, but I should either find a way to download the new packages too (maybe by parsing zypper dup output or somehow calling libzypp directly), or follow up with a zypper dup, giving up on parallel download for those.
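One idea I might try, just a sketch: parse a dry run. It assumes the English wording and indentation of zypper’s summary, which may differ between versions and locales, and it only yields package names, not the versions/arch needed to build URLs:

# Collect names of NEW and upgraded packages from a dry-run summary.
zypper --non-interactive dup --dry-run |
    awk '/^The following .*(NEW packages are going to be installed|packages are going to be upgraded):/ { collect = 1; next }
         /^[^ ]/  { collect = 0 }
         collect  { for (i = 1; i <= NF; i++) print $i }'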

I can see why you’re not worried about that :slight_smile:

Consider the server closest to me (756km):

PING opensuse.c3sl.ufpr.br(opensuse.c3sl.ufpr.br (2801:82:80ff:8000::b)) 56 data bytes
64 bytes from opensuse.c3sl.ufpr.br (2801:82:80ff:8000::b): icmp_seq=1 ttl=43 time=579 ms

--- opensuse.c3sl.ufpr.br ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 578.850/578.850/578.850/0.000 ms

Repeat a thousand times… Serial throughput is 300-1000 kB/s. There’s a faster server (1,228 km away), but unfortunately it only serves Leap packages. I reached out to them, but the email bounced. Maybe I’ll switch to a server farther away.

So I looked into ftp (no quality servers nearby), curl --parallel (unreliable), curl with a URL list… (okay), picking a local mirror directly (better), and then I settled on aria2. aria2 can take advantage of the global mirror infrastructure to split resources into parts. Download time went from 20 minutes down to 3:30. I think I can stop squeezing out the last bit of speed now.
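For reference, the aria2 invocation is roughly of this shape (a sketch; urls.txt and the option values are starting points, not a tested command):

# aria2 runs several downloads at once and can open more than one connection
# per file; with metalink-aware mirrors a single file can even be pulled
# from several sources in parallel.
aria2c --input-file=urls.txt \
       --dir="$HOME/.cache/zypp/packages" \
       --max-concurrent-downloads=6 \
       --max-connection-per-server=4 \
       --continue=true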

Thanks everyone for your input!