Page 1 of 2
Results 1 to 10 of 13

Thread: Parallel download of rpm packages

  1. #1
    Join Date
    Mar 2020
    Location
    São Leopoldo, RS, Brazil
    Posts
    281

    Default Parallel download of rpm packages

    Zypper downloads packages serially, and with about a thousand packages to update each week, it gets quite tedious. Also, while zypper can download packages one by one in advance, it can't be run concurrently. I found the libzypp-bindings project, but it is discontinued. I set out to improve the situation.

    Goals:


    • Download all repositories in parallel (most often different servers);
    • Download up to MAX_PROC (=6) packages from each repository in parallel;
    • Save packages where zypper picks them up during system update: /var/cache/zypp/packages;
    • Alternatively, download to $HOME/.cache/zypp/packages;
    • Avoid external dependencies, unless necessary.


    Outline:


    1. Find the list of packages to update;
    2. Find the list of repositories;
    3. For each repository:
      1. Keep up to $MAX_PROC curl processes downloading packages.

    4. Copy files to default package cache.


    Results & Open Issues:


    1. Great throughput: 2,152 kb/s vs 783 kb/s;
    2. zypper list-updates doesn't list newly required/recommended packages, so my routine never gets them into the cache. Any tip on how to fetch those as well?
    3. I'm not sure whether wait -n returns to the same background function that dispatched a download request, whether any one of them can capture the "wait", or whether all of them resume from a single process exit. This may lead to an unbalanced MAX_PROC per repository, especially if a single, different process captures the wait. Does anyone know which primitive the wait is modeled after (mutex/semaphore, etc.), or how it works when there's more than one wait?
    4. Also, I'm not sure whether I should use the local cache or the system cache, although that's a minor issue.
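On the wait -n question, a quick experiment is instructive: wait only ever sees children of the current shell, and each backgrounded function body runs in its own subshell with a private job table, so one repository's wait -n can never reap another repository's downloads. A minimal sketch of those semantics (the worker name and sleep times are illustrative, not from the script):

```shell
#!/bin/bash
# `wait -n` (bash >= 4.3) blocks until ANY child of the *current* shell
# exits. Each backgrounded function body is a subshell with its own job
# table, so sibling subshells cannot reap each other's children.
worker() {
    sleep 0.3 &    # first child of this subshell
    sleep 0.1 &    # second child; exits first
    wait -n        # returns as soon as one of *this* subshell's children exits
    echo "subshell $BASHPID reaped one of its own children"
}
worker &           # two independent subshells,
worker &           # each with a private job table
wait               # the top-level wait collects both subshells
```

Run under bash 4.3 or newer; each output line names a different $BASHPID, showing that the two job tables never interact.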


    Code:
    #!/bin/bash
    
    MAX_PROC=6
    
    function repos_to_update () {
        zypper list-updates | grep '^v ' | awk -F '|' '{ print $2 }' | sort --unique | tr -d ' '
    }
    
    function packages_from_repo () {
        local repo=$1
    
        zypper list-updates | grep " | $repo " | awk -F '|' '{ print $6, "#", $3, "-", $5, ".", $6, ".rpm" }' | tr -d ' '
    }
    
    function repo_uri () {
        local repo=$1
    
        zypper repos --uri | grep " | $repo " | awk -F '|' '{ print $7 }' | tr -d ' '
    }
    
    function repo_alias () {
        local repo=$1
    
        zypper repos | grep " | $repo " | awk -F '|' '{ print $2 }' | tr -d ' '
    }
    
    function download_package () {
        local alias=$1
        local uri=$2
        local line=$3
    IFS=# read -r arch package_name <<< "$line"
    
        local package_uri="$uri/$arch/$package_name"
        local local_dir="$HOME/.cache/zypp/packages/$alias/$arch"
        local local_path="$local_dir/$package_name"
    printf -v y %-30s "$alias"
    printf 'Repository: %s Package: %s\n' "$y" "$package_name"
    if [ ! -f "$local_path" ]; then
        mkdir -p "$local_dir"
        curl --silent --fail -L -o "$local_path" "$package_uri"
        fi
    }
    
    function download_repo () {
        local repo=$1
    
    local uri=$(repo_uri "$repo")
    local alias=$(repo_alias "$repo")
    local pkgs=$(packages_from_repo "$repo")
        local max_proc=$MAX_PROC
        while IFS= read -r line; do
            if [ $max_proc -eq 0 ]; then
                wait -n
                ((max_proc++))
            fi
            download_package "$alias" "$uri" "$line" &
            ((max_proc--))
        done <<< "$pkgs"
    }
    
    function download_all () {
        local repos=$(repos_to_update)
        while IFS= read -r line; do
        download_repo "$line" &
        done <<< "$repos"
        wait
    }
    
    download_all
    #sudo cp -r ~/.cache/zypp/packages/* /var/cache/zypp/packages/
    openSUSE Tumbleweed

  2. #2
    Join Date
    Aug 2010
    Location
    Chicago suburbs
    Posts
    14,852
    Blog Entries
    3

    Default Re: Parallel download of rpm packages

    You can do:
    Code:
    zypper dup --download-only
    Do that ahead of time, while you are using your computer for other things.

    Then, when you are ready to actually update, the packages have already been downloaded. So the update goes a lot faster.
    openSUSE Leap 15.2; KDE Plasma 5.18.5;

  3. #3
    Join Date
    Mar 2020
    Location
    São Leopoldo, RS, Brazil
    Posts
    281

    Default Re: Parallel download of rpm packages

    Quote Originally Posted by nrickert View Post
    You can do:
    Code:
    zypper dup --download-only
    Do that ahead of time, while you are using your computer for other things.

    Then, when you are ready to actually update, the packages have already been downloaded. So the update goes a lot faster.
    Yep, I have a workflow like that as well, which reduces the usefulness of this script. I think I can let it download in the background while I read the review of the week.
    This was close to the bottom of my priority list... but I guess I'm bad at sticking to priorities.
    It would be more useful for a big install, which this script doesn't handle, though.

    Btw, I tested with a single repo and the system locked up! I believe it's the way subdirectories are created.

    SO EVERYONE, PLEASE DON'T RUN THE SCRIPT!!
    openSUSE Tumbleweed

  4. #4
    Join Date
    Mar 2020
    Location
    São Leopoldo, RS, Brazil
    Posts
    281

    Default Re: Parallel download of rpm packages

    Quote Originally Posted by awerlang View Post
    Btw, I tested with a single repo and the system locked up! I believe it's the way subdirectories are created.

    SO EVERYONE, PLEASE DON'T RUN THE SCRIPT!!
    I was selecting the single repo with the --repo parameter, which produces a different output format, so instead of 1 process it was spawning almost 1000 processes. Fun!
    openSUSE Tumbleweed

  5. #5
    Join Date
    Jun 2008
    Location
    San Diego, Ca, USA
    Posts
    13,182
    Blog Entries
    2

    Default Re: Parallel download of rpm packages

    Not based on tracing, only on my hopefully accurate reading of your code:
    I think I see each process downloading from an assigned repository.
    So your parallelism depends entirely on how many repositories are configured (with a maximum of 6 downloads each) and on how evenly the load is distributed across repositories.
    If so, maybe that can be improved upon, since you'd ideally want to "keep the pipe filled" as much as possible regardless of how many packages come from any one repo.

    A definite suggestion: if you're going to download files in bulk, consider session reuse. I don't know if repos still support FTP, but downloading multiple files within the same FTP session can greatly increase efficiency by not recreating a session for each package. HTTP/HTTPS is best suited to downloading many tiny files from multiple sources (e.g. web pages like news sites, which aggregate data from multiple servers), not files of varying sizes from the same server (like a repository). HTTP is OK for current, normal zypper use, because each file is often installed before the next is downloaded, but it may be inefficient for what you're doing. Some mirrors might support FTP while others don't (I haven't checked).

    Although you're writing your code entirely within a bash script,
    I think that nowadays someone might consider deploying it as systemd unit files, spawning "instantiated units" dynamically as needed.
    The man pages describe instantiated units really poorly; if you want to look into this, I'd recommend searching for working examples instead.
    If you do this and create a generic capability separate from your specific use, you might even end up creating a really useful building block for various zypper functions: updates, upgrades, refreshes, installations, or replacing large images the way package downloads are done now, which would be more consistent with how other apps work.
    You could become famous for creating something that might last years, maybe decades, into the future...

    An interesting thing you're doing,
    TSU
    Beginner Wiki Quickstart - https://en.opensuse.org/User:Tsu2/Quickstart_Wiki
    Solved a problem recently? Create a wiki page for future personal reference!
    Learn something new?
    Attended a computing event?
    Post and Share!

  6. #6
    Join Date
    Mar 2020
    Location
    São Leopoldo, RS, Brazil
    Posts
    281

    Default Re: Parallel download of rpm packages

    Quote Originally Posted by tsu2 View Post
    Not based on tracing, only on my hopefully accurate reading of your code:
    I think I see each process downloading from an assigned repository.
    So your parallelism depends entirely on how many repositories are configured (with a maximum of 6 downloads each) and on how evenly the load is distributed across repositories.
    If so, maybe that can be improved upon, since you'd ideally want to "keep the pipe filled" as much as possible regardless of how many packages come from any one repo.
    There are two levels of parallelism: one process per repository (unbounded), with each process running at most 6 parallel downloads. The queues are independent of each other.


    Quote Originally Posted by tsu2 View Post
    A definite suggestion: if you're going to download files in bulk, consider session reuse. I don't know if repos still support FTP, but downloading multiple files within the same FTP session can greatly increase efficiency by not recreating a session for each package. HTTP/HTTPS is best suited to downloading many tiny files from multiple sources (e.g. web pages like news sites, which aggregate data from multiple servers), not files of varying sizes from the same server (like a repository). HTTP is OK for current, normal zypper use, because each file is often installed before the next is downloaded, but it may be inefficient for what you're doing. Some mirrors might support FTP while others don't (I haven't checked).
    Good catch! That could shave a minute or two off a slow server. There's at least one mirror that also supports HTTP/2, and keep-alive for HTTP/1 servers. I'll look into that.
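Connection reuse with curl mostly comes down to handing many URLs to a single invocation: curl keeps the connection to a host open across the URLs of one run, and --parallel (curl >= 7.66) adds concurrent transfers, with HTTP/2 multiplexing where the mirror supports it. A runnable sketch using local file:// URLs, so the paths are made up on the fly; for real use, substitute the repository URLs:

```shell
# One curl process fetching several URLs: the session is set up once and
# reused across transfers (a TCP/TLS connection in the real https case).
src=$(mktemp -d)
dest=$(mktemp -d)
printf 'first\n'  > "$src/a.rpm"
printf 'second\n' > "$src/b.rpm"
curl --silent --fail \
     --output "$dest/a.rpm" "file://$src/a.rpm" \
     --output "$dest/b.rpm" "file://$src/b.rpm"
ls "$dest"
```

Against an HTTP/2 mirror, adding --parallel --parallel-max 6 to the same invocation multiplexes the transfers over one connection instead of spawning one curl process per package.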

    Quote Originally Posted by tsu2 View Post
    Although you're writing your code entirely within a bash script,
    I think that nowadays someone might consider deploying it as systemd unit files, spawning "instantiated units" dynamically as needed.
    The man pages describe instantiated units really poorly; if you want to look into this, I'd recommend searching for working examples instead.
    If you do this and create a generic capability separate from your specific use, you might even end up creating a really useful building block for various zypper functions: updates, upgrades, refreshes, installations, or replacing large images the way package downloads are done now, which would be more consistent with how other apps work.
    Sounds great! Except I don't have any idea what you're talking about! Just kidding =p
    Re-reading the comment: you mean breaking up the monolith, plugging in escape mechanisms... interesting...

    Thanks for the suggestions
    openSUSE Tumbleweed

  7. #7

    Default Re: Parallel download of rpm packages

    Hi,

    Imo if you're doing

    Code:
    grep ... | awk ...
    or

    Code:
    awk ... | grep ...
    it can be done with just awk.

    In your example

    Code:
    zypper repos --uri | grep " | $repo " | awk -F '|' '{ print $7 }' | tr -d ' '
    Can be written as


    Code:
    zypper repos --uri  | awk -v name="$1" '$3 == name{print $NF}'
    Where "$1" in the variable assignment is a shell variable / positional parameter and has nothing to do with awk.


    Also

    Code:
    zypper repos | grep " | $repo " | awk -F '|' '{ print $2 }' | tr -d ' '
    Can be written as

    Code:
    zypper repos --uri | awk -v name="$1" '$3 == name{print $3}'
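The tr -d ' ' step can fold into the same awk program too. A sketch against a made-up zypper repos --uri line (the column layout and values are assumptions: '|'-separated fields, alias second, URI last):

```shell
# Hypothetical `zypper repos --uri` line: # | Alias | Name | ... | URI
printf '%s\n' \
  ' 1 | repo-oss | Main Repository | Yes | (r ) Yes | Yes | http://example.org/oss' |
awk -F'|' -v name="repo-oss" '
    { for (i = 1; i <= NF; i++) gsub(/^ +| +$/, "", $i) }  # trim every field
    $2 == name { print $NF }                               # URI is the last field
'
```

Matching on a trimmed field with -F'|' replaces both the grep and the tr of the original pipelines in one pass.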


    Also have a look at https://www.gnu.org/software/parallel/ for downloading package since the topic of the post is parallel download.
    "Unfortunately time is always against us" -- [Morpheus]

    .:https://github.com/Jetchisel:.

  8. #8
    Join Date
    Mar 2020
    Location
    São Leopoldo, RS, Brazil
    Posts
    281

    Default Re: Parallel download of rpm packages

    Quote Originally Posted by jetchisel View Post
    Hi,

    Imo if you're doing

    Code:
    grep ... | awk ...
    or

    Code:
    awk ... | grep ...
    it can be done with just awk.

    In your example

    Code:
    zypper repos --uri | grep " | $repo " | awk -F '|' '{ print $7 }' | tr -d ' '
    Can be written as


    Code:
    zypper repos --uri  | awk -v name="$1" '$3 == name{print $NF}'
    Where "$1" in the variable assignment is a shell variable / positional parameter and has nothing to do with awk.
    Yeah, I knew awk could do this sort of thing, I just never thought it would be this easy! I need to find some time to teach myself awk/sed.

    Quote Originally Posted by jetchisel View Post
    Also have a look at https://www.gnu.org/software/parallel/ for downloading package since the topic of the post is parallel download.
    I learned about it from another attempt at this same issue, but I'd rather avoid dependencies for this script. Alas, I don't think it would support reusing connections (to-do).
    openSUSE Tumbleweed

  9. #9
    Join Date
    Jun 2008
    Location
    Groningen, Netherlands
    Posts
    20,889
    Blog Entries
    14

    Default Re: Parallel download of rpm packages

    Another thing to consider: even if all packages are cached, it can still happen that new ones are added and repos need refreshing. What I've run into is that cached packages did not match the checksums provided by the repo refresh, and would generate errors.

    That said, I can see the appeal of parallel downloading up to the maximum bandwidth.

    And that said, I'm in NL with a very stable 300/30 Mbit internet connection and apparently a couple of good mirrors.
    ° Appreciate my reply? Click the star and let me know why.

    ° Perfection is not gonna happen. No way.

    http://en.opensuse.org/User:Knurpht
    http://nl.opensuse.org/Gebruiker:Knurpht

  10. #10
    Join Date
    Jan 2014
    Location
    Erlangen
    Posts
    2,301
    Blog Entries
    1

    Default Re: Parallel download of rpm packages

    Quote Originally Posted by awerlang View Post
    Zypper downloads packages serially, and with about a thousand packages to update each week, it gets quite tedious. Also, while zypper can download packages one by one in advance, it can't be run concurrently. I found the libzypp-bindings project, but it is discontinued. I set out to improve the situation.
    Actually, I never worried about download speed. I have been doing frequent updates in a konsole (graphical mode) for 4 years. Running "zypper dup" never tore down the machines. Users shy of doing this can do the update in 2 steps:

• Run "zypper dup --download-only" in graphical mode in the background. You can still use the machine, without risk and fully functional.
• Switch to a virtual console and run "zypper dup". Done this way, download speed is not an issue.
    AMD Athlon 4850e (2009), openSUSE 13.1, KDE 4, Intel i3-4130 (2014), i7-6700K (2016), i5-8250U (2018), AMD Ryzen 5 3400G (2020), openSUSE Tumbleweed, KDE Plasma 5

