GitLab omnibus upgrade unhealthy

GitLab Omnibus has been in an unhealthy state for three days after an upgrade from 17.8 to 18.0, with over 500,000 failed background jobs and gitlab-rake advising "... git bundle ... db:migrate". The command sudo -u git bundle ... fails with what appears to be a file ownership/access issue: "rake ... not found". This is a production service used by developers. What is the advised recovery strategy from here?

Critical situation: a production GitLab Omnibus 18.0 instance with a large backlog of failed background jobs and incomplete post-upgrade migrations.

Here's a focused and staged recovery strategy:


๐Ÿ” Step 1: Assess Current State and Log Everything

  1. Confirm GitLab health checks:

    sudo gitlab-rake gitlab:check SANITIZE=true
    sudo gitlab-rake gitlab:env:info
    
  2. List the Sidekiq queues and their backlog sizes to see where jobs are piling up:

    sudo gitlab-rails runner 'Sidekiq::Queue.all.each { |q| puts "#{q.name}: #{q.size}" }'
    
  3. Check logs:

    sudo tail -n 100 /var/log/gitlab/sidekiq/current
    sudo tail -n 100 /var/log/gitlab/gitlab-rails/production.log
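
Since this step is about logging everything before changing anything, the checks above can be captured to files. The following is a minimal sketch, not an official GitLab tool: the capture helper, the OUT_DIR variable, and the output file names are hypothetical, and it assumes the script is run as root on the Omnibus host.

```shell
#!/bin/sh
# Hypothetical helper: run a command and save its stdout and stderr to
# "$OUT_DIR/<label>.txt". A failing command is recorded, not treated as fatal.
capture() {
    label=$1
    shift
    "$@" > "$OUT_DIR/$label.txt" 2>&1 || echo "exit status: $?" >> "$OUT_DIR/$label.txt"
}

# Assumed location for the captured output; override by exporting OUT_DIR.
OUT_DIR=${OUT_DIR:-"${TMPDIR:-/tmp}/gitlab-diag-$(date +%Y%m%d-%H%M%S)"}
mkdir -p "$OUT_DIR"

# Only run the GitLab commands where the Omnibus wrappers are installed.
if command -v gitlab-rake >/dev/null 2>&1; then
    capture check    gitlab-rake gitlab:check SANITIZE=true
    capture env-info gitlab-rake gitlab:env:info
    capture status   gitlab-ctl status
fi

echo "Diagnostics written to $OUT_DIR"
```

Keeping these files alongside any snapshot makes it easier to compare the state before and after each later step.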
    

โš ๏ธ Step 2: Diagnose the rake ... not found Error

This implies your PATH or BUNDLE_GEMFILE is not correctly set, usually due to:

โœ… Use this pattern to ensure correct context:

sudo gitlab-rake db:migrate

or, on a self-compiled (source) installation only, from the GitLab application directory:

sudo -u git -H bundle exec rake db:migrate RAILS_ENV=production

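On Omnibus, if you have to call bundle directly, use the embedded binary from the Rails directory (the gitlab-rake wrapper remains the safer choice):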

cd /opt/gitlab/embedded/service/gitlab-rails
sudo -u git -H /opt/gitlab/embedded/bin/bundle exec rake db:migrate RAILS_ENV=production
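
To see whether the "not found" error is a PATH problem or a genuinely missing or inaccessible file, a quick check such as the following may help. It is a sketch: check_exec is a hypothetical helper, and the paths are the usual Omnibus locations, assumed here rather than confirmed for this host.

```shell
#!/bin/sh
# Hypothetical helper: report whether a path exists and is executable
# for the user running this script.
check_exec() {
    if [ -x "$1" ]; then
        echo "ok: $1"
    else
        echo "missing: $1"
    fi
}

# Usual Omnibus locations (assumed; adjust if your layout differs).
EMBEDDED_BIN=/opt/gitlab/embedded/bin
RAILS_DIR=/opt/gitlab/embedded/service/gitlab-rails

check_exec "$EMBEDDED_BIN/bundle"
check_exec "$EMBEDDED_BIN/rake"
check_exec /opt/gitlab/bin/gitlab-rake

# Show owner and mode of the Rails directory to spot ownership problems.
if [ -d "$RAILS_DIR" ]; then
    ls -ld "$RAILS_DIR"
fi
```

Running it once as root and once with sudo -u git -H shows whether the files are missing outright or only unreachable for the git user.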

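Step 3: Clear the Obstruction (Run db:migrate)

If db:migrate has not completed, the database schema is out of step with the 18.0 code, so background jobs keep failing. Stop Sidekiq so it stops retrying them, then run the migrations: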

sudo gitlab-ctl stop sidekiq
sudo gitlab-rake db:migrate

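If this fails, record the full error output. You might need to restore from a backup (see below), or rerun the package upgrade with automatic reconfiguration disabled (create the file /etc/gitlab/skip-auto-reconfigure first) and then run gitlab-ctl reconfigure and the migrations manually, one step at a time.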
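
To confirm whether migrations are still pending before and after this step, the output of the standard Rails db:migrate:status task can be summarised. This is a sketch: count_pending is a hypothetical helper, and it assumes the usual status layout in which the first column of each migration row is "up" or "down".

```shell
#!/bin/sh
# Hypothetical helper: count rows of db:migrate:status output (read from
# stdin) whose first column is "down", meaning not yet applied.
count_pending() {
    awk '$1 == "down" { n++ } END { print n + 0 }'
}

# Only query the database where the Omnibus wrapper is installed.
if command -v gitlab-rake >/dev/null 2>&1; then
    gitlab-rake db:migrate:status | count_pending
fi
```

A result of 0 after db:migrate suggests the schema migrations are complete; any other number means the "down" rows are worth inspecting individually.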


Step 4: Fail-Safe Options

If migration or the bundle command cannot run:

  1. Create a snapshot or full VM backup immediately.
  2. Check if a pre-18.0 backup exists in /var/opt/gitlab/backups/ or your storage.
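  3. Restore it to a 17.8 instance on a staging VM and retry the upgrade there first. A backup can only be restored to the same GitLab version it was taken on, and you also need /etc/gitlab/gitlab-secrets.json and /etc/gitlab/gitlab.rb, which the backup archive does not include.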
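
To find out which backups predate the upgrade, the GitLab version embedded in each backup file name can be listed. This is a sketch: backup_version is a hypothetical helper, and it assumes the default naming scheme of the form 1493107454_2018_04_25_10.6.4-ce_gitlab_backup.tar and the default backup directory.

```shell
#!/bin/sh
# Hypothetical helper: extract the GitLab version from a backup file name of
# the assumed form <epoch>_<YYYY>_<MM>_<DD>_<version>_gitlab_backup.tar.
# Prints nothing if the name does not match.
backup_version() {
    basename "$1" |
        sed -n 's/^[0-9]*_[0-9]\{4\}_[0-9]\{2\}_[0-9]\{2\}_\(.*\)_gitlab_backup\.tar$/\1/p'
}

# Default Omnibus backup directory; override by exporting BACKUP_DIR.
BACKUP_DIR=${BACKUP_DIR:-/var/opt/gitlab/backups}

for f in "$BACKUP_DIR"/*_gitlab_backup.tar; do
    [ -e "$f" ] || continue   # the glob matched nothing
    echo "$(backup_version "$f")  $f"
done
```

Only a backup whose version matches the 17.8 release you ran before the upgrade is a candidate for the staging restore.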

Step 5: Queue Cleanup (Only After Data Integrity is Assured)

If background jobs are beyond recovery:

sudo gitlab-rails console
Sidekiq::Queue.all.each(&:clear)
Sidekiq::RetrySet.new.clear
Sidekiq::ScheduledSet.new.clear

โš ๏ธ Dangerous! Only do this if you're certain no important jobs are in queue.


๐Ÿ” Step 6: Restart and Monitor

sudo gitlab-ctl reconfigure
sudo gitlab-ctl restart
sudo gitlab-rake gitlab:check SANITIZE=true

To keep an eye on service status while things settle, use:

watch sudo gitlab-ctl status
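
For an unattended check instead of watching by eye, the status output can be polled until no service reports as down. This is a sketch: all_services_up and wait_for_services are hypothetical helpers, and it assumes that gitlab-ctl status prints one line per service beginning with "run:" or "down:".

```shell
#!/bin/sh
# Hypothetical helper: succeed when no line of status output (read from
# stdin) starts with "down:".
all_services_up() {
    ! grep -q '^down:'
}

# Hypothetical helper: poll gitlab-ctl status up to N times (default 30),
# sleeping 10 seconds between attempts. Returns 0 once nothing is down.
wait_for_services() {
    tries=${1:-30}
    while [ "$tries" -gt 0 ]; do
        if gitlab-ctl status 2>/dev/null | all_services_up; then
            return 0
        fi
        tries=$((tries - 1))
        sleep 10
    done
    return 1
}

# Only poll where gitlab-ctl is installed; run as root.
if command -v gitlab-ctl >/dev/null 2>&1; then
    wait_for_services 30 && echo "all services running"
fi
```

A service that is running can still be unhealthy, so this complements the gitlab:check run above and does not replace it.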

Production-Safe Recommendation

If you're uncertain or logs show database issues:

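  1. Restore from the latest working backup (or snapshot).
  2. Rehearse the 17.8 to 18.0 upgrade in a staging clone, following GitLab's documented upgrade path, including any required intermediate stops.
  3. Fix permissions, run the migrations, and verify that background jobs process cleanly before retrying the production upgrade.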