GitLab omnibus upgrade unhealthy
GitLab Omnibus has been in an unhealthy state for three days following an upgrade from 17.8 to 18.0, with over 500,000 failed background jobs and gitlab-rake advising "... git bundle ... db:migrate". The command sudo -u git bundle ... fails with what appears to be a file ownership/access issue: "rake ... not found". This is a production service used by developers. What is the advised recovery strategy from here?
Critical situation: a production GitLab Omnibus 18.0 instance with a large backlog of background jobs and incomplete migrations after the upgrade.
Here's a focused and staged recovery strategy:
Step 1: Assess Current State and Log Everything
Confirm GitLab health checks:
sudo gitlab-rake gitlab:check SANITIZE=true
sudo gitlab-rake gitlab:env:info
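Identify where the background jobs are piling up (queue, retry, and dead-set sizes):
sudo gitlab-rails runner 'require "sidekiq/api"; Sidekiq::Queue.all.each { |q| puts "#{q.name}: #{q.size}" }; puts "retry: #{Sidekiq::RetrySet.new.size}", "dead: #{Sidekiq::DeadSet.new.size}"'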
Check logs:
sudo tail -n 100 /var/log/gitlab/sidekiq/current
sudo tail -n 100 /var/log/gitlab/gitlab-rails/production.log
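Because this step is about recording state before changing anything, the output of each check could be saved to a file rather than only read on screen. The following is a minimal sketch; the helper name run_and_log and the output directory are hypothetical and not part of GitLab.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command and save its stdout and stderr
# to "<outdir>/<label>.log", preceded by the command line that was run.
run_and_log() {
  local outdir="$1" label="$2"
  shift 2
  mkdir -p "$outdir"
  {
    printf '$ %s\n' "$*"   # record what was executed
    "$@" 2>&1              # then its combined output
  } > "$outdir/$label.log"
}

# Example usage on the GitLab host (not executed here):
#   out="/root/gitlab-diag-$(date +%Y%m%d-%H%M%S)"   # assumed location
#   run_and_log "$out" check    sudo gitlab-rake gitlab:check SANITIZE=true
#   run_and_log "$out" env-info sudo gitlab-rake gitlab:env:info
#   run_and_log "$out" status   sudo gitlab-ctl status
```

Keeping one file per check makes it easier to compare the state before and after each recovery step.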
Step 2: Diagnose the "rake ... not found" Error
This implies that your PATH or BUNDLE_GEMFILE is not correctly set, usually due to:
- Running the command as the wrong user
- Not using the correct GitLab environment
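Use this pattern to ensure the correct context. On Omnibus, the gitlab-rake wrapper runs the task as the git user with the embedded Ruby environment:
sudo gitlab-rake db:migrate
The bare bundle form below is meant for self-compiled (source) installations. On Omnibus it typically fails with "not found" errors, because bundle and rake are not on the default PATH there:
sudo -u git -H bundle exec rake db:migrate RAILS_ENV=production
If bundle is missing or misconfigured, call the embedded copy directly from the Rails directory: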
cd /opt/gitlab/embedded/service/gitlab-rails
sudo -u git -H /opt/gitlab/embedded/bin/bundle exec rake db:migrate RAILS_ENV=production
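To tell a PATH problem apart from a genuine ownership or permission problem, it can help to check the embedded binaries directly. This is a minimal sketch: describe_tool is a hypothetical helper, /opt/gitlab/embedded/bin/bundle is the path used above, and the rake path next to it is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether a path exists and is executable
# for the user running this script.
describe_tool() {
  local path="$1"
  if [ -x "$path" ]; then
    echo "ok: $path"
  elif [ -e "$path" ]; then
    echo "not executable: $path"
  else
    echo "missing: $path"
  fi
}

# Example usage on the GitLab host (not executed here):
#   describe_tool /opt/gitlab/embedded/bin/bundle
#   describe_tool /opt/gitlab/embedded/bin/rake   # assumed location
# To see what the git user itself resolves:
#   sudo -u git -H sh -c 'echo "$PATH"; command -v bundle; command -v rake'
```

If the binaries report "ok" but the git user's PATH does not include them, the failure is an environment issue rather than file ownership, and the gitlab-rake wrapper is the simpler route.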
Step 3: Clear the Obstruction (Run db:migrate)
If db:migrate hasn't completed, GitLab cannot enqueue or process jobs properly.
sudo gitlab-ctl stop sidekiq
sudo gitlab-rake db:migrate
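If this fails, record the full error output. You might need to restore from a backup (see below), or rerun the upgrade with automatic reconfigure disabled and then run the reconfigure and migration steps manually. On Omnibus, automatic reconfigure is disabled by creating the file /etc/gitlab/skip-auto-reconfigure before installing the package, not by a command-line flag.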
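Whether db:migrate has actually completed can be checked by counting migrations still reported as "down". The sketch below assumes the standard Rails db:migrate:status output, where each migration line starts with "up" or "down"; count_pending_migrations is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count migrations reported as "down" (not yet applied)
# in "db:migrate:status"-style output read from stdin.
count_pending_migrations() {
  awk '$1 == "down" { n++ } END { print n + 0 }'
}

# Example usage on the GitLab host (not executed here):
#   sudo gitlab-rake db:migrate:status | count_pending_migrations
```

A result of 0 after running the migration suggests the schema is current; anything else is worth recording alongside the error output.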
Step 4: Fail-Safe Options
If migration or the bundle command cannot run:
- Create a snapshot or full VM backup immediately.
- Check if a pre-18.0 backup exists in /var/opt/gitlab/backups/ or your storage.
- Restore to 17.8 on a staging VM and retry the upgrade carefully.
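A GitLab backup can only be restored onto the same GitLab version and edition that created it, so it helps to read the version out of each archive name before planning a restore. This sketch assumes the usual Omnibus naming scheme, <epoch>_<yyyy>_<mm>_<dd>_<version>_gitlab_backup.tar; backup_version is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical helper: extract the GitLab version from a backup archive name,
# assuming "<epoch>_<yyyy>_<mm>_<dd>_<version>_gitlab_backup.tar".
backup_version() {
  local name
  name="$(basename "$1")"
  name="${name%_gitlab_backup.tar}"   # drop the fixed suffix
  echo "$name" | cut -d_ -f5-         # drop epoch and date fields
}

# Example usage on the GitLab host (not executed here):
#   for f in /var/opt/gitlab/backups/*_gitlab_backup.tar; do
#     echo "$(backup_version "$f")  $f"
#   done
```

Archives whose version starts with 17.8 are the candidates for the staging restore described above.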
Step 5: Queue Cleanup (Only After Data Integrity is Assured)
If background jobs are beyond recovery:
sudo gitlab-rails console
Sidekiq::Queue.all.each(&:clear)
Sidekiq::RetrySet.new.clear
Sidekiq::ScheduledSet.new.clear
Warning: this is destructive. Only do this if you're certain no important jobs are in the queues.
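Since clearing cannot be undone, it may be worth saving a record of the queue sizes first. The following is a minimal sketch: total_queued_jobs is a hypothetical helper that sums "queue_name size" lines, and the snapshot file path is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper: sum the second column of "queue_name size" lines
# read from stdin.
total_queued_jobs() {
  awk 'NF >= 2 { total += $2 } END { print total + 0 }'
}

# Example usage on the GitLab host (not executed here). The runner one-liner
# prints one "name size" line per Sidekiq queue; tee keeps a copy on disk.
#   sudo gitlab-rails runner \
#     'require "sidekiq/api"; Sidekiq::Queue.all.each { |q| puts "#{q.name} #{q.size}" }' \
#     | tee "/root/sidekiq-queues-$(date +%Y%m%d-%H%M%S).txt" \
#     | total_queued_jobs
```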
Step 6: Restart and Monitor
sudo gitlab-ctl reconfigure
sudo gitlab-ctl restart
sudo gitlab-rake gitlab:check SANITIZE=true
To keep an eye on service status while things settle, use:
watch sudo gitlab-ctl status
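For a quick pass/fail view instead of watching the full status output, the lines that start with "down:" can be filtered out. This sketch assumes the usual runit-style format printed by gitlab-ctl status ("run: <service>: ..." or "down: <service>: ..."); down_services is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the names of services reported as "down"
# in "gitlab-ctl status"-style output read from stdin.
down_services() {
  awk -F': ' '$1 == "down" { print $2 }'
}

# Example usage on the GitLab host (not executed here):
#   sudo gitlab-ctl status | down_services
```

An empty result means every supervised service is reported as running.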
Production-Safe Recommendation
If you're uncertain or logs show database issues:
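- Restore from the latest working backup (or snapshot).
- Rehearse the 17.8 to 18.0 upgrade in a staging clone, following the documented upgrade path, including any required upgrade stops between the two versions.
- Fix permissions, run the migrations, and verify that background jobs are processing before retrying the production upgrade.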