# Troubleshoot batch jobs
DTR uses a job queue to schedule batch jobs. A job is placed on this cluster-wide queue, and the job runner component of one of the DTR replicas picks it up and executes it.

All DTR replicas have access to the job queue, and each has a job runner component that can claim and execute work.
## How it works
1. When a job is created, it is added to the cluster-wide job queue with the `waiting` status.
2. When one of the DTR replicas is ready to claim the job, it waits a random time of up to 3 seconds, giving every replica the opportunity to claim the task.
3. A replica claims a job by adding its replica ID to the job, so that the other replicas know the job has been claimed (sketched below). Once a replica claims a job, it adds it to an internal queue of jobs, sorted by their `scheduledAt` time.
4. When the `scheduledAt` time arrives, the replica updates the job status to `running` and starts executing it.
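The claim step can be pictured with a short sketch. This is not DTR's implementation; the in-memory job dict below stands in for a row in the shared database, and the conditional update is what prevents two replicas from claiming the same job:

```python
import random
import time

# Illustrative only: in DTR the job lives in a database shared by all
# replicas, not in a local dict.
job = {"id": "0", "workerID": "", "status": "waiting"}

def try_claim(job, replica_id):
    # Wait a random time of up to 3 seconds so every replica gets a
    # chance to claim the task.
    time.sleep(random.uniform(0, 3))
    # Claim by writing the replica ID into the job, but only if no other
    # replica claimed it first (a compare-and-set in the real system).
    if job["workerID"] == "":
        job["workerID"] = replica_id
        return True
    return False

if try_claim(job, "000000000000"):
    print("claimed:", job)
```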
The job runner component of each DTR replica keeps a `heartbeatExpiration` entry in the database shared by all replicas. If a replica becomes unhealthy, the other replicas notice this and update that worker's status to `dead`. All the jobs claimed by that replica are then updated to the `worker_dead` status, so that other replicas can claim them.
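In outline, the health check works like the following sketch. Again, this is an illustration rather than DTR's actual code; the `workers` and `jobs` lists stand in for tables in the shared database:

```python
from datetime import datetime, timezone

# Stand-ins for rows in the shared database.
workers = [{"id": "000000000000", "status": "running",
            "heartbeatExpiration": datetime(2017, 2, 18, 0, 51, 2,
                                            tzinfo=timezone.utc)}]
jobs = [{"id": "0", "workerID": "000000000000", "status": "running"}]

now = datetime.now(timezone.utc)
for worker in workers:
    # A worker whose heartbeat has expired is declared dead.
    if worker["heartbeatExpiration"] < now:
        worker["status"] = "dead"
        # Release the dead worker's jobs so other replicas can claim them.
        for job in jobs:
            if job["workerID"] == worker["id"]:
                job["status"] = "worker_dead"

print(workers, jobs)
```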
## Job types

DTR has several types of jobs.
Job | Description |
---|---|
`gc` | Garbage collection job that deletes layers associated with deleted images |
`sleep` | Used to test the correctness of the job runner. It sleeps for 60 seconds |
`false` | Used to test the correctness of the job runner. It runs the `false` command and immediately fails |
`tagmigration` | Syncs tag and manifest information from the blobstore into the database. This information is used by the API and UI, and also for garbage collection |
`bloblinkmigration` | A 2.1 to 2.1 upgrade process that adds references for blobs to repositories in the database |
`license_update` | Checks for license expiration extensions if online license updates are enabled |
`nautilus_scan_check` | An image security scanning job. This job does not perform the actual scanning; rather, it spawns `nautilus_scan_check_single` jobs (one for each layer in the image). Once all of the `nautilus_scan_check_single` jobs are complete, this job terminates |
`nautilus_scan_check_single` | A security scanning job for a particular layer, given by the `SHA256SUM` parameter. This job breaks up the layer into components and checks each component for vulnerabilities |
`nautilus_update_db` | Updates DTR's vulnerability database. It uses an Internet connection to check for database updates through `https://dss-cve-updates.docker.com/` and updates the `dtr-scanningstore` container if a new update is available |
`webhook` | Dispatches a webhook payload to a single endpoint |
## Job status

Jobs can be in one of the following statuses:
Status | Description |
---|---|
`waiting` | The job is unclaimed and waiting to be picked up by a worker |
`running` | The worker defined by `workerID` is currently running the job |
`done` | The job has completed successfully |
`error` | The job has completed with errors |
`cancel_request` | The worker monitors the job status in the database. If the status of a job changes to `cancel_request`, the worker cancels the job |
`cancel` | The job has been canceled and was not fully executed |
`deleted` | The job and its logs have been removed |
`worker_dead` | The worker for this job has been declared dead and the job will not continue |
`worker_shutdown` | The worker that was running this job has been gracefully stopped |
`worker_resurrection` | The worker for this job has reconnected to the database and will cancel this job |
## Job capacity
Each job runner has a limited capacity and doesn't claim jobs that require a higher capacity. You can see the capacity of a job runner using the `GET /api/v0/workers` endpoint:
{
"workers": [
{
"id": "000000000000",
"status": "running",
"capacityMap": {
"scan": 1,
"scanCheck": 1
},
"heartbeatExpiration": "2017-02-18T00:51:02Z"
}
]
}
This means that the worker with replica ID `000000000000` has a capacity of 1 `scan` and 1 `scanCheck`. If this worker notices that the following jobs are available:
{
"jobs": [
{
"id": "0",
"workerID": "",
"status": "waiting",
"capacityMap": {
"scan": 1
}
},
{
"id": "1",
"workerID": "",
"status": "waiting",
"capacityMap": {
"scan": 1
}
},
{
"id": "2",
"workerID": "",
"status": "waiting",
"capacityMap": {
"scanCheck": 1
}
}
]
}
Our worker can pick up job id `0` and `2` since it has the capacity for both, while job id `1` needs to wait until the previous scan job is complete:
{
"jobs": [
{
"id": "0",
"workerID": "000000000000",
"status": "running",
"capacityMap": {
"scan": 1
}
},
{
"id": "1",
"workerID": "",
"status": "waiting",
"capacityMap": {
"scan": 1
}
},
{
"id": "2",
"workerID": "000000000000",
"status": "running",
"capacityMap": {
"scanCheck": 1
}
}
]
}
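To inspect capacity on your own installation, you can query the endpoint directly. A minimal sketch, assuming your DTR is reachable at `dtr.example.org` (hypothetical) and that your account can use the API with basic authentication:

```python
import requests

# Assumption: replace the URL and credentials with your own setup.
DTR_URL = "https://dtr.example.org"

resp = requests.get(f"{DTR_URL}/api/v0/workers",
                    auth=("admin", "password"))
# Add verify=False or a CA bundle if DTR uses self-signed TLS certificates.
resp.raise_for_status()

# Print each worker's capacity so you can see why a waiting job
# is not being picked up.
for worker in resp.json()["workers"]:
    print(worker["id"], worker["status"], worker["capacityMap"])
```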
## Troubleshoot jobs
You can get the list of jobs using the `GET /api/v0/jobs/` endpoint. Each job looks like this:
{
"id": "1fcf4c0f-ff3b-471a-8839-5dcb631b2f7b",
"retryFromID": "1fcf4c0f-ff3b-471a-8839-5dcb631b2f7b",
"workerID": "000000000000",
"status": "done",
"scheduledAt": "2017-02-17T01:09:47.771Z",
"lastUpdated": "2017-02-17T01:10:14.117Z",
"action": "nautilus_scan_check_single",
"retriesLeft": 0,
"retriesTotal": 0,
"capacityMap": {
"scan": 1
},
"parameters": {
"SHA256SUM": "1bacd3c8ccb1f15609a10bd4a403831d0ec0b354438ddbf644c95c5d54f8eb13"
},
"deadline": "",
"stopTimeout": ""
}
The fields of interest here are:

- `id`: the ID of the job
- `workerID`: the ID of the worker in a DTR replica that is running this job
- `status`: the current state of the job
- `action`: what job the worker will actually perform
- `capacityMap`: the available capacity a worker needs for this job to run
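For a quick overview of the queue, you can list the jobs and tally them by status. A sketch under the same assumptions as the previous example, and additionally assuming the response wraps the list in a `jobs` array, like the workers endpoint does:

```python
import requests
from collections import Counter

# Assumption: same hypothetical URL and credentials as above.
DTR_URL = "https://dtr.example.org"

resp = requests.get(f"{DTR_URL}/api/v0/jobs/", auth=("admin", "password"))
resp.raise_for_status()

jobs = resp.json()["jobs"]

# Tally jobs by status to spot work that is stuck or failing.
print(Counter(job["status"] for job in jobs))

# List failed jobs so you know which actions to dig into.
for job in jobs:
    if job["status"] == "error":
        print(job["id"], job["action"], job["lastUpdated"])
```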
## Cron jobs
Several of the jobs performed by DTR run on a recurring schedule. You can see those jobs using the `GET /api/v0/crons` endpoint:
{
"crons": [
{
"id": "48875b1b-5006-48f5-9f3c-af9fbdd82255",
"action": "license_update",
"schedule": "57 54 3 * * *",
"retries": 2,
"capacityMap": null,
"parameters": null,
"deadline": "",
"stopTimeout": "",
"nextRun": "2017-02-22T03:54:57Z"
},
{
"id": "b1c1e61e-1e74-4677-8e4a-2a7dacefffdc",
"action": "nautilus_update_db",
"schedule": "0 0 3 * * *",
"retries": 0,
"capacityMap": null,
"parameters": null,
"deadline": "",
"stopTimeout": "",
"nextRun": "2017-02-22T03:00:00Z"
}
]
}
The `schedule` uses Unix crontab syntax, extended with a leading field for seconds. For example, `57 54 3 * * *` means every day at 03:54:57 UTC, which matches the `nextRun` value shown above.
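To check when each recurring job fires next, you can list the crons and read their `nextRun` values. A minimal sketch under the same assumptions as the earlier examples:

```python
import requests

# Assumption: same hypothetical URL and credentials as above.
DTR_URL = "https://dtr.example.org"

resp = requests.get(f"{DTR_URL}/api/v0/crons", auth=("admin", "password"))
resp.raise_for_status()

for cron in resp.json()["crons"]:
    # Schedule fields: seconds minutes hours day-of-month month day-of-week.
    print(cron["action"], "| schedule:", cron["schedule"],
          "| next run:", cron["nextRun"])
```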