Cronjob to enable timed deep scrubbing in a ceph-cluster

If you have a ceph-cluster that has deterministic IO-patterns, you might want to time the deep-scrubbing during a period with lower IO activity. Probesys had published a script that could run from cron and deep-scrub the oldest n percent of the PGs.

Unfortunately with newer ceph versions the output of some commands has changed. So I did some adjustments to make it work again. Also I made it a bit less verbose, so it would be better for a cronjob. Here is the result:

#!/bin/bash

set -o nounset
set -o errexit

CEPH=/usr/bin/ceph
AWK=/usr/bin/awk
SORT=/usr/bin/sort
HEAD=/usr/bin/head
DATE=/bin/date
SED=/bin/sed
GREP=/bin/grep
PYTHON=/usr/bin/python

#What string does match a deep scrubing state in ceph pg’s output?
DEEPMARK=“scrubbing+deep“
#Max concurrent deep scrubs operations
MAXSCRUBS=2

#Set work ratio from first arg; fall back to ‚7‘.
workratio=$1
[ „x$workratio“ == x ] && workratio=7

function isNewerThan() {
# Args: [PG] [TIMESTAMP]
# Output: None
# Returns: 0 if changed; 1 otherwise
# Desc: Check if a placement group „PG“ deep scrub stamp has changed
# (i.e != „TIMESTAMP“)
pg=$1
ots=$2
ndate=$($CEPH pg $pg query -f json-pretty | \
$PYTHON -c ‚import json;import sys; print json.loads(sys.stdin.read())[„info“][„stats“][„last_deep_scrub_stamp“]‘)
nts=$($DATE -d „$ndate“ +%s)
[ $ots -ne $nts ] && return 0
return 1
}

function scrubbingCount() {
# Args: None
# Output: int
# Returns: 0
# Desc: Outputs concurent deep scrubbing tasks.
cnt=$($CEPH -s | $GREP $DEEPMARK | $AWK ‚{ print $1; }‘)
[ „x$cnt“ == x ] && cnt=0
echo $cnt
return 0
}

function waitForScrubSlot() {
# Args: None
# Output: Informative text
# Returns: true
# Desc: Idle loop waiting for a free deepscrub slot.
while [ $(scrubbingCount) -ge $MAXSCRUBS ]; do
sleep 1
done
return 0
}

function deepScrubPg() {
# Args: [PG]
# Output: Informative text
# Return: 0 when PG is effectively deep scrubing
# Desc: Start a PG „PG“ deep-scrub
$CEPH pg deep-scrub $1 >& /dev/null
#Must sleep as ceph does not immediately start scrubbing
#So we wait until wanted PG effectively goes into deep scrubbing state…
local emergencyCounter=0
while ! $CEPH pg $1 query | $GREP state | $GREP -q $DEEPMARK; do
isNewerThan $1 $2 && break
test $emergencyCounter -gt 150 && break
sleep 1
emergencyCounter=$[ $emergencyCounter +1 ]
done
sleep 2
return 0
}

function getOldestScrubs() {
# Args: [num_res]
# Output: [num_res] PG ids
# Return: 0
# Desc: Get the „num_res“ oldest deep-scrubbed PGs
numres=$1
[ x$numres == x ] && numres=20
$CEPH pg dump pgs 2>/dev/null | \
$AWK ‚/^[0-9]+\.[0-9a-z]+/ { if($10 == „active+clean“) { print $1,$23,$24 ; }; }‘ | \
while read line; do set $line; echo $1 $($DATE -d „$2 $3“ +%s); done | \
$SORT -n -k2 | \
$HEAD -n $numres
return 0
}

function getPgCount() {
# Args:
# Output: number of total PGs
# Desc: Output the total number of „active+clean“ PGs
$CEPH pg stat | $SED ’s/^.* \([0-9]\+\) active+clean[^+].*/\1/g‘
}

#Get PG count
pgcnt=$(getPgCount)
#Get the number of PGs we’ll be working on
pgwork=$((pgcnt / workratio + 1))

#Actual work starts here, quite self-explanatory.
logger -t ceph_scrub „About to scrub 1/${workratio} of $pgcnt PGs = $pgwork PGs to scrub“
getOldestScrubs $pgwork | while read line; do
set $line
waitForScrubSlot
deepScrubPg $1 $2
done

logger -t ceph_scrub „Finished batch“

6 Kommentare

  1. Pingback: Skript fĂŒr zeigesteuerte deep-scrubs bei ceph | Johannes Formanns Webseite

  2. hi Johannes

    When I run the script : ceph-deep-scrub 7

    i get his response in syslog but ceph does not start deep scrubbing

    Aug 23 20:41:11 pve1 ceph_scrub: About to scrub 1/7 of 512 PGs = 74 PGs to scrub Aug 23 20:41:12 pve1 ceph_scrub: Finished batch

    Am i doing something wrong ?

  3. hi Shaun,

    you can run
    bash -x ceph-deep-scrub 7
    to see what’s happening. My guess would bee an older ceph version? The output of ceph pg dump has changed over the time, so you might need to change the awk-Line is you’re not using 0.94.

  4. aah, its a Ceph version issue. I’m currently on the latest version of Firefly 0.80.10.

    I’m gonna upgrade to Hammer 0.94.2 tonight. 🙂

Comments are closed.