Ceph
We use a Ceph Storage Cluster for our storage system. http://docs.ceph.com/docs/master/
Ceph has an underlying object store called RADOS, which stores all data. On top of it, three different mechanisms are provided to access and use the underlying RADOS layer:
- RadosGW: bucket-based REST API for S3/OpenStack integration; requires a RADOS Gateway service/server
- RBD: RADOS Block Devices; creates a full block device in the storage cluster that can be mounted and used (e.g. formatted with a filesystem) like any other block device
- CephFS: a POSIX-compliant file system; can be mounted with the FUSE driver; requires a metadata service/server (MDS)
The cluster itself consists of basically two components: Monitors and OSD Daemons.
Some important concepts to understand are Pools, the Crush Map and the Placement groups.
Pools: A pool is a logical partition for storing objects, i.e. a part of the overall storage cluster. Each pool has its own number of placement groups and its own Crush rules (different number of replicas, failure domains, etc.).
Placement groups: Each pool has a certain number of placement groups. When an object is added, it is hashed into one placement group; each placement group (based on the number of replicas) places its objects on a certain set of OSDs.
Here is a Placement group calculator for choosing the number of PGs for a pool. https://ceph.com/pgcalc/
More on Placement Groups: http://docs.ceph.com/docs/master/rados/operations/placement-groups/
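The hash-then-modulo placement described above can be illustrated with a toy shell snippet. Note that this only mimics the idea: Ceph internally uses its own rjenkins hash and a "stable mod", while POSIX `cksum` (CRC-32) stands in here.

```shell
#!/bin/sh
# Toy illustration: map object names onto one of pg_num placement groups.
# Real Ceph hashes with rjenkins, not CRC-32 -- the principle is the same.
pg_num=256
for obj in vm-disk-1 vm-disk-2 backup.tar; do
    hash=$(printf '%s' "$obj" | cksum | cut -d' ' -f1)
    echo "$obj -> pg $((hash % pg_num))"
done
```

Because the mapping is pure computation, every client can locate an object's placement group independently; no central lookup table is involved.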
Placement Groups can be in different states:
- Inactive: Cannot process reads or writes because the PG is waiting for an OSD with the most up-to-date data to come up and in
- Unclean: Contains data that is not replicated the desired number of times; should be recovering
- Stale: Unknown state - all the OSDs that host the PG have not reported to the monitors in a while
Crush Map: The Crush Map finally maps the objects in a placement group to one or several (depending on the number of replicas) OSDs. The Crush algorithm can be tweaked to take different failure domains into account.
Overview Infrastructure
Server | Purpose | N-Disks | Raw Space | N-Journal Disks | Journal Space |
---|---|---|---|---|---|
mon01-cm | Monitor + (CephFS Metadata) | - | - | 1 x 800 GB | 800 GB |
sto01-cm | OSD Daemon + Monitor | 15 x 9.9 TB | 148.5 TB | 2 x 120 GB | 240 GB |
sto02-cm | OSD Daemon + Monitor | 15 x 9.9 TB | 148.5 TB | 2 x 120 GB | 240 GB |
Pool Name | Purpose | Replica Size | Note |
---|---|---|---|
rbd | Default Pool | repl: 2 | Used for Rados Block devices |
one | Opennebula pool | repl: 2 | VM disks for Opennebula |
fs | Ceph FS Data Pool | repl: 2 | - |
fs_meta | Ceph FS Meta Data Pool | repl: 3 | - |
ec | Erasure Coded Pool | erasure: k=8 m=4 | - |
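As a sanity check on the pool table, the raw-storage overhead of the erasure-coded profile (k=8, m=4) versus replication can be computed: replication stores `size` full copies, while erasure coding stores (k+m)/k times the data.

```shell
# Overhead factor: replication = size copies, erasure coding = (k+m)/k
awk 'BEGIN {
    k = 8; m = 4
    printf "replica size=2:  %.2fx raw usage\n", 2
    printf "replica size=3:  %.2fx raw usage\n", 3
    printf "erasure k=8 m=4: %.2fx raw usage\n", (k + m) / k
}'
```

So the ec pool tolerates the loss of up to m=4 OSDs at only 1.5x raw usage, compared to 2x for the replica-2 pools (which survive a single loss).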
CephFS Export through NFS
To export CephFS namespaces nfs-ganesha needs to be used.
- Install packages: sudo apt install libcephfs2 nfs-ganesha nfs-ganesha-ceph ceph-fuse
- Copy the client keyrings to /etc/ceph and prepend "ceph." to the keyring file names
- client.I11.fs.sto.student.keyring → ceph.client.I11.fs.sto.student.keyring
- Make sure you can mount CephFS directories with ceph-fuse; it will be used by nfs-ganesha
- sudo ceph-fuse -n client.I11.fs.sto.student -r /I11/sto/student /mnt/public
- Edit nfs-ganesha configuration file: sudo vim /etc/ganesha/ganesha.conf
EXPORT {
    # Export Id (mandatory, each EXPORT must have a unique Export_Id)
    Export_Id = 2;
    # Exported path (mandatory)
    Path = /I11/sto/student;
    # Pseudo Path (required for NFS v4)
    Pseudo = /mnt/public;
    # Exporting FSAL
    FSAL {
        Name = CEPH;
        User_Id = "I11.fs.sto.student";
    }
    # Export to clients
    CLIENT {
        Clients = 131.159.24.0/23, 172.24.24.0/23;
        Squash = None;
        Access_Type = RW;
    }
}
- Restart ganesha service: sudo service nfs-ganesha restart
- Have a look into /var/log/ganesha/ganesha.log to check for problems
- Mount directory on a remote host: sudo mount -t nfs -o nfsvers=4.1,proto=tcp cephex:/mnt/public /data/ceph/public
Usage
Number of Placement Groups
Get PGs Number
- Get placement groups in a pool
ceph osd pool get <pool-name> pg_num
Set PGs Number
- First calculate number of PGs according to environment: Ceph PG Calc
- Number of Placement Groups can not be decreased!
#first command splits the data, second command makes the new number available to the crush algorithm
#both values should be equal
ceph osd pool set <pool-name> pg_num <pg-number>
ceph osd pool set <pool-name> pgp_num <pg-number>
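The rule of thumb behind the linked PG calculator can be sketched as follows; the OSD count and pool size below are illustrative values, use the calculator for real sizing:

```shell
#!/bin/sh
# Rule of thumb: pg_num ~= (num_osds * 100) / pool_size,
# rounded up to the next power of two.
num_osds=30
pool_size=2
target=$(( num_osds * 100 / pool_size ))
pg_num=1
while [ "$pg_num" -lt "$target" ]; do
    pg_num=$(( pg_num * 2 ))
done
echo "suggested pg_num: $pg_num"
```

For these values the target is 1500, so the script prints `suggested pg_num: 2048`. Remember that this total is shared across all pools on the same OSDs.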
Cluster Status
- Ceph Version
ceph tell mon.* version
- Show config of ceph daemons, log in on the host the daemon runs on!
ceph daemon mds.sto01 config show
ceph daemon osd.0 config show
- Short Status
ceph health
- Storage Cluster status
ceph status
- Cluster usage stats: Show free space in the pools
ceph df
- List users and their capabilities
ceph auth list
- Very quick benchmark of a single OSD; simple throughput benchmark writing 1 GB in 4 MB increments
ceph tell osd.<osd-id> bench
- Monitor Map: Contains the cluster fsid; the position, name, address and port of each monitor; indicates the current epoch, when the map was created, and when it was last changed
ceph mon dump
- OSD Map: Contains cluster fsid, when map was created and last modified, list of pools, replica sizes, PG numbers, list of OSDs and their status (up,in)
ceph osd dump
- PG Map: Contains the PG version, time stamp, last OSD map epoch, full ratios, and details on each placement group such as PG ID, Up Set, Acting Set, state of the PG, and data usage for each pool
ceph pg dump
ceph pg map <pg-num>
- Crush Map: Contains a list of storage devices, failure domain hierarchy when storing data
ceph osd getcrushmap -o <output-file>
#decompile map
crushtool -d <crushmap> -o <decompiled-crushmap>
#view with vim or cat
- MDS Map: Metadata server map; contains when the map was created and last changed, the pool for storing metadata, and the list of metadata servers and which are up and in
ceph fs dump
- MDS Clients: show all cephfs clients connected to this metadata server - using the filesystem
ceph tell mds.0 session ls
Add OSDs
- Prepare the OSD
ceph-deploy osd prepare <node>:<path-to-device/directory>
- Activate the OSD
ceph-deploy osd activate <node>:<path-to-device/directory>
CephFS commands
- List all connected fs clients / mounted directories
ceph tell mds.* session ls
Remove OSDs
- Change crush weight for rebalancing
ceph osd crush reweight osd.<ID> 0.0
- Wait for the rebalancing to complete!
- Completely remove the OSD
#take the osd out of the cluster
ceph osd out <ID>
#stop the osd daemon for that drive on the host it is running on
sudo systemctl stop ceph-osd@<ID>
#remove osd from crush map
ceph osd crush remove osd.<ID>
#remove authentication key
ceph auth del osd.<ID>
#remove the osd
ceph osd rm <ID>
#(optional) delete partition table on the node
sudo umount /dev/<drive>
sudo wipefs -a /dev/<drive>
Pause Cluster (Node Reboot/Maintenance)
- Disable rebalancing
ceph osd set nodown
ceph osd set noout
- Also stop CephFS and MDS services
sudo umount -lf /mnt/cephfs
sudo service ceph stop mds
- Disable scrubbing and deep-scrubbing
ceph osd set noscrub
ceph osd set nodeep-scrub
- Re-enable all the services
sudo service ceph start osd.<ID>
sudo service ceph start mds
sudo mount /mnt/cephfs
ceph osd unset noscrub
ceph osd unset nodeep-scrub
ceph osd unset noout
ceph osd unset nodown
Adjust Crush Map
- Get Crush Map
ceph osd getcrushmap -o crush-com
- Decompile Crush Map
crushtool -d crush-com -o crush-dec
- Edit Map with text editing tool
vim crush-dec
- Compile new Crush Map
crushtool -c crush-dec -o crush-com
- Set new Map for usage in cluster
ceph osd setcrushmap -i crush-com
The actual Crush Map consists of four components:
- Devices: one device entry for each OSD daemon/disk
- Bucket Types: define the types of buckets used in the Crush hierarchy (e.g. host, rack, root)
- Bucket Instances: define the actual buckets, arrange them relative to each other, and group the devices into failure domains
- Rules: determine data placement for pools
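After decompiling, a rule in the map looks roughly like the following (illustrative values, in the pre-Luminous syntax matching the ceph-deploy/kraken setup used in this document):

```
rule replicated_ruleset {
	ruleset 0
	type replicated
	min_size 1
	max_size 10
	step take default
	step chooseleaf firstn 0 type host
	step emit
}
```

`step chooseleaf firstn 0 type host` picks one leaf (OSD) from as many distinct host buckets as replicas are needed, which is what makes hosts the failure domain.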
More information on the Crush Map can be found here: http://docs.ceph.com/docs/master/rados/operations/crush-map/
CLI Crush Map Configuration
- Add/Move an OSD inside Crush Map
ceph osd crush set <ID> <weight-TB> root=<root-of-tree> ...
ceph osd crush set 0 9.01598 root=default host=sto01-1
- Remove OSD from Crush Map
ceph osd crush remove osd.<ID>
- Add bucket
ceph osd crush add-bucket <name> <type>
ceph osd crush add-bucket sto01-1 host
- Move bucket
ceph osd crush move <name> <crush-location>
ceph osd crush move sto01-1 root=ssd room=room1
- Delete bucket
ceph osd crush remove <name>
Pool commands
- Set the replication size of the pool
ceph osd pool set <name> size <repl-size>
- Get the replication size
ceph osd dump | grep 'replicated size'
- Create a new pool
ceph osd pool create <name> <pg_num> <pgp_num> replicated
- List all Pools
ceph osd pool ls
- Delete a pool
ceph osd pool delete <name> <name> --yes-i-really-really-mean-it
- Rename a pool
ceph osd pool rename <name> <new-name>
Control Ceph Services
- List all ceph services/units on a node
sudo systemctl status ceph\*.service ceph\*.target
- Restart all ceph services
sudo systemctl stop ceph.target
sudo systemctl start ceph.target
- Controlling daemons by type
#stop services by type
sudo systemctl stop ceph-mon\*.service ceph-mon.target
sudo systemctl stop ceph-osd\*.service ceph-osd.target
sudo systemctl stop ceph-mds\*.service ceph-mds.target
#start services
sudo systemctl start ceph-osd.target
sudo systemctl start ceph-mon.target
sudo systemctl start ceph-mds.target
- Start specific daemons
sudo systemctl start ceph-osd@{id}
sudo systemctl start ceph-mon@{hostname}
sudo systemctl start ceph-mds@{hostname}
Rados Block Devices
- Create new Block Device
rbd create --size 4096 <pool>/<name>
sudo rbd feature disable <pool>/<name> exclusive-lock object-map fast-diff deep-flatten
sudo rbd map <pool>/<name> --name client.admin
sudo mkfs.ext4 -m0 /dev/rbd1
sudo mount /dev/rbd1 /mnt
- List all block devices
rbd ls
#list in specific pool
rbd ls <pool>
- Retrieve image information
rbd info <pool>/<image>
- Resize Block Device
rbd resize --size 2048 <pool>/<name> #increase
rbd resize --size 2048 <pool>/<name> --allow-shrink #decrease
- Remove RBD
sudo umount /mnt
sudo rbd unmap /dev/rbd1
rbd rm <pool>/<name>
#if there are still watchers, look where the image is mapped
rbd showmapped
sudo service rbdmap stop
rbd rm <pool>/<name>
#or
rbd info <pool>/<name>
rados -p rbd listwatchers rbd_header.<end-of-block-prefix-number>
Create new user and mount storage
- Get all users and cluster rights / authorization
ceph auth list
- Create the new user (log in on a ceph admin node)
ceph auth add client.fs_user mon 'allow r' osd 'allow rwx pool=fs, allow rwx pool=fs_meta' mds 'allow r'
- Get user key
ceph auth get-key client.fs_user | tee client.fs_user.key
- Get user keyring
ceph auth get client.fs_user -o ceph.client.fs_user.keyring
- Copy the key and keyring to the node
- Install ceph on node
ceph-deploy install 10.0.60.4
ceph-deploy config push 10.0.60.4
- Now log in to the node. Make sure that the metadata servers and all storage servers are reachable by IP and hostname! (Add maas-10.0.10.1 as DNS server)
- Move secret and key to ceph directory
sudo mv ~/ceph.client.fs_user.keyring /etc/ceph/
sudo mv ~/client.fs_user.key /etc/ceph/
- Check Authentication
ceph --id=fs_user health
- Install Ceph Filesystem package
sudo apt install ceph-fs-common
- Mount Filesystem
sudo mount -t ceph 10.0.10.1:6789:/ /mnt/cephfs -o name=fs_user,secretfile=/etc/ceph/client.fs_user.key
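To make this mount persistent across reboots, a matching /etc/fstab entry can be added (a sketch; `_netdev` delays the mount until the network is up):

```
10.0.10.1:6789:/  /mnt/cephfs  ceph  name=fs_user,secretfile=/etc/ceph/client.fs_user.key,_netdev,noatime  0  2
```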
Change user permission
- Change permissions
ceph auth caps client.fs_user mds 'allow rw' mon 'allow r' osd 'allow rwx pool=fs'
Path restriction
- Path restriction works only when mounted with ceph-fuse
- MDS: allow read + write on path
mds 'allow r' -> read-only access to the whole fs
mds 'allow rw' -> read and write access to the whole fs
mds 'allow r, allow rw path=/data_tonetto' -> read access to the whole fs, write access only below /data_tonetto
mds 'allow rw path=/datasets' -> read and write access only below /datasets
- MON: only read necessary
mon 'allow r' -> read the cluster maps, basic access to the cluster
- OSD: allow read and write to fs pool
osd 'allow rwx pool=fs' -> allow read and write to fs pool
- All command for path restriction
ceph auth caps client.fs_user mds 'allow rw path=/datasets, allow rw path=/data_tonetto' mon 'allow r' osd 'allow rw pool=fs'
#on client
#install fuse
sudo apt install ceph-fuse
#mount directories
sudo ceph-fuse -n client.fs_user --keyring=/etc/ceph/ceph.client.fs_user.keyring -r /data_tonetto /data
sudo ceph-fuse -n client.fs_user --keyring=/etc/ceph/ceph.client.fs_user.keyring -r /datasets /datasets
Installation
The installation is done from a single server with ceph-deploy.
Following this guide: http://docs.ceph.com/docs/master/rados/deployment/
- Initial installation of ceph-admin and nodes
ssh mon01-cm
mkdir sto_cluster
cd sto_cluster
wget -q -O- 'https://download.ceph.com/keys/release.asc' | sudo apt-key add -
echo deb https://download.ceph.com/debian-kraken/ $(lsb_release -sc) main | sudo tee /etc/apt/sources.list.d/ceph.list
sudo apt-get update && sudo apt-get install ceph-deploy
ceph-deploy new mon01-cm
vim ceph.conf
------------------------------------------------------------
[global]
fsid = b2fe6c5c-10d5-4eb3-af02-121d6493d6bf
mon_initial_members = mon01
mon_host = 10.0.10.1
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
public_network = 10.0.0.0/16
osd_journal_size = 14336
#reasonable number of replicas and placement groups
osd_pool_default_size = 3     #Write an object 3 times
osd_pool_default_min_size = 1 #Allow writing 1 copy in degraded state
osd_pool_default_pg_num = 256
osd_pool_default_pgp_num = 256
------------------------------------------------------------
#install python on nodes sto01, sto02
sudo apt install python python-apt
#enable nat on gateway
ceph-deploy install mon01-cm sto01 sto02
ceph-deploy disk list sto01
ceph-deploy mon create-initial
#sometimes two monitors are running: mon01-cm + mon01
#stop mon01: sudo systemctl stop ceph-mon@mon01
#then run the create-initial command again
sudo chmod +r /etc/ceph/ceph.client.admin.keyring
ceph-deploy admin sto01 sto02
#run on both nodes: sudo chmod +r /etc/ceph/ceph.client.admin.keyring
ceph status
- Add OSDs
ceph-deploy osd prepare sto01:/dev/sdb sto01:/dev/sdc sto01:/dev/sdd sto01:/dev/sde sto01:/dev/sdf sto01:/dev/sdg sto01:/dev/sdh sto01:/dev/sdi sto01:/dev/sdj sto01:/dev/sdk sto01:/dev/sdl sto01:/dev/sdo sto01:/dev/sdp sto01:/dev/sdq sto01:/dev/sdr sto02:/dev/sdb sto02:/dev/sdc sto02:/dev/sdd sto02:/dev/sde sto02:/dev/sdf sto02:/dev/sdg sto02:/dev/sdh sto02:/dev/sdi sto02:/dev/sdj sto02:/dev/sdk sto02:/dev/sdl sto02:/dev/sdo sto02:/dev/sdp sto02:/dev/sdq sto02:/dev/sdr
ceph-deploy osd activate sto01:/dev/sdb1 sto01:/dev/sdc1 sto01:/dev/sdd1 sto01:/dev/sde1 sto01:/dev/sdf1 sto01:/dev/sdg1 sto01:/dev/sdh1 sto01:/dev/sdi1 sto01:/dev/sdj1 sto01:/dev/sdk1 sto01:/dev/sdl1 sto01:/dev/sdo1 sto01:/dev/sdp1 sto01:/dev/sdq1 sto01:/dev/sdr1 sto02:/dev/sdb1 sto02:/dev/sdc1 sto02:/dev/sdd1 sto02:/dev/sde1 sto02:/dev/sdf1 sto02:/dev/sdg1 sto02:/dev/sdh1 sto02:/dev/sdi1 sto02:/dev/sdj1 sto02:/dev/sdk1 sto02:/dev/sdl1 sto02:/dev/sdo1 sto02:/dev/sdp1 sto02:/dev/sdq1 sto02:/dev/sdr1
#for the cluster to reach HEALTH_OK
ceph osd pool set rbd size 2
- Create Erasure Coded pool
A profile must be set when creating a new erasure-coded pool. The profile cannot be changed later! To change it, a new pool has to be created and all data moved from the first pool to the second.
#show the default profile
ceph osd erasure-code-profile get default
#create custom profile
ceph osd erasure-code-profile set fs1 k=8 m=4 ruleset-failure-domain=osd
#optional take another crush root
ceph osd erasure-code-profile set ruleset-root=ssd
#create pool
ceph osd pool create ec 12 12 erasure fs1
- Create and Mount CephFS
#create FS pools - recommended to use a higher replication level for the metadata pool
#any data loss there can render the whole filesystem inaccessible
ceph osd pool create fs 256 256 replicated
ceph osd pool create fs_meta 256 256 replicated
ceph osd pool set fs size 2
ceph osd pool set fs_meta size 3
#create filesystem
ceph fs new ceph_fs fs_meta fs
#mount filesystem on client
ceph-deploy install emu11
#on client:
sudo apt install ceph-fs-common ceph-fuse
cat /etc/ceph/ceph.client.admin.keyring
#copy only the key into a new file /etc/ceph/admin.secret
sudo mount -t ceph 10.0.10.1:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret
#unmount ceph-fuse / ceph kernel client
sudo fusermount -u /mnt/cephfs
sudo umount /mnt/cephfs
- Benchmark cluster
Commands
#network benchmark
sudo apt install iperf
sto01: iperf -s
emu12: iperf -c sto01
#disk benchmark
sudo hdparm -tT /dev/sdc
sudo hdparm -tT --direct /dev/sdc
sudo mount /dev/sdc /mnt/tmp
cd /mnt/tmp
sudo dd if=/dev/zero of=tempfile bs=1M count=1024 conv=fdatasync,notrunc
sudo dd if=/dev/zero of=tempfile2 bs=1G count=5 conv=fdatasync,notrunc
sudo dd if=tempfile of=/dev/null bs=1M count=1024
sudo dd if=tempfile2 of=/dev/null bs=1G count=5
#bench rados cluster
#normal cluster
rados bench -p rbd 60 write --no-cleanup
#read random for 60 seconds
rados bench -p rbd 60 rand
#cleanup mess on every pool
rados -p <pool> cleanup
Data
#network benchmark
sudo apt install iperf
sto01: iperf -s
emu12: iperf -c sto01
9.40 Gbits/sec - sto01 -> sto02
9.41 Gbits/sec - sto01 -> sto02
9.39 Gbits/sec - emu12 -> sto02
9.35 Gbits/sec - emu12 -> sto01
#disk benchmark
sudo hdparm -tT /dev/sdc
Timing cached reads: 20318 MB in 2.00 seconds = 10168.52 MB/sec
Timing buffered disk reads: 700 MB in 3.01 seconds = 232.81 MB/sec
sudo hdparm -tT --direct /dev/sdc
Timing O_DIRECT cached reads: 886 MB in 2.00 seconds = 442.57 MB/sec
Timing O_DIRECT disk reads: 358 MB in 3.00 seconds = 119.31 MB/sec
sudo mkdir /mnt/tmp
sudo mount /dev/sdc /mnt/tmp
cd /mnt/tmp
sudo dd if=/dev/zero of=tempfile bs=1M count=1024 conv=fdatasync,notrunc
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 5.10234 s, 210 MB/s
sudo dd if=/dev/zero of=tempfile2 bs=1G count=5 conv=fdatasync,notrunc
5368709120 bytes (5.4 GB, 5.0 GiB) copied, 25.7669 s, 208 MB/s
sudo dd if=tempfile of=/dev/null bs=1M count=1024
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.204041 s, 5.9 GB/s
sudo dd if=tempfile2 of=/dev/null bs=1G count=5
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.280824 s, 4.8 GB/s
#disk benchmark - ssd
sudo hdparm -tT /dev/sdn4
Timing cached reads: 20768 MB in 2.00 seconds = 10394.23 MB/sec
Timing buffered disk reads: 1458 MB in 3.00 seconds = 485.62 MB/sec
sudo hdparm -tT --direct /dev/sdn4
Timing O_DIRECT cached reads: 646 MB in 2.00 seconds = 322.43 MB/sec
Timing O_DIRECT disk reads: 1494 MB in 3.00 seconds = 497.91 MB/sec
sudo mkfs.xfs -f -i size=2048 /dev/sdn4
sudo dd if=/dev/zero of=tempfile bs=1M count=1024 conv=fdatasync,notrunc
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 8.14751 s, 135 MB/s
sudo dd if=/dev/zero of=tempfile2 bs=1G count=5 conv=fdatasync,notrunc
5368709120 bytes (5.4 GB, 5.0 GiB) copied, 40.4716 s, 133 MB/s
sudo dd if=tempfile of=/dev/null bs=1M count=1024
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.204041 s, 5.8 GB/s
sudo dd if=tempfile2 of=/dev/null bs=1G count=5
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.280824 s, 5.2 GB/s
#bench rados cluster
#normal cluster
rados bench -p rbd 60 write --no-cleanup
2017-03-30 16:58:01.468876 min lat: 0.0414839 max lat: 1.96491 avg lat: 0.378728
  sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
   60      16    2529     2513   167.516       96   0.265697    0.378728
Total time run:         60.665624
Total writes made:      2530
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     166.816
Stddev Bandwidth:       57.3734
Max bandwidth (MB/sec): 336
Min bandwidth (MB/sec): 84
Average IOPS:           41
Stddev IOPS:            14
Max IOPS:               84
Min IOPS:               21
Average Latency(s):     0.383575
Stddev Latency(s):      0.259392
Max latency(s):         1.96491
Min latency(s):         0.0414839
#two concurrent benchmarks on both nodes
Total time run:         30.422575
Total writes made:      744
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     97.8221
Stddev Bandwidth:       73.2887
Max bandwidth (MB/sec): 352
Min bandwidth (MB/sec): 16
Average IOPS:           24
Stddev IOPS:            18
Max IOPS:               88
Min IOPS:               4
Average Latency(s):     0.653725
Stddev Latency(s):      0.609964
Max latency(s):         3.22897
Min latency(s):         0.0601762
#read random for 60 seconds
rados bench -p rbd 60 rand
Total time run:       60.054755
Total reads made:     39269
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   2615.55
Average IOPS:         653
Stddev IOPS:          26
Max IOPS:             725
Min IOPS:             604
Average Latency(s):   0.0237641
Max latency(s):       0.179591
Min latency(s):       0.00335397
#ssd journal cluster
rados bench -p data-ssd 60 write --no-cleanup
Total time run:         60.250558
Total writes made:      2516
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     167.036
Stddev Bandwidth:       11.7987
Max bandwidth (MB/sec): 196
Min bandwidth (MB/sec): 144
Average IOPS:           41
Stddev IOPS:            3
Max IOPS:               49
Min IOPS:               36
Average Latency(s):     0.383063
Stddev Latency(s):      0.126893
Max latency(s):         1.07908
Min latency(s):         0.0446201
#two concurrent benchmarks on both nodes
Total time run:         30.744405
Total writes made:      682
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     88.7316
Stddev Bandwidth:       14.6719
Max bandwidth (MB/sec): 140
Min bandwidth (MB/sec): 60
Average IOPS:           22
Stddev IOPS:            3
Max IOPS:               35
Min IOPS:               15
Average Latency(s):     0.713941
Stddev Latency(s):      0.306251
Max latency(s):         1.82924
Min latency(s):         0.0950345
#reads
Total time run:       60.049589
Total reads made:     37142
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   2474.09
Average IOPS:         618
Stddev IOPS:          24
Max IOPS:             669
Min IOPS:             561
Average Latency(s):   0.0251675
Max latency(s):       0.208302
Min latency(s):       0.00336694
#---------------------------------------
#rados Block Device
sudo rbd create image02 --size 4096 --pool data-ssd
sudo rbd feature disable image02 exclusive-lock object-map fast-diff deep-flatten --pool data-ssd
sudo rbd map image02 --pool data-ssd --name client.admin
sudo mkfs.ext4 -m0 /dev/rbd1
sudo mount /dev/rbd1 /mnt/device-bl2
rbd bench-write image02 --pool data-ssd
#rbd pool:      elapsed: 12  ops: 262144  ops/sec: 21747.94  bytes/sec: 89079547.35
#data-ssd pool: elapsed: 11  ops: 262144  ops/sec: 23236.06  bytes/sec: 95174905.63
- RBD Normal Benchmarks
sudo hdparm -tT /dev/rbd0 #normal
Timing cached reads: 20254 MB in 2.00 seconds = 10139.08 MB/sec
Timing buffered disk reads: 2722 MB in 3.00 seconds = 907.20 MB/sec
sudo hdparm --direct -tT /dev/rbd0
Timing O_DIRECT cached reads: 3720 MB in 2.00 seconds = 1860.49 MB/sec
Timing O_DIRECT disk reads: 4096 MB in 2.96 seconds = 1384.69 MB/sec
sudo hdparm -tT /dev/rbd1 #ssd
Timing cached reads: 20072 MB in 2.00 seconds = 10047.58 MB/sec
Timing buffered disk reads: 2766 MB in 3.00 seconds = 921.83 MB/sec
sudo hdparm --direct -tT /dev/rbd1
Timing O_DIRECT cached reads: 3872 MB in 2.00 seconds = 1936.37 MB/sec
Timing O_DIRECT disk reads: 4096 MB in 1.77 seconds = 2311.77 MB/sec
#normal
sudo dd if=/dev/zero of=tempfile bs=1M count=1024 conv=fdatasync,notrunc
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 4.1852 s, 257 MB/s
sudo dd if=tempfile of=/dev/null bs=1M count=1024
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.184213 s, 5.8 GB/s
#ssd
sudo dd if=/dev/zero of=tempfile2 bs=1M count=1024 conv=fdatasync,notrunc
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 6.91692 s, 155 MB/s
sudo dd if=tempfile of=/dev/null bs=1M count=1024
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.184213 s, 5.8 GB/s
Conclusion
Raw Network Speed (iperf -s / iperf -c <address>)
9.40 Gbits/sec - sto01 -> sto02
9.41 Gbits/sec - sto01 -> sto02
9.39 Gbits/sec - emu12 -> sto02
9.35 Gbits/sec - emu12 -> sto01
-> ca. 1170 MB/sec

Raw Disk Speed (sudo hdparm -tT /dev/<drive>)
ssd: Timing cached reads: 10220.44 / 10014.83 / 10406.89 / 10233.53 MB/sec
ssd: Timing buffered reads: 321.97 / 372.85 / 324.69 / 371.13 MB/sec
normal: Timing cached reads: 9862.59 / 10251.11 / 10005.25 / 10581.66 MB/sec
normal: Timing buffered reads: 237.98 / 240.19 / 233.71 / 208.83 MB/sec
ssd-DIRECT: Timing cached reads: 345.82 / 372.97 / 377.45 / 370.21 MB/sec
ssd-DIRECT: Timing buffered reads: 404.91 / 436.78 / 436.65 / 411.83 MB/sec
normal-DIRECT: Timing cached reads: 879.61 / 853.61 / 437.60 / 867.01 MB/sec
normal-DIRECT: Timing buffered reads: 119.51 / 121.01 / 175.96 / 112.51 MB/sec

Rados Bench Write (rados bench -p <pool> 10/20/30/60 write --no-cleanup)
ssd-primary: 246.749 / 216.592 / 232.526 / 186.368 MB/sec
ssd-only: 118.422 / 121.656 / 116.717 / 108.17 MB/sec
rbd-normal: 260.393 / 282.246 / 268.01 / 271.671 MB/sec
erasure-coded: 244.602 / 231.217 / 242.125 / 229.342 MB/sec

Rados Read Seq (rados bench -p <pool> 10/20/30 seq)
ssd-primary: 975.046 / 1025.61 / 1030.69 MB/sec
ssd-only: 1071.45 / 1065.46 / 1045.66 MB/sec
rbd-normal: 1009.87 / 1022.06 / 1054.07 MB/sec
erasure-coded: 962.095 / 956.349 / 974.866 MB/sec

Rados Read Rand (rados bench -p <pool> 10/20/30 rand)
ssd-primary: 993.921 / 1034.7 / 1062.02 MB/sec
ssd-only: 1055.95 / 1039.78 / 1063.78 MB/sec
rbd-normal: 1037.47 / 1057.92 / 1049.02 MB/sec
erasure-coded: 964.135 / 952.147 / 950.543 MB/sec

--> Network bottleneck: 1170 MB/sec
--> Single Hard Drive Read: 230 MB/sec normal, 130 MB/sec direct
--> Single SSD Read: 330 MB/sec normal, 430 MB/sec direct
--> All Cluster Read Operations are capped by the network: 1050 MB/sec
--> Erasure Coded Pool slightly less: 955 MB/sec
--> Single Hard Drive Write: 210 MB/sec
--> Single SSD Write: 130 MB/sec
--> Cluster Write Operations: 265 MB/sec normal, 120 MB/sec ssd, 235 MB/sec erasure-coded