Overview
--------
This survey script performs a series of LNet selftest (LST) benchmarks between
groups of LNet peers. It can be used to characterize the performance of the LNet
interface(s) on Lustre servers, Lustre clients, or LNet routers.
The LST client group is defined using the '-f' flag, and the LST server group
is defined using the '-t' flag. Both of these flags take a space-separated or
comma-separated list of LNet NIDs. The '-M' and '-N' options can be used to
divide the client or server group into multiple smaller groups.
For example, given 16 clients and 8 servers, '-M 8' and '-N 2' would create
two client groups with eight peers in each group, and four server groups with
two peers in each group. Every server group is tested against every client
group, so this would result in 4*2=8 test iterations.
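The grouping arithmetic above can be sketched in shell. This is only an illustration of the description, not lst-survey's actual implementation: the split_groups helper and the 10.0.x.x NIDs are hypothetical, and emitting a final, smaller group when the list does not divide evenly is an assumption.

```shell
#!/bin/sh
# Illustrative sketch of the '-M' / '-N' grouping described above.
# split_groups and the NIDs below are hypothetical, not part of lst-survey.

# split_groups SIZE NID...
# Prints one group per line, with SIZE space-separated NIDs in each group.
split_groups() {
    size=$1; shift
    line=""; count=0
    for nid in "$@"; do
        line="${line:+$line }$nid"
        count=$((count + 1))
        if [ "$count" -eq "$size" ]; then
            printf '%s\n' "$line"
            line=""; count=0
        fi
    done
    # Assumption: leftover NIDs form a final, smaller group.
    if [ -n "$line" ]; then printf '%s\n' "$line"; fi
}

# Build 16 hypothetical client NIDs and 8 hypothetical server NIDs.
clients=""; i=1
while [ "$i" -le 16 ]; do clients="$clients 10.0.0.$i@tcp"; i=$((i + 1)); done
servers=""; i=1
while [ "$i" -le 8 ]; do servers="$servers 10.0.1.$i@tcp"; i=$((i + 1)); done

# Word splitting of $clients and $servers is intentional here.
client_groups=$(split_groups 8 $clients)   # like '-M 8'
server_groups=$(split_groups 2 $servers)   # like '-N 2'

# Pair every server group with every client group.
old_ifs=$IFS
IFS='
'
iterations=0
for sg in $server_groups; do
    for cg in $client_groups; do
        iterations=$((iterations + 1))
        printf 'iteration %d: servers=[%s] clients=[%s]\n' "$iterations" "$sg" "$cg"
    done
done
IFS=$old_ifs

echo "total iterations: $iterations"
```

With these inputs the last line printed is "total iterations: 8", matching the 4*2=8 count above.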
By default, each test iteration performs 4k read and write, 1m read and write,
and ping LST benchmarks.
A directory is created in the current working directory to store results.
The csv output is written to a results.<timestamp>.csv file and the full
lst.sh output is stored in an lst.<timestamp>.out file. An alternative output
directory can be specified with the '-O' argument.
Various options exist to customize the benchmarks that are run. See
'lst-survey -h' for more information.
A note on interpreting the results:
By default, lst-survey displays bandwidth and rate statistics for peers in the
server group as reported by the LST utility.
These statistics reported by LST can be confusing because a "read" test will
typically report read bandwidth that is lower than write bandwidth, and a
"write" test will typically report write bandwidth that is lower than read
bandwidth. This is because a "read" test involves peers in the client group
setting up a sink that is then written to by peers in the server group, and a
"write" test involves the clients setting up a source that is then read by the
servers. Thus, the read test is really measuring the write performance of the
servers and the write test is really measuring the read performance of the
servers.
The '-g clients' option can be used to instead report the client bandwidth and
rate statistics. In this case, the reported stats will align with the benchmarks
in the expected manner.
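As a reading aid, the mapping described above can be written as a small shell helper that names the bandwidth column carrying the bulk data for a given test. The payload_column function is hypothetical and not part of lst-survey. The column names are taken from the CSV header shown in Example 1. Ping tests are not covered by the note above, so the helper prints "-" for them.

```shell
#!/bin/sh
# Hypothetical helper: which bandwidth column carries the bulk data?
# payload_column MODE [GROUP]
#   MODE  is a test name such as read_4k or write_1m
#   GROUP is "servers" (the default view) or "clients" (the '-g clients' view)
payload_column() {
    mode=$1
    group=${2:-servers}
    case "$group:$mode" in
        servers:read*)  echo "Write_BW" ;;  # servers write into the clients' sink
        servers:write*) echo "Read_BW" ;;   # servers read from the clients' source
        clients:read*)  echo "Read_BW" ;;   # client stats align with the test name
        clients:write*) echo "Write_BW" ;;
        *)              echo "-" ;;         # e.g. ping, not covered by the note above
    esac
}

payload_column read_1m           # prints Write_BW
payload_column write_4k          # prints Read_BW
payload_column read_1m clients   # prints Read_BW
```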
Example 1: Default options
# pdsh -w n0[0-3] lctl list_nids | dshbak -c
----------------
n00
----------------
172.18.2.5@tcp
----------------
n01
----------------
172.18.2.6@tcp
----------------
n02
----------------
172.18.2.7@tcp
----------------
n03
----------------
172.18.2.8@tcp
# ./lst-survey -t 172.18.2.5@tcp,172.18.2.6@tcp -f 172.18.2.7@tcp,172.18.2.8@tcp
CSV results: /tmp/lst_survey.1666207637/results.1666207637.csv
LST output: /tmp/lst-survey/lst_survey.1666207637/lst.1666207637.out
Commence lst-survey - Wed 19 Oct 2022 01:27:17 PM MDT
Server Group: 172.18.2.5@tcp 172.18.2.6@tcp
Client Group: 172.18.2.7@tcp 172.18.2.8@tcp
Mode Read MB/s Read RPC/s Write MB/S Write RPC/s
read 4k 22 149981 608 299961
read 1m 2 14405 14405 28808
write 4k 489 241229 18 241229
write 1m 11463 22924 1 22924
ping 25 167928 25 167928
Finished lst-survey - Wed 19 Oct 2022 01:28:08 PM MDT
# cat /tmp/lst_survey.1666207637/results.1666207637.csv
Servers,Clients,Mode,Read_BW,Read_Rate,Write_BW,Write_Rate,Server_Errors,Client_Errors
172.18.2.5@tcp 172.18.2.6@tcp,172.18.2.7@tcp 172.18.2.8@tcp,read_4k,22,149981,608,299961,0,0
172.18.2.5@tcp 172.18.2.6@tcp,172.18.2.7@tcp 172.18.2.8@tcp,read_1m,2,14405,14405,28808,0,0
172.18.2.5@tcp 172.18.2.6@tcp,172.18.2.7@tcp 172.18.2.8@tcp,write_4k,489,241229,18,241229,0,0
172.18.2.5@tcp 172.18.2.6@tcp,172.18.2.7@tcp 172.18.2.8@tcp,write_1m,11463,22924,1,22924,0,0
172.18.2.5@tcp 172.18.2.6@tcp,172.18.2.7@tcp 172.18.2.8@tcp,ping,25,167928,25,167928,0,0
#
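The CSV file lends itself to post-processing with standard tools. The following is a minimal sketch, not a feature of lst-survey. It copies the Example 1 rows into a temporary file, then uses awk to print the bandwidth columns for each test and flag any row with a nonzero error count. It assumes the column order in the header above and that no field contains a comma, which holds here because the NIDs within a group are space-separated.

```shell
#!/bin/sh
# Sketch: summarize a results CSV and flag rows that reported errors.
# The rows below are copied from the Example 1 output above.
csv=$(mktemp)
cat > "$csv" <<'EOF'
Servers,Clients,Mode,Read_BW,Read_Rate,Write_BW,Write_Rate,Server_Errors,Client_Errors
172.18.2.5@tcp 172.18.2.6@tcp,172.18.2.7@tcp 172.18.2.8@tcp,read_4k,22,149981,608,299961,0,0
172.18.2.5@tcp 172.18.2.6@tcp,172.18.2.7@tcp 172.18.2.8@tcp,read_1m,2,14405,14405,28808,0,0
172.18.2.5@tcp 172.18.2.6@tcp,172.18.2.7@tcp 172.18.2.8@tcp,write_4k,489,241229,18,241229,0,0
172.18.2.5@tcp 172.18.2.6@tcp,172.18.2.7@tcp 172.18.2.8@tcp,write_1m,11463,22924,1,22924,0,0
172.18.2.5@tcp 172.18.2.6@tcp,172.18.2.7@tcp 172.18.2.8@tcp,ping,25,167928,25,167928,0,0
EOF

# Fields: $3 = Mode, $4 = Read_BW, $6 = Write_BW,
#         $8 = Server_Errors, $9 = Client_Errors
summary=$(awk -F, '
    NR == 1 { next }    # skip the header row
    {
        status = ($8 + $9 > 0) ? "ERRORS" : "ok"
        printf "%-9s read=%s MB/s write=%s MB/s %s\n", $3, $4, $6, status
    }
' "$csv")

printf '%s\n' "$summary"
rm -f "$csv"
```

To run the same awk program against a real survey, point it at the results.<timestamp>.csv file in the output directory instead of the temporary file.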
Example 2: Divide the servers into groups of size 1
# ./lst-survey -t 172.18.2.5@tcp,172.18.2.6@tcp -f 172.18.2.7@tcp,172.18.2.8@tcp -N 1
CSV results: /tmp/lst_survey.1666207844/results.1666207844.csv
LST output: /tmp/lst_survey.1666207844/lst.1666207844.out
Commence lst-survey - Wed 19 Oct 2022 01:30:44 PM MDT
Server Group: 172.18.2.5@tcp
Client Group: 172.18.2.7@tcp 172.18.2.8@tcp
Mode Read MB/s Read RPC/s Write MB/S Write RPC/s
read 4k 25 167068 678 334135
read 1m 2 16186 16186 32366
write 4k 512 252613 19 252612
write 1m 11353 22706 1 22704
ping 29 192358 29 192358
Finished lst-survey - Wed 19 Oct 2022 01:31:34 PM MDT
Commence lst-survey - Wed 19 Oct 2022 01:31:34 PM MDT
Server Group: 172.18.2.6@tcp
Client Group: 172.18.2.7@tcp 172.18.2.8@tcp
Mode Read MB/s Read RPC/s Write MB/S Write RPC/s
read 4k 22 144821 587 289642
read 1m 2 16841 16843 33681
write 4k 498 245552 18 245552
write 1m 11611 23219 1 23217
ping 22 145374 22 145374
Finished lst-survey - Wed 19 Oct 2022 01:32:25 PM MDT
#
Example 3: Divide the servers and clients into groups of size 1
# ./lst-survey -t 172.18.2.5@tcp,172.18.2.6@tcp -f 172.18.2.7@tcp,172.18.2.8@tcp -N 1 -M 1
CSV results: /tmp/lst_survey.1666208473/results.1666208473.csv
LST output: /tmp/lst_survey.1666208473/lst.1666208473.out
Commence lst-survey - Wed 19 Oct 2022 01:41:13 PM MDT
Server Group: 172.18.2.5@tcp
Client Group: 172.18.2.7@tcp
Mode Read MB/s Read RPC/s Write MB/S Write RPC/s
read 4k 11 75112 304 150224
read 1m 1 8808 8809 17616
write 4k 240 118402 9 118402
write 1m 6561 13119 1 13118
ping 13 90402 13 90402
Finished lst-survey - Wed 19 Oct 2022 01:42:03 PM MDT
Commence lst-survey - Wed 19 Oct 2022 01:42:03 PM MDT
Server Group: 172.18.2.5@tcp
Client Group: 172.18.2.8@tcp
Mode Read MB/s Read RPC/s Write MB/S Write RPC/s
read 4k 13 90017 365 180034
read 1m 1 7333 7328 14655
write 4k 280 138173 10 138173
write 1m 8694 17388 1 17388
ping 15 98316 15 98316
Finished lst-survey - Wed 19 Oct 2022 01:42:53 PM MDT
Commence lst-survey - Wed 19 Oct 2022 01:42:53 PM MDT
Server Group: 172.18.2.6@tcp
Client Group: 172.18.2.7@tcp
Mode Read MB/s Read RPC/s Write MB/S Write RPC/s
read 4k 9 64613 262 129225
read 1m 1 9101 9101 18201
write 4k 212 104575 7 104573
write 1m 6769 13537 1 13539
ping 10 71612 10 71612
Finished lst-survey - Wed 19 Oct 2022 01:43:44 PM MDT
Commence lst-survey - Wed 19 Oct 2022 01:43:44 PM MDT
Server Group: 172.18.2.6@tcp
Client Group: 172.18.2.8@tcp
Mode Read MB/s Read RPC/s Write MB/S Write RPC/s
read 4k 12 83144 337 166287
read 1m 1 7582 7584 15166
write 4k 293 144601 11 144602
write 1m 8913 17824 1 17825
ping 11 78409 11 78409
Finished lst-survey - Wed 19 Oct 2022 01:44:35 PM MDT
#