The Engineering Applications Group (EAG) at Cray Research has worked for years helping vendors optimize applications on Cray Research supercomputers. Much of this work concentrated on vectorizing and parallelizing the codes to exploit the ever-increasing hardware speed. However, little work was done to ensure that the I/O performance of these codes on Cray Research systems was optimal. With the improvements in performance due to code optimization, the I/O requirements of large-scale applications have become even more obvious. For many large runs in structural analysis applications, the I/O wait time would often exceed the CPU time. This overhead increases the time to solution and also makes gains through parallel processing less effective in reducing time to solution.
In early 1992 a project was started within EAG to allow applications to take full advantage of the high performance I/O features of Cray Research systems. EAG has been developing an I/O library, libeag_ffio.a, which will allow individual applications to:
This documentation is intended to aid users of eag_ffio in the understanding of the I/O in their programs and how to include and utilize eag_ffio in the optimization of the I/O performance.
Although eag_ffio was originally written for Cray Research systems running the UNICOS operating system, it has been ported to IRIX with much of the functionality intact.
An understanding of the following definitions is essential when attempting to optimize the I/O of any program.
The following program will write four bytes from memory to a file.
/* 1 */ #include <fcntl.h>
/* 2 */ #include <unistd.h>
/* 3 */ int main(void)
/* 4 */ {
/* 5 */ int fd, ret ;
/* 6 */ char *data_ptr = "abcd" ;
/* 7 */ fd = open("file.dat", O_RDWR|O_CREAT , 0640 ) ;
/* 8 */ ret = lseek( fd , 3001 , SEEK_SET ) ;
/* 9 */ ret = write( fd , data_ptr , 4) ;
/* 10 */ close(fd);
/* 11 */ }
line 7 : The open kernel call returns a file descriptor, fd, which the
program uses to reference the file for future I/O operations.
line 8 : The lseek kernel call informs the kernel to set the current
position of file fd to byte 3001.
line 9 : The write kernel call requests that the kernel copy four bytes
from the user's memory location pointed to by data_ptr, to the
file fd, starting at the current position.
What must the kernel do to satisfy this write request?
When examining the I/O performance of a program there are four items of interest:
As a rule, the greater the number of logical I/O requests, the greater the system CPU time. It is generally desirable to reduce the number of logical I/O requests by increasing the size of each logical I/O request. This has a double impact. There are fewer logical I/O requests for the kernel to handle, reducing the system CPU time. The I/O wait times will also be reduced, since the fixed startup time of any I/O request is spread over larger requests.
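The effect of request size on the number of logical requests can be estimated with simple arithmetic. The following is a minimal sketch and is not part of eag_ffio; the function name logical_requests is hypothetical.

```c
/* Number of logical I/O requests needed to move total_bytes when each
 * request carries at most request_bytes (rounded up).
 * Hypothetical helper for illustration; returns 0 if request_bytes is 0. */
unsigned long logical_requests(unsigned long total_bytes,
                               unsigned long request_bytes)
{
    if (request_bytes == 0)
        return 0;
    return (total_bytes + request_bytes - 1) / request_bytes;
}
```

For instance, moving 16384000 bytes in 16384-byte requests takes 1000 requests, while moving the same data in 376832-byte requests takes 44.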
To understand how to tune I/O, one first must have an understanding of the I/O measurement tools that are available to gauge the performance. The two main tools are ja (job accounting) and procstat (process statistics).
The first (and simplest) tool is ja. The output from ja will give the I/O wait times, amount of data transferred, and the number of logical and physical requests.
Sample output of relevant I/O fields from "ja -cl":
Job Accounting - Command Report
===============================
Command ... I/O Wait I/O Wait ... Kwords Log I/O Phy I/O
Name ... Sec Lck Sec Unlck ... Xferred Request Request
======= ... ========= ========== ... ======== ======= =======
nastbio ... 4.4296 0.5219 ... 24988.00 1213 1300
Descriptions of the fields (taken from "man ja"):
Other useful options are -s and -e. Used together ("-se"), they produce an extended summary report:
-e Generates an extended summary report; you must use -e with the -s option. The following are descriptions of fields produced by specifying the -e option with the -s option. These fields provide additional accumulated statistics for the reporting period. Several fields contain values only if performance accounting has been enabled; otherwise, the string NA is printed instead.
System Call Time
    Total amount of time (in seconds) that the processes executed system calls.
I/O Wait Time (Terminals)
    Total amount of time in seconds that the processes waited for I/O from and to terminals. This field contains a significant value only if performance accounting is enabled (see devacct(8)).
Wait Time while Swapped
    Total amount of time (in seconds) that the processes waited while swapped out of memory. This field contains a significant value only if performance accounting is enabled (see devacct(8)).
Number of Swaps
    Number of times the processes were swapped out of memory.
Physical Blocks Moved (Bufd I/O)
    Number of physical blocks transferred by processes to and from block devices by using the system buffer I/O interface. This field contains a significant value only if performance accounting is enabled (see devacct(8)).
Physical Blocks Moved (Raw I/O)
    Number of physical blocks transferred by processes to and from block devices by using the raw I/O interface. This field contains a significant value only if performance accounting is enabled (see devacct(8)).
Recommended usage:
procstat -R file_name command arg1 arg2 ... argn
The procstat command produces information on the current activity of a specified command. The information always includes process start and exit, and by default includes memory activity and I/O usage (including SDS usage). The recommended usage of procstat is with the -R option which will produce a run time statistics data file which can be viewed with procview.
Where ja reports total I/O performed by a program, procstat will report I/O activity, on a per-file basis, between the program and the kernel. Procview will report the number of reads, writes, and bytes transferred for each file open/close pair. Additional information includes maximum file sizes, I/O transfer rates, I/O wait times, and types of I/O.
When a program is being developed, the programmer has a wide variety of ways in which to perform I/O. On Cray Research platforms, some of the more common ones are:
When using the eag_ffio library, it is possible to trap I/O which was initiated in any of the above modes. The I/O may then pass through any of the standard UNICOS ffio layers (sds, mr, cos, ...) in addition to the layers available in the eag_ffio library (eie, event, set).
Some of these modes already use ffio by default. Fortran sequential unformatted I/O uses ffio with the cos layer to provide the required cos blocking. Fortran direct access unformatted I/O uses the cachea layer to buffer the Fortran records.
Fortran sequential formatted I/O uses the Standard C I/O library to buffer its I/O. Since it is possible to trap Standard C I/O, it is possible to trap Fortran sequential formatted I/O.
In the standard UNICOS libraries, AQIO uses system calls by default. The user must use an "assign -F system filename" command to force the AQIO package to use ffio for the given file, which will allow it to be trapped by eag_ffio.
On UNICOS systems libeag_ffio.a is delivered as an unsupported and undocumented library with Programming Environment 3.0 and later. Include libeag_ffio when linking:
cc -o main main.c -leag_ffio
On IRIX systems libeag_ffio.so is delivered as an unsupported and undocumented library with MIPSpro 7.2.1 and later. Include libeag_ffio when linking:
cc -o main main.c -lffio -leag_ffio
To check if the link successfully included the eag_ffio library:
what a.out | grep ffopen
should produce a line similar to the following:
trap_ffopen.c 6.4BF0 Jul 31 1993 YMP UNICOS 70
There are 4 environment variables that affect the actions of the eag_ffio library:
Two others, FF_IO_TRACE_FILE and FF_IO_RECOVER_CMD , are discussed in subsequent sections.
The eag_ffio library interprets $FF_IO_LOGFILE as a file to be opened as a destination for ffio diagnostics. eie.diag or event.summary (if requested) output will be sent to $FF_IO_LOGFILE. There are two special cases for the value of FF_IO_LOGFILE, stderr and stdout. If FF_IO_LOGFILE is set to either of these values, the eag_ffio library will use the respective standard I/O stream for output rather than files named stderr or stdout. The default action of eag_ffio is to overwrite the existing log file if it exists. The user may have the log file appended to by prefixing the logfile name with a "+".
example:
setenv FF_IO_LOGFILE +/ptmp/jlb/problem.ffio.log
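The rules for interpreting FF_IO_LOGFILE can be summarized in a few lines of code. The following is a minimal sketch of those rules and is not the library's implementation; the names log_kind, log_dest, and parse_logfile are hypothetical. How the library treats a value such as "+stderr" is not stated above, so the sketch simply assumes it is a file named stderr opened for append.

```c
#include <string.h>

/* Where diagnostics would go for a given FF_IO_LOGFILE value. */
enum log_kind { LOG_STDERR, LOG_STDOUT, LOG_FILE };

struct log_dest {
    enum log_kind kind;   /* standard stream or a named file */
    int append;           /* 1 if the name was prefixed with "+" */
    const char *path;     /* file name without the "+", NULL for streams */
};

/* Classify the value of FF_IO_LOGFILE (hypothetical helper). */
struct log_dest parse_logfile(const char *value)
{
    struct log_dest d = { LOG_FILE, 0, value };

    if (strcmp(value, "stderr") == 0) {
        d.kind = LOG_STDERR;
        d.path = NULL;
    } else if (strcmp(value, "stdout") == 0) {
        d.kind = LOG_STDOUT;
        d.path = NULL;
    } else if (value[0] == '+') {
        d.append = 1;         /* append instead of overwrite */
        d.path = value + 1;   /* assumption: "+" is not part of the name */
    }
    return d;
}
```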
If FF_IO_OPEN_DIAGS is set to any value, and FF_IO_LOGFILE is also set, diagnostic messages concerning the template name matching and layer invocations will be written to $FF_IO_LOGFILE. Setting this variable may produce a lot of output in the logfile, but provides a convenient check of all files used by a user program which are trappable by the eag_ffio library.
This is a critical environment variable for the ffio library. It is in FF_IO_OPTS that the user indicates which layers are to be invoked for selected files. The format of FF_IO_OPTS is a file name template followed by layer specifications enclosed in parentheses.
" *.dat ( event.summary | eie.mem | system ) "
|_____| |__________________________________|
   |                     |
   |                     |__ layer specification string
   |
   |
   |_____ filename template
As each file is opened by ffopen, ffopen attempts to match the incoming file name with the supplied templates in FF_IO_OPTS, reading left to right. Upon finding a match, ffopen will invoke the layers that are specified between the next pair of parentheses. More than one template may be used for each layer specification string, and there may be more than one pair of template strings/layer specifications. A more general FF_IO_OPTS follows:
" *.dat *.save (event) fort.1* fort.2* ( eie | event )"
  |__________| |_____| |_____________| |_____________|
        |         |           |               |____ layer spec string #2
        |         |           |__ file template #2
        |         |
        |         |____ layer spec string #1
        |_____ file template #1
When FF_IO_OPEN_DIAGS and FF_IO_LOGFILE are both set, a diagnostic line similar to the following will be written to $FF_IO_LOGFILE for each open of a file.
ffopen(/tmp/jlb/statics/matrix.lu) ft:12 gt:DAU:251
The string between the parentheses is the pathname of the file to be opened. The file tag of the file follows the ft:, and the group tag follows the gt:. The file tag is an integer value that should be unique to the file. For UNICOS 8.0 and later libraries, the file tag is the Fortran unit number for Fortran opens. The group tag is an integer value that indicates how the file is being opened by the user program. For user convenience there are several strings that are equivalent to the appropriate integer value for group tags. They are:
| String | Value | Description |
|---|---|---|
| WA | 255 | Word addressable package I/O |
| AQ | 254 | AQIO package |
| SQU | 253 | Fortran sequential unformatted |
| SQF | 252 | Fortran sequential formatted |
| DAU | 251 | Fortran direct access unformatted |
| DAF | 250 | Fortran direct access formatted |
| EVT | 249 | event trace file I/O |
| STD | 247 | Standard C I/O (fopen, fread, fwrite, ...) |
File templates may match the pathname, the file tag, or the group tag. For the above ffopen, any of the following would result in a template match:
Multiple integers may be specified on a single ft: or gt: template. The syntax for an inclusive series of integers is the start value and end value separated by a -.
example:
ft:12-15 is equivalent to ft:12 ft:13 ft:14 ft:15
The syntax for a list of integers is a comma separated list.
example:
ft:12,15,18,20 is equivalent to ft:12 ft:15 ft:18 ft:20
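The range and list syntax can be checked against a tag with a short routine. The following is a minimal sketch, not the library's parser; the function name tag_in_spec is hypothetical, it handles only numeric values (not names such as DAU), and it assumes that ranges and comma-separated values may be mixed in one specification, which the text above does not state.

```c
#include <stdlib.h>

/* Return 1 if tag is selected by spec, where spec is the text after
 * "ft:" or "gt:" and holds integers, inclusive ranges (12-15), or a
 * comma-separated list of either; return 0 otherwise.
 * Hypothetical helper for illustration. */
int tag_in_spec(const char *spec, long tag)
{
    const char *p = spec;
    char *end;

    while (*p != '\0') {
        long lo = strtol(p, &end, 10);
        long hi = lo;

        if (end == p)
            return 0;               /* not a number: give up */
        p = end;
        if (*p == '-') {            /* inclusive range lo-hi */
            hi = strtol(p + 1, &end, 10);
            if (end == p + 1)
                return 0;           /* missing end value */
            p = end;
        }
        if (tag >= lo && tag <= hi)
            return 1;
        if (*p == ',')
            p++;                    /* move on to the next list entry */
        else if (*p != '\0')
            return 0;               /* unexpected character */
    }
    return 0;
}
```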
There are several shortcuts the user may employ to more easily match templates with incoming file names. The most powerful is wildcarding.
* matches 0 or more characters.
? matches one character.
examples:
the following templates all match matrix.lu
*.lu
matrix.*
matrix.??
?atrix.*
If there is no directory structure in the template, the matching will check only the leafname of the file being opened.
example:
template *.lu will match /tmp/jlb/statics/matrix.lu
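The wildcard and leaf-name rules can be expressed compactly in code. The following is a minimal sketch of the matching rules as described here, not the library's own matcher; the names wild_match and template_matches are hypothetical, and the behavior of * across a / when the template does contain a directory is an assumption.

```c
#include <string.h>

/* Match name against pattern, where '*' matches zero or more characters
 * and '?' matches exactly one. Recursive for clarity, not speed. */
static int wild_match(const char *pattern, const char *name)
{
    if (*pattern == '\0')
        return *name == '\0';
    if (*pattern == '*') {
        /* try every possible length for the run matched by '*' */
        for (;;) {
            if (wild_match(pattern + 1, name))
                return 1;
            if (*name == '\0')
                return 0;
            name++;
        }
    }
    if (*name != '\0' && (*pattern == '?' || *pattern == *name))
        return wild_match(pattern + 1, name + 1);
    return 0;
}

/* Apply the leaf-name rule: a template without '/' is compared only
 * against the last component of the path being opened. */
int template_matches(const char *tmpl, const char *path)
{
    if (strchr(tmpl, '/') == NULL) {
        const char *leaf = strrchr(path, '/');
        if (leaf != NULL)
            path = leaf + 1;
    }
    return wild_match(tmpl, path);
}
```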
The template may also indicate that more than one condition be met for a template match. The syntax for multiple conditions is an & separated list of individual templates. There may be no spaces in the multiconditional template, as that would indicate a new template.
example:
gt:DAU&matrix.lu
would match only a Fortran direct access unformatted open of matrix.lu.
This logic is necessary when attempting to match an open from Standard C I/O, gt:STD. Since trapping a Standard C I/O open may violate some standards, the eag_ffio library requires the user to acknowledge explicitly that this type of I/O is being trapped. The user does so by specifying the group tag in the template string, in addition to any other template name or file tag matching logic desired.
example:
gt:STD&matrix.lu
would match an open of file matrix.lu from the Standard C I/O package.
The user may change the hardcoded defaults of a given layer by specifying it in FF_IO_DEFAULTS. This is provided as a convenience for users. It allows the FF_IO_OPTS to become shorter and more readable.
example:
" eie.sds.diag:184:100 , event.summary.trace , set.cbits=0xff.cblks=184 "
Using the above string would change the defaults for eie, event, and set to use the indicated options and numerics rather than the hardcoded defaults. An ffopen of fort.20, with FF_IO_OPTS set as in the general case above, would invoke the event layer with summary and trace as options (rather than summary and notrace as specified in the hardcoded defaults), and an eie layer with an sds resident cache of 100 pages, each 184 blocks in size. All other options not referenced in FF_IO_DEFAULTS retain their hardcoded values.
Ffio layers are controlled by options and numeric values specified in the layer specification.
eie.mem
requests that the eie layer use a memory resident cache (rather than the mutually exclusive option of using an sds resident cache).
example:
eie.diag:92:0x20
The first numeric to the eie layer is 92 (decimal), the page size in blocks.
The second numeric to the eie layer is 32 (decimal), the number of pages.
example:
eie.diag.page_size=92.num_page=20
The numeric with the keyword page_size is set to 92.
The numeric with the keyword num_page is set to 20.
example:
eie.page_size=200.page_size=200.page_size=100
will result in a page_size of 100, since the last value given for a keyword takes effect.
Basic math can be performed on numeric values. The numeric string may include +-*.
examples:
All of the following result in a page_size of 200 blocks.
eie.page_size=200
eie.page_size=100+75+25
eie.page_size=100*2
The current value of the numeric may be modified using the C programming syntax "+=", "&=", and "|=".
example: eie.page_size=200.page_size+=25
will result in a page_size of 225 blocks.
This feature is most useful when ORing bits into the open(2) oflag via:
set.oflags_set=0o600.oflags_set|=0o020
will result in an oflags_set value of 0o620.
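The assignment operators behave like their C counterparts, which can be illustrated directly. The following is a minimal sketch, not the library's parser; the function name apply_numeric is hypothetical. Note that C writes octal constants with a leading 0 (0600), whereas the ffio numeric syntax shown above uses the 0o prefix (0o600).

```c
#include <string.h>

/* Apply one ffio-style numeric assignment to the current value.
 * op is "=", "+=", "&=", or "|="; any other string leaves the value
 * unchanged. Hypothetical helper for illustration. */
unsigned long apply_numeric(unsigned long current, const char *op,
                            unsigned long value)
{
    if (strcmp(op, "=") == 0)
        return value;
    if (strcmp(op, "+=") == 0)
        return current + value;
    if (strcmp(op, "&=") == 0)
        return current & value;
    if (strcmp(op, "|=") == 0)
        return current | value;
    return current;
}
```

Applying "+=" with 25 to a current value of 200 gives 225, and applying "|=" with octal 020 to octal 600 gives octal 620, matching the two examples above.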
In the following layer descriptions the numerics are given in the dependent order and their associated keyword also is given.
For user convenience, some layers have predefined aliases which provide a shortcut for setting options and numeric values. An alias may contain any valid option, numeric keyword, numeric value, or another valid alias.
example:
The eie layer has the following predefined alias:
.big_sds = .ssd:184:200:6:1
Specifying eie.big_sds would be equivalent to specifying eie.ssd:184:200:6:1
EIE final stats for file /usr/tmp/bauerj/sym2/SCR300
Used shared eie cache 8
29 sds cache pages of 276 blocks (69 sectors)
maximum read ahead (pages) : 20
advance reads used/started : 1619/ 1906 84.94%
read hits/total : 144324/ 144402 99.95%
write hits/total : 8782/ 8807 99.72%
Data transferred ( bytes ) program --> eie --> syscall
144293888 125157376
2365882368 2096037888
program <-- eie <-- syscall
| Alias Name | Alias Value |
|---|---|
| .summary | .diag |
| .ssd | .sds |
| .mr | .mem |
| .big_mem | .mem:184:40:4:1 |
| .big_sds | .ssd:184:200:6:1 |
The eie layer has a built-in write error recovery mechanism. If the eie layer detects that the write of a cache page failed due to any of the following errors, it will attempt to recover from the write error:
| errno | Description |
|---|---|
| ENOSPC | No space left on device |
| EQURSR | User file/inode quota limit reached |
| EQGRP | Group file/inode quota limit reached |
| EQACT | Account file/inode quota limit reached |
| EDISKLIM | Disk limit exceeded |
It will first check if the environment variable FF_IO_RECOVER_CMD is set. If set, FF_IO_RECOVER_CMD is interpreted as a string that will be passed to the system(2) kernel call. Before the system(2) call is executed, the environment variable FF_IO_RECOVER_MSG will be set by the eie cache to indicate which file encountered the write error and what the error was. This gives the $FF_IO_RECOVER_CMD command some information about the failure. The program with the write error will be suspended until the system(2) command completes. Upon completion of the system(2) call, the eie cache will attempt to reissue the cache page write.
A simple example of using FF_IO_RECOVER_CMD follows:
setenv FF_IO_RECOVER_CMD $HOME/bin/ffio_recover.csh
#!/bin/csh
# $HOME/bin/ffio_recover.csh
#
echo $$ $FF_IO_RECOVER_MSG | mail $USER
sleep 3600
# end of $HOME/bin/ffio_recover.csh
If a write error is detected the user will receive a mail message similar to the following:
96301 program =nastbio : file =SCR300 :User file/inode quota limit reached
The first integer is the pid of ffio_recover.csh. The program encountering the write error is indicated, followed by the file and error message. This should be enough information for the user to log into the system and take the necessary action to allow the reissue of the write to succeed. To have the program resume and reissue the write, the user must issue kill -9 96301 to kill ffio_recover.csh, which is sleeping for 3600 seconds. Once ffio_recover.csh is killed, the original program will resume execution and, if the underlying problem has been corrected, the write will succeed.
Fortran sequential unformatted I/O uses the cos layer to provide record and block control. If such a file uses the following layer specification string from FF_IO_OPTS:
(set.append | sds | syscall )
the layers that will be opened are:
cos, sds, syscall
and the resulting file will have record or block control words.
If the layer specification string is:
( set.override | sds | syscall )
the layers that will be opened are:
sds, syscall
and the resulting file will not have record or block control words.
The bits that are set in this numeric indicate which layers are to be skipped in the ffopen processing. The numeric is interpreted as a bit mask, bit 0 representing layer 0, bit 1 representing layer 1, etc.
It is sometimes beneficial to request that certain layers not be invoked in certain instances.
In UNICOS 70 and later, Fortran direct access unformatted I/O uses a cache layer by default. If the user intends to have the eie cache invoked for such a file it would most likely be of no use to have the default cache layer invoked. The UNICOS library cache can be skipped by specifying:
set.nocache
| Alias Name | Alias Value |
|---|---|
| .raw | .oflags_set\|=o_raw |
| .noraw | .oflags_clear\|=o_raw |
| .ldraw | .oflags_set\|=(o_ldraw\|o_raw) |
| .noldraw | .oflags_clear\|=o_ldraw |
| .stripe | .oflags_set\|=o_place |
| o_raw | 0x40 |
| o_ldraw | 0x80000 |
| o_place | 0x2000 |
| nocachea | .skip\|=cachea |
| nocache | .skip\|=cache |
| nowa | .skip\|=wa |
| noevent | .skip\|=event |
| nosds | .skip\|=sds |
| cachea | 0x400000 |
| cache | 0x80000 |
| wa | 0x200000000 |
| event | 0x80000000 |
| sds | 0x800 |
| Alias Name | Alias Value |
|---|---|
| .log | .trace |
| .nolog | .notrace |
The following C program generates a file by writing 1000 sequential records of 16384 bytes and then reads the file forward and backward.
#include <fcntl.h>
#include <stdlib.h>   /* malloc, free */
#include <unistd.h>
int main(void)
{
    int fd, i ;
    char *data ;
    int nrec, bytes_per_record ;
    bytes_per_record = 16384 ;
    nrec = 1000 ;
    fd = ffopen("test.dat", O_RDWR|O_CREAT|O_TRUNC, 0600 ) ;
    data = (char *)malloc( bytes_per_record ) ;
    /* write nrec sequential records */
    for(i=0;i<nrec;i++) ffwrite(fd, data, bytes_per_record);
    /* rewind and read the file forward */
    ffseek( fd, 0, SEEK_SET ) ;
    for(i=0;i<nrec;i++) ffread(fd, data, bytes_per_record );
    /* read the file backward, one record at a time */
    for(i=0;i<nrec;i++){
        ffseek( fd, (nrec-1-i)*bytes_per_record, SEEK_SET );
        ffread(fd, data, bytes_per_record ) ;
    }
    ffclose( fd ) ;
    free( data ) ;
    return 0 ;
}
The following lines generate the executable.
On UNICOS:
cc -c example.c
segldr -o example example.o libeag_ffio.a \
-D"hardref=_evt_ffvect;hardref=_eie_ffvect"
On IRIX:
cc -o example example.c -lffio -leag_ffio
The following script will run the program with only the system layer:
#!/bin/csh
setenv FF_IO_LOGFILE example.log
setenv FF_IO_OPEN_DIAGS true
setenv FF_IO_OPTS " *.dat ( syscall ) "
#
ja
./example
ja -cls > example.ja
Two files of interest, example.log and example.ja, are generated by the script.
% cat example.log
***********************************************************************
FF_IO version 7.1BF1 Apr 6 1994 C90 UNICOS 80
program=./example
Thu Apr 7 11:24:25 1994
FF_IO_DEFAULTS =(NULL)
FF_IO_OPTS =*.dat ( syscall )
FF_IO_LOGFILE =example.log
FF_IO_OPEN_DIAGS =true
FF_IO_TRACE_FILE =(NULL)
FF_IO_RECOVER_CMD=(NULL)
ffopen(/tmp/jtmp.000186a/.assign) ft:0 gt:246=KER
/tmp/jtmp.000186a/.assign : will not use ffio
ffopen(file.dat) ft:0 gt:0
opening layer syscall
file.dat : using layers : syscall
The diagnostics in example.log indicate that the ffopen of the first file opened (temporary file for the assign command) will not use ffio. The ffopen of file test.dat matched the template *.dat, resulting in the invocation of the system layer for the file.
% cat example.ja
Operating System : sn4025 hot 8.1.0bw d81.21 CRAY C90
Report Starts : 04/07/94 11:24:24
Report Ends : 04/07/94 11:24:42
Elapsed Time : 18 Seconds
User CPU Time : 0.0262 Seconds
System CPU Time : 0.8264 Seconds
I/O Wait Time (Locked) : 17.6853 Seconds
I/O Wait Time (Unlocked) : 0.0694 Seconds
Data Transferred : 5.8594 MWords
Maximum memory used : 0.1953 MWords
Logical I/O Requests : 3003
Physical I/O Requests : 3013
The statistics in the ja output indicate the 3003 logical I/O requests transferred 5.8594 Mwords of data with 17.6853 seconds of locked I/O wait time.
User striping allows a user to request that a file be allocated across multiple partitions of a file system. Typically, a file system is made up of many partitions, with each partition residing on an independent disk and I/O channel. This allows multiple requests to a file to occur simultaneously, increasing the effective transfer rate to the file. The first step with user striping is to determine the file system configuration. This is achieved with the df command, for which a sample output follows.
hot% df -p /usr/tmp
/usr/tmp (/dev/dsk/usr_tmp ): 2891665 sectors 0 trks 2993492 I-nodes
total: 2992300 sectors (0 trks) 2993600 I-nodes
Big file threshold: 32768 bytes
Big file allocation minimum: 24 blocks
Allocation Strategy: round robin files
round robin all user data
Primary partitions allocation unit: 16K byte blocks
part start total free (%) frags (%) device
---- -------- -------- ----------------- ---------------- --------
0 0 478768 471128 ( 98.4%) 1 ( 0.000%) utmp1230
1 478768 478768 470800 ( 98.3%) 1 ( 0.000%) utmp2030
2 957536 478768 470800 ( 98.3%) 1 ( 0.000%) utmp2130
3 1436304 478768 470800 ( 98.3%) 1 ( 0.000%) utmp2230
4 1915072 478768 396944 ( 82.9%) 1 ( 0.000%) utmp2330
5 2393840 478768 466620 ( 97.5%) 1 ( 0.000%) utmp3030
6 2872608 478768 456348 ( 95.3%) 1 ( 0.000%) utmp3330
7 3351376 478768 456440 ( 95.3%) 1 ( 0.000%) utmp2031
8 3830144 478768 456440 ( 95.3%) 1 ( 0.000%) utmp2131
9 4308912 478768 456532 ( 95.4%) 1 ( 0.000%) utmp2231
10 4787680 478768 456532 ( 95.4%) 1 ( 0.000%) utmp2331
11 5266448 478768 456532 ( 95.4%) 1 ( 0.000%) utmp3031
12 5745216 478768 456532 ( 95.4%) 1 ( 0.000%) utmp3331
13 6223984 478768 456532 ( 95.4%) 1 ( 0.000%) utmp2032
14 6702752 478768 456532 ( 95.4%) 1 ( 0.000%) utmp2132
15 7181520 478768 470984 ( 98.4%) 1 ( 0.000%) utmp2232
16 7660288 478768 471168 ( 98.4%) 1 ( 0.000%) utmp2332
17 8139056 478768 471168 ( 98.4%) 1 ( 0.000%) utmp3032
18 8617824 478768 471124 ( 98.4%) 1 ( 0.000%) utmp3332
19 9096592 478768 470936 ( 98.4%) 1 ( 0.000%) utmp2033
20 9575360 478768 471132 ( 98.4%) 1 ( 0.000%) utmp2133
21 10054128 478768 471124 ( 98.4%) 1 ( 0.000%) utmp2233
22 10532896 478768 471168 ( 98.4%) 1 ( 0.000%) utmp2333
23 11011664 478768 471172 ( 98.4%) 1 ( 0.000%) utmp3033
24 11490432 478768 471172 ( 98.4%) 1 ( 0.000%) utmp3343
From the df output it is observed that the file system /usr/tmp has 25 partitions (0-24), each residing on independent disks/channels (the device for each partition is unique). When selecting a cblks value for user striping, a good first guess is to use a multiple of the big file allocation minimum size from the df output. cblks is the number of blocks that the kernel will allocate to a given partition before rotating to the next partition specified in cbits. The selection of cbits can be a bit complicated. Typically, one should avoid the use of partition 0 since it is heavily used by the kernel to store inode information for the file system. If a particular partition has little free space, it too should be avoided. Once the user has determined the partitions to be used, the cbits value needs to be calculated. The cbits value is a mask indicating which partitions of a file system are to be used for user striping the file. The rightmost bit of the cbits word represents partition 0, the second bit from the right, partition 1, etc. An example showing the computation of cbits for the selection of partitions 1, 2, 4, 6, 7, and 8 follows.
set.cbits=0x1d6.cblks=92
0x 1 d 6 <-- hexadecimal value of 4-bit quantities
0001 1101 0110
|||| |||| ||||_ bit for partition 0
|||| |||| |||__ bit for partition 1
|||| |||| ||___ bit for partition 2
|||| |||| |____ bit for partition 3
|||| ||||
|||| ||||_______ bit for partition 4
|||| |||________ bit for partition 5
|||| ||_________ bit for partition 6
|||| |__________ bit for partition 7
||||
||||____________ bit for partition 8
|||_____________ bit for partition 9
||______________ bit for partition 10
|_______________ bit for partition 11
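The same mask can be computed programmatically from the list of chosen partitions. The following is a minimal sketch; the function name make_cbits is hypothetical and the 64-bit limit is an assumption made for the illustration.

```c
/* Build a cbits mask from a list of partition numbers: bit n of the
 * result is set when partition n is listed. Partition numbers outside
 * 0..63 are ignored. Hypothetical helper for illustration. */
unsigned long long make_cbits(const int *partitions, int count)
{
    unsigned long long mask = 0;
    int i;

    for (i = 0; i < count; i++) {
        if (partitions[i] >= 0 && partitions[i] < 64)
            mask |= 1ULL << partitions[i];
    }
    return mask;
}
```

For partitions 1, 2, 4, 6, 7, and 8 the result is 0x1d6, the value used in the example above.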
Like any ffio numeric value, the user also could specify cbits in base 8 (octal), base 10 (decimal), or base 16 (hexadecimal).
For the above example, the following are all equivalent since they all represent the same integer value and bit pattern.
set.cbits=0x1d6
set.cbits=0o726
set.cbits=470
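The equivalence of the three forms can be confirmed with a small parser. The following is a minimal sketch, not the library's parser; the function name parse_numeric is hypothetical. It accepts only the 0x, 0o, and plain decimal forms shown here, so a C-style octal value with a bare leading zero would be read as decimal.

```c
#include <stdlib.h>

/* Parse an ffio-style numeric: "0x" prefix for hexadecimal, "0o" for
 * octal, otherwise decimal. Returns the value, or -1 if the text is
 * not a complete non-negative number. Hypothetical helper. */
long parse_numeric(const char *text)
{
    int base = 10;
    char *end;
    long value;

    if (text[0] == '0' && (text[1] == 'x' || text[1] == 'X')) {
        base = 16;
        text += 2;
    } else if (text[0] == '0' && (text[1] == 'o' || text[1] == 'O')) {
        base = 8;
        text += 2;
    }
    if (*text == '\0')
        return -1;                  /* prefix with no digits */
    value = strtol(text, &end, base);
    if (*end != '\0' || value < 0)
        return -1;                  /* trailing junk or negative */
    return value;
}
```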
It may be more convenient to use an octal representation of cbits since the fck command returns the cbits value in octal format.
We can run the program in example 1 with user striping, using a cbits value of 0xfffe, which will request that the file be allocated across partitions 1 through 15, and a cblks value of 92. The following script will run the program with user striping and the syscall layer:
#!/bin/csh
setenv FF_IO_LOGFILE example.log
setenv FF_IO_OPEN_DIAGS true
setenv FF_IO_OPTS " *.dat ( set.cblks=92.cbits=0xfffe | syscall ) "
#
ja
./example
ja -cls > example.ja
Using the fck command we can verify that the file was indeed user striped. The fck output indicates that cblks is 92 and cbits is 0177776 (which is equivalent to 0xfffe). It also can be observed that the file striped as predicted, starting with 92 blocks on slice (partition) 1, then 92 blocks on slice 2, etc.
hot% fck -ilbp file.dat
File: file.dat Inode: 33 size: 16384000
dev: 34/50 rdev: 0/0 links: 1
blocks: 4052 cblks: 92 cbits: 0177776
mode: 100600 perm: 600 type: regular
UID: 210 GID: 0
acid: 210 gen: 387311198
inode changed: Thu Apr 21 10:06:50 1994
last modified: Thu Apr 21 10:06:50 1994
last accessed: Thu Apr 21 10:06:50 1994
Item Start blk Count Total Slc Log. Dev Phy. Dev Iopth Unit Cyl Trk Sectors
----- --------- ----- ------- --- -------- -------- ----- ---- ----- --- -------
data ********* 92 92 1 usr_tmp utmp2030 2030 0 41 0 14-22
utmp2030 2030 0 41 1 0-13
data ********* 92 184 2 usr_tmp utmp2130 2130 0 41 0 13-22
utmp2130 2130 0 41 1 0-12
data ********* 92 276 3 usr_tmp utmp2230 2230 0 41 0 14-22
utmp2230 2230 0 41 1 0-13
data ********* 92 368 4 usr_tmp utmp2330 2330 0 442 1 8-22
utmp2330 2330 0 443 0 0-7
data ********* 92 460 5 usr_tmp utmp3030 3030 0 64 0 1-22
utmp3030 3030 0 64 1 0
data ********* 92 552 6 usr_tmp utmp3330 3330 0 119 1 16-22
utmp3330 3330 0 120 0 0-15
data ********* 92 644 7 usr_tmp utmp2032 2032 0 119 1 16-22
utmp2032 2032 0 120 0 0-15
addr ********* 1 645 0 usr_tmp utmp1234 1234 0 41 0 21
data ********* 92 737 8 usr_tmp utmp2132 2132 0 119 1 16-22
utmp2132 2132 0 120 0 0-15
data ********* 92 829 1 usr_tmp utmp2030 2030 0 41 1 14-22
utmp2030 2030 0 42 0 0-13
data ********* 92 921 2 usr_tmp utmp2130 2130 0 41 1 13-22
utmp2130 2130 0 42 0 0-12
data ********* 92 1013 3 usr_tmp utmp2230 2230 0 41 1 14-22
utmp2230 2230 0 42 0 0-13
data ********* 92 1105 4 usr_tmp utmp2330 2330 0 443 0 8-22
utmp2330 2330 0 443 1 0-7
data ********* 92 1197 5 usr_tmp utmp3030 3030 0 64 1 1-22
utmp3030 3030 0 65 0 0
data ********* 92 1289 6 usr_tmp utmp3330 3330 0 120 0 16-22
utmp3330 3330 0 120 1 0-15
data ********* 92 1381 7 usr_tmp utmp2032 2032 0 120 0 16-22
utmp2032 2032 0 120 1 0-15
data ********* 92 1473 8 usr_tmp utmp2132 2132 0 120 0 16-22
utmp2132 2132 0 120 1 0-15
data ********* 92 1565 9 usr_tmp utmp2232 2232 0 119 1 16-22
utmp2232 2232 0 120 0 0-15
data ********* 92 1657 10 usr_tmp utmp2332 2332 0 119 1 16-22
The ja output indicates that the I/O performance is very similar to the example run without user striping. This results from the lack of any asynchronous I/O requests from the program to the kernel. For this example it makes no difference if the synchronous requests are satisfied by 1 channel or 15 channels. Since each request is synchronous, there is no possibility for multiple requests to be satisfied concurrently, which is what would take advantage of the multiple channels.
Operating System : sn4025 hot 8.1.0cd u81.2 CRAY C90
Report Starts : 04/21/94 10:06:32
Report Ends : 04/21/94 10:06:49
Elapsed Time : 17 Seconds
User CPU Time : 0.0277 Seconds
System CPU Time : 0.7761 Seconds
I/O Wait Time (Locked) : 17.0975 Seconds
I/O Wait Time (Unlocked) : 0.1331 Seconds
Data Transferred : 5.8594 MWords
Maximum memory used : 0.1563 MWords
Logical I/O Requests : 3003
Physical I/O Requests : 3056
We now can run the program with FF_IO_OPTS set as follows:
setenv FF_IO_OPTS "*.dat ( eie.mem.diag:92:20:5:0 | set:92:0xfffe )"
Again we have the two files of interest, example.log and example.ja.
%cat example.log
***********************************************************************
FF_IO version 7.1BF1 Apr 6 1994 C90 UNICOS 80
program=./example
Thu Apr 7 11:24:43 1994
FF_IO_DEFAULTS = (NULL)
FF_IO_OPTS = *.dat ( eie.mem.diag:92:20:5:0 | set:92:0xfffe )
FF_IO_LOGFILE =example.log
FF_IO_OPEN_DIAGS =true
FF_IO_TRACE_FILE =(NULL)
FF_IO_RECOVER_CMD=(NULL)
ffopen(/tmp/jtmp.000186a/.assign) ft:0 gt:246=KER
checking templates :*.dat
!= *.dat
/tmp/jtmp.000186a/.assign : will not use ffio
ffopen(file.dat) ft:0 gt:0
checking templates :*.dat
== *.dat
requested layers :set|eie
opening layer
eie.mem.diag.save.nobpons.wb.rls.listio.bytes:92:20:5:0::
opening layer system
file.dat : using layers : eie syscall
eie_close EIE final stats for file file.dat
eie_close Used private cache
eie_close 20 mem pages of 92 blocks (23 sectors), max_lead = 5 pages
eie_close advance reads used/started : 108/ 113 95.58%
eie_close read hits/total : 1998/ 2000 99.90%
eie_close write hits/total : 998/ 1000 99.80%
eie_close Data transferred ( bytes ) program --> eie --> syscall
eie_close 16384000 16384000
eie_close 32768000 25427968
eie_close program <-- eie <-- syscall
Again, the file test.dat matched the template *.dat and the layers eie and set were opened for the file. When the file was closed, the diag option to eie generated the above output, informing the user of the activity of the cache. The advance reads used/started line reflects the percentage of correct anticipation by the cache of pages that it prereads. The read and write hits indicate the percentage of incoming requests that were satisfied by the cache without having the cache request any data from its child layer. The "Data transferred" lines indicate the number of bytes read and written by the cache's parent, and the number of bytes read and written by the cache from its child.
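The percentages in the diagnostic output are consistent with a simple hits-over-total ratio. The following is a minimal sketch of that arithmetic; the function name hit_percent is hypothetical, and the assumption that eie computes its percentages this way is inferred only from the figures printed above.

```c
/* Percentage of requests satisfied by the cache, in the style of the
 * "hits/total" lines of the eie diagnostics.
 * Returns 0.0 when total is 0. Hypothetical helper for illustration. */
double hit_percent(unsigned long hits, unsigned long total)
{
    if (total == 0)
        return 0.0;
    return 100.0 * (double)hits / (double)total;
}
```

With the figures from the log, 1998 read hits out of 2000 gives 99.90%, and 108 advance reads used out of 113 started gives approximately 95.58%.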
%cat example.ja
Operating System : sn4025 hot 8.1.0bw d81.21 CRAY C90
Report Starts : 04/07/94 11:24:43
Report Ends : 04/07/94 11:24:45
Elapsed Time : 2 Seconds
User CPU Time : 0.0621 Seconds
System CPU Time : 0.0715 Seconds
I/O Wait Time (Locked) : 2.4317 Seconds
I/O Wait Time (Unlocked) : 0.1833 Seconds
Data Transferred : 4.9854 MWords
Maximum memory used : 1.0859 MWords
Logical I/O Requests : 126
Physical I/O Requests : 125
The ja statistics now report that 126 logical I/O requests transferred 4.9854 Mwords of data with 2.4317 seconds of Locked I/O wait time. The differences in the ja statistics reflect the effects of the eie cache. There are only 126 logical I/O requests, versus the original 3003, since the cache is buffering the program's 16384 byte requests into cache page requests that are 92 blocks (376832 bytes) in size. The Data Transferred is lower because data was reused from the cache, so the program did not have to go to the kernel for the data that was reused. The I/O wait time is much lower because the cache prefetched the bulk of the data with asynchronous requests, which were completed by the time the program actually needed the data.
There are several numbers that can be examined for a sanity check. In the example run without the eie cache, ja reported 5.8594 Mwords of Data Transferred. Since the program issued requests to the eie cache, rather than straight to the kernel, eie should report the same amount of data being requested by the program. Adding the 16384000 bytes written and 32768000 bytes read by the program (as reported by the eie cache) totals 49152000 bytes = 46.875 Mbytes = 5.8594 Mwords, which matches the non-cached Data Transferred. It should also be noted that the "Maximum memory used" increased from 0.1952 Mw in the non-eie example to 1.0859 Mw in the eie example. The increase is accounted for by the additional memory used by the eie cache pages (92*512*20/1048576, or about 0.9 Mw).

The fck output for this example is identical to that of example 2, since the user striping parameters are identical (cbits=0xfffe and cblks=92). Unlike example 2, user striping does provide a benefit in this example, since the eie cache is issuing asynchronous requests to the kernel to preread data into the cache. Many of these cache page prereads may be satisfied in parallel since the file is laid out on the file system with consecutive pages residing on independent channels. Note that the cache page size is the same as the striping factor, cblks, which places one full cache page per channel before cycling back to the first partition. This means each logical I/O request results in one physical I/O request, which is reflected in the ja output. Alternate strategies may be implemented which stripe each cache page over multiple channels (using a cache page size that is a multiple of the stripe factor). In that case each logical I/O request to read a cache page requires multiple physical I/O requests.
This example is identical to example 3 with the exception that the event layer will be used both before and after the eie cache to monitor the I/O events coming into the cache from the program and the I/O events issued to the syscall layer by the eie cache.
The following script is used to run the program:
#!/bin/csh
setenv FF_IO_LOGFILE example.log
setenv FF_IO_OPEN_DIAGS true
setenv FF_IO_DEFAULTS " event.summary ,eie.mem.diag:92:20:5:0 ,set:92:0xfffe "
setenv FF_IO_OPTS "*.dat ( set | event | eie | event | syscall )"
#
ja
./example
ja -cls > example.ja
The ja output for this example is nearly identical to that of example 3, since only the event layer was added, which uses very little CPU time and issues only a few logical I/O requests.
The event layer before the eie cache will output a summary similar to the following.
evt_close(file.dat) program<-->eie ( 49152000 bytes)/( 2.47 s)=19899595.87 bytes/s
open flags=0x0000400000002342=RAW+RDWR+CREAT+TRUNC+PLACE
sector size =4096(bytes)
cblks =92 cbits =0x000000000000fffe
current file size =16384000 bytes high water file size =16384000 bytes
function   times     ill   wait      bytes      bytes      min      max      avg     all
          called  formed   time  requested  delivered  request  request  request  hidden
open           1           0.05
seek        1001
write       1000       0   1.45   16384000   16384000    16384    16384    16384
read        2000       0   0.97   32768000   32768000    16384    16384    16384
close          1           0.00
extends     1000
The first line of the event layer output reports the file being closed and its layer position. The string "program<-->eie" indicates that this event summary is for the event layer between the program and the eie layer. This event layer reports the statistics that would be expected from the program being run: 1000 writes of 16384 bytes each, 2000 reads of 16384 bytes each, and 1001 seeks.

The event layer below the cache, whose summary follows, reports the I/O requests that the eie cache is making to the kernel via the syscall layer. The event layer output shows that the eie cache is using listio requests rather than reads and writes. The 1000 program writes have been reduced to 44 asynchronous writes to the kernel, 43 of which were completed by the time the asynchronous write was recalled. This is the write behind feature of the eie cache. There were 2 synchronous reads of cache pages issued by the cache before the sequential access was detected, and 66 asynchronous reads then followed. 38 of the asynchronous reads (the "all hidden" column) were completed by the kernel before the data was actually needed by the eie cache to satisfy a program request for data. This is the benefit of the read ahead feature of the cache. The recall wait time under the fcntl heading is the amount of time spent waiting for the kernel to complete the asynchronous reads and writes.
evt_close(file.dat ) eie <-->syscall ( 41811968 bytes)/( 2.42 s)=17277672.64 bytes/s
open flags=0x0000400000002342=RAW+RDWR+CREAT+TRUNC+PLACE
sector size =4096(bytes)
cblks =92 cbits =0x000000000000fffe
current file size =16384000 bytes high water file size =16384000 bytes
function   times     ill   wait      bytes      bytes      min      max      avg     all
          called  formed   time  requested  delivered  request  request  request  hidden
open           1           0.05
listio       112           1.58
seek          27
writea        44       0          16384000   16384000   180224   376832   372363      43
read           2       0            753664     753664   376832   376832   376832
reada         66       0          24674304   24674304   180224   376832   373853      38
fcntl
recall       110           0.79
other          3           0.00
flush          1           0.00
close          1           0.00
extends       44