EAG_FFIO Programming and Usage Guide

Contents

  1. Introduction
  2. Terminology
  3. Understanding Your Data's Trip to/from Disk
  4. I/O Performance Measurement (UNICOS)
  5. Access to I/O at the Programming Level
  6. Loading the eag_ffio Library into a Program
  7. FF_IO Environment Variables
  8. Layer Descriptions
  9. Examples


Introduction

The Engineering Applications Group, EAG, at Cray Research has worked for years helping vendors optimize applications on Cray Research supercomputers. Much of this work concentrated on vectorizing and parallelizing the codes to exploit the ever increasing hardware speed. However, little work was done to ensure that the I/O performance of these codes on Cray Research systems was optimal. With the improvements in performance due to code optimization, the I/O requirements of large-scale applications have become even more obvious. For many large runs in structural analysis applications, the I/O wait time would often exceed the CPU time. This overhead increases the time to solution and also makes gains through parallel processing less effective in reducing time to solution.

In early 1992 a project was started within EAG to allow applications to take full advantage of the high performance I/O features of Cray Research systems. EAG has been developing an I/O library, libeag_ffio.a, which will allow individual applications to:

  1. provide their own caching of I/O data.
  2. measure the I/O performance on a per file basis.
  3. track I/O activity on an event basis.
  4. utilize user striping of files. (UNICOS systems only)

This documentation is intended to help users of eag_ffio understand the I/O in their programs, and to show how to include eag_ffio in a program and use it to optimize I/O performance.

Although eag_ffio was originally written for Cray Research systems running the UNICOS operating system, it has been ported to IRIX with much of the functionality intact.


Terminology

An understanding of the following definitions is essential when attempting to optimize the I/O of any program.

Kernel
The IRIX or UNICOS operating system.
Logical I/O request
A kernel call, such as read(2) or write(2), issued by the user program requesting the kernel to move data to/from user space.
Physical I/O request
The kernel's request to underlying device drivers to satisfy a logical I/O request.
Block
4096 bytes
Sector
The minimum request size of a physical device.
Historically the minimum request for Cray systems was one block, and the terms block and sector were used interchangeably. However, some newer disk drives now use sector sizes of 4 or more blocks.
Well-formed I/O request
A logical I/O request that begins and ends on sector boundaries.
Buffered I/O
The kernel will copy the user's data to be read/written into a buffer in kernel memory. After the copy from user space to the kernel buffer is complete, the user's process is eligible to be rolled out of memory ( not locked in memory ), resulting in Unlocked I/O wait time. The transfer to disk will then be from the kernel's buffer, rather than the user's.
Unbuffered I/O
RAW I/O (UNICOS)
DIRECT I/O (IRIX)
If certain conditions are met, the data being read/written from user space will not be buffered by the kernel. The data will be transferred directly from the user space to the underlying disk driver. On UNICOS systems this means that the user's program cannot be rolled out of memory until the transfer is complete (the user's program is locked in memory), resulting in Locked I/O wait time. The conditions that must be met for raw I/O to occur are:
  1. the logical request begins and ends on sector boundaries (well-formed)
  2. the file was opened with the O_RAW bit set ( see open(2) )
or
  1. the logical request begins and ends on sector boundaries (well-formed)
  2. the file resides on a file system that is ldcached.
Current position
The byte location of a given file where the next I/O operation will start.
Dirty cache page
A cache page that contains data written by the program, but has not yet been written to disk.
LDCACHE (UNICOS only)
Logical Device Cache: UNICOS concept of using faster memory devices like main memory or SSD to cache logical device I/O transfers.

Understanding Your Data's Trip to/from Disk

The following program will write four bytes from memory to a file.


/*  1 */    #include <fcntl.h>
/*  2 */    #include <unistd.h>
/*  3 */    int main(void)
/*  4 */    {
/*  5 */       int fd, ret ;
/*  6 */       char *data_ptr = "abcd" ;
/*  7 */       fd  = open("file.dat", O_RDWR|O_CREAT , 0640 ) ;
/*  8 */       ret = lseek( fd , 3001 , SEEK_SET ) ;
/*  9 */       ret = write( fd , data_ptr , 4) ;
/* 10 */       close(fd);
/* 11 */       return 0;
/* 12 */    }


    line 7 : 	The open kernel call returns a file descriptor, fd, which the 
		program uses to reference the file for future I/O operations.  

    line 8 : 	The lseek kernel call informs the kernel to set the current
		position of file fd to byte 3001.

    line 9 : 	The write kernel call requests that the kernel copy four bytes 
   		from the user's memory location pointed to by data_ptr, to the
  		file fd, starting at the current position.

What must the kernel do to satisfy this write request?

  1. Determine which physical sector contains bytes 3001 through 3004 of the file referenced by fd.
  2. Determine if this sector is in the kernel's buffers.
  3. If not in the kernel's buffers, the kernel must free up one of its buffers and then issue a request to the I/O subsystem to transfer a copy of the complete sector from the physical disk sector into the kernel buffer.
  4. Once a copy of the physical sector is in the kernel buffer, the kernel will do a memory copy of the 4 bytes from the user's memory pointed to by data_ptr, to the kernel buffer's 4 bytes corresponding to file bytes 3001-3004.
  5. At some later time, the kernel will need the buffer for some other I/O activity, and will issue a request to the I/O subsystem to transfer the updated copy of the kernel buffer to the physical disk sector. Once the transfer is complete, the buffer may be used for some other I/O activity.

I/O Performance Measurement (UNICOS)

When examining the I/O performance of a program there are four items of interest:

  1. The number of logical requests.
  2. The amount of data transferred.
  3. The amount of time spent waiting for the logical I/O requests to complete.
  4. The amount of cpu time the kernel spent processing the I/O requests, which is part of the system cpu time.

As a rule, the greater the number of logical I/O requests, the greater the system cpu time. It is generally desirable to reduce the number of logical I/O requests by increasing the size of each logical I/O request. This has a double impact. There are fewer logical I/O requests for the kernel to handle, reducing the system cpu time. The I/O wait times will be reduced since the fixed startup time of any I/O request will be spread out over the larger requests.

To understand how to tune I/O, one first must have an understanding of the I/O measurement tools that are available to gauge the performance. The two main tools are ja (job accounting) and procstat (process statistics).

ja

The first (and simplest) tool is ja. The output from ja will give the I/O wait times, amount of data transferred, and the number of logical and physical requests.

Sample output of relevant I/O fields from "ja -cl":


      
	Job Accounting - Command Report
	===============================
	
Command ...  I/O Wait I/O Wait   ...  Kwords   Log I/O  Phy I/O   
Name    ...  Sec Lck  Sec Unlck  ... Xferred   Request  Request   
======= ... ========= ========== ... ========  =======  =======   
nastbio ...    4.4296     0.5219 ... 24988.00     1213     1300   
      

Descriptions of the fields (taken from "man ja"):

I/O Wait Sec Lck
Amount of time the process waits for I/O while it is locked in memory. I/O wait time is the time a process is blocked until it is rescheduled. The process is blocked while waiting for things such as raw I/O to complete.
I/O Wait Sec Unlck
Amount of time the process is blocked until it is rescheduled while it is not locked in memory. Time spent for system buffers and buffered I/O blocks are included here.
Kwords Xferred
Number of Kwords read or written by the read, write, reada, writea, and listio system calls (see read(2), write(2), reada(2), writea(2), and listio(2)).
Log I/O Request
Number of logical I/O requests performed by the process. A logical I/O request is performed each time a process calls a read, write, reada, or writea system call. When the listio system call (see listio(2)) is called, the number of logical I/O requests is equal to the number of strides multiplied by the number of requests processed.
Phy I/O Request
Number of times data was actually read/written from/to a device. Requests found in the buffer cache and requests retrieved along with another I/O request are not included in this count.

Another useful option combination is "-se", which produces an extended summary report:

-e
Generates an extended summary report; you must use -e with the -s option. The following are descriptions of fields produced by specifying the -e option with the -s option. These fields provide additional accumulated statistics for the reporting period. Several fields contain values only if performance accounting has been enabled; otherwise, the string NA is printed instead.
System Call Time
Total amount of time (in seconds) that the processes executed system calls.
I/O Wait Time (Terminals)
Total amount of time in seconds that the processes waited for I/O from and to terminals. This field contains a significant value only if performance accounting is enabled (see devacct(8)).
Wait Time while Swapped
Total amount of time (in seconds) that the processes waited while swapped out of memory. This field contains a significant value only if performance accounting is enabled (see devacct(8)).
Number of Swaps
Number of times the processes were swapped out of memory.
Physical Blocks Moved (Bufd I/O)
Number of physical blocks transferred by processes to and from block devices by using the system buffer I/O interface. This field contains a significant value only if performance accounting is enabled (see devacct(8)).
Physical Blocks Moved (Raw I/O)
Number of physical blocks transferred by processes to and from block devices by using the raw I/O interface. This field contains a significant value only if performance accounting is enabled (see devacct(8)).

procstat

Recommended usage:

procstat -R file_name command arg1 arg2 ... argn 

The procstat command produces information on the current activity of a specified command. The information always includes process start and exit, and by default includes memory activity and I/O usage (including SDS usage). The recommended usage of procstat is with the -R option which will produce a run time statistics data file which can be viewed with procview.

Where ja reports total I/O performed by a program, procstat will report I/O activity, on a per-file basis, between the program and the kernel. Procview will report the number of reads, writes, and bytes transferred for each file open/close pair. Additional information includes maximum file sizes, I/O transfer rates, I/O wait times, and types of I/O.


Access to I/O at the Programming Level

When a program is being developed, the programmer has a wide variety of ways in which to perform I/O. On Cray Research platforms, some of the more common ones are:

  1. Fortran
  2. Standard C I/O ( fopen, fread, fwrite, ... )
  3. AQIO ( AQOPEN, AQREAD, AQWRITE, ... )
  4. Flexible file calls ( ffopen, ffread, ffwrite, ... )
  5. Word addressable package ( WOPEN, GETWA, PUTWA )

When using the eag_ffio library, it is possible to trap I/O which was initiated in any of the above modes. The I/O may then pass through any of the standard UNICOS ffio layers ( sds, mr, cos, ... ) in addition to the layers available in the eag_ffio library ( eie, event, set ).

Some of these modes already use ffio by default. Fortran sequential unformatted I/O uses ffio with the cos layer to provide the required cos blocking. Fortran direct access unformatted I/O uses the cachea layer to buffer the Fortran records.

Fortran sequential formatted I/O uses the Standard C I/O library to buffer its I/O. Since it is possible to trap Standard C I/O, it is possible to trap Fortran sequential formatted I/O.

In the standard UNICOS libraries, AQIO uses system calls by default. The user must use an "assign -F system filename" command to force the AQIO package to use ffio for the given file, which will allow it to be trapped by eag_ffio.


Loading the eag_ffio Library into a Program

On UNICOS systems libeag_ffio.a is delivered as an unsupported and undocumented library with Programming Environment 3.0 and later. Include libeag_ffio when linking:

cc -o main main.c -leag_ffio

On IRIX systems libeag_ffio.so is delivered as an unsupported and undocumented library with MIPSpro 7.2.1 and later. Include libeag_ffio when linking:

cc -o main main.c -lffio -leag_ffio

To check if the link successfully included the eag_ffio library:


what main | grep ffopen

should produce a line similar to the following:

trap_ffopen.c 6.4BF0 Jul 31 1993 YMP UNICOS 70

FF_IO Environment Variables

There are 4 environment variables that affect the actions of the eag_ffio library: FF_IO_OPTS, FF_IO_DEFAULTS, FF_IO_LOGFILE, and FF_IO_OPEN_DIAGS.

Two others, FF_IO_TRACE_FILE and FF_IO_RECOVER_CMD, are discussed in subsequent sections.


Layer Descriptions

Ffio layers are controlled by options and numeric values specified in the layer specification.

Options:

Numerics:

Basic math can be performed on numeric values. The numeric string may include +-*.


        examples:
		All of the following result in a page_size of 200 blocks.

		eie.page_size=200
		eie.page_size=100+75+25
		eie.page_size=100*2


The current value of the numeric may be modified using the C programming syntax "+=", "&=", and "|=".


	example:
		eie.page_size=200.page_size+=25

will result in a page_size of 225 blocks.

This feature is most useful when ORing bits into the open(2) oflag via:


	set.oflags_set=0o600.oflags_set|=0o020

will result in an oflags_set value of 0o620.

In the following layer descriptions, the numerics are listed in the order in which they must be given, together with their associated keywords.

Aliases

For user convenience, some layers have predefined aliases which provide a shortcut for setting options and numeric values. An alias may contain any valid option, numeric keyword, numeric value, or another valid alias.

	example:
	The eie layer has the following predefined alias:
		.big_sds = .ssd:184:200:6:1
		specifying 
			eie.big_sds
		would be equivalent to specifying
			eie.ssd:184:200:6:1

eie layer

General Description

The eie ( Enhanced Intelligence Engineering ) layer performs two main functions. The first is caching of user I/O requests in either user memory or user sds space. The primary benefit of this caching is that the smaller program I/O requests will be buffered into full cache page requests to the kernel. This can dramatically cut down on the number of logical I/O requests, reducing both system cpu time and I/O wait time. The second function of the eie layer is to detect sequential access of the file by the program. Once this sequential access is detected, the cache will asynchronously preload cache pages with file data in the direction of the detected access.

Numerics

  1. page_size
    The number of blocks per cache page. If the requested page_size is not a multiple of the sector size of the underlying file system, page_size is automatically rounded up to the next multiple of the sector size. If the cache is a shared cache, the page_size is rounded up to a multiple of 4 blocks.
  2. num_page
    The number of pages in the cache. If the eag_ffio library version is 6.9BF0 or greater and num_page is entered as a negative value, num_page is interpreted as the total size of the cache and the number of pages is calculated by dividing the absolute value of num_page by page_size.
  3. max_lead
    The maximum number of pages the cache is to asynchronously read ahead of the program requests once sequential access has been detected.
  4. share
    Indicates a private or shared cache. A private cache, share=0, is used only by the file requesting the cache. A shared cache allows up to 255 files to share the cache pages between the files. The first file to request the cache defines the page_size, num_page, and cache residency (mem or sds). All other files using the cache use it as defined by the first file. By default, if the file usage count drops to zero the memory or sds space allocated to the cache is returned to the heap. If the shared cache is opened again, the space must be reallocated from the heap. This may cause heap fragmentation. To prevent this, use the norls option, informing the shared cache not to release its space even if the file usage count goes to zero.
  5. stride
    By default, the eie cache attempts to detect sequential access of cache pages with a stride of 1 page. The user may select an alternate stride to be detected via the stride numeric.
  6. alloc
    The alloc numeric represents the number of pages the cache should request the kernel to allocate to the file each time the cache is to write a cache page that would extend the file beyond its current allocation. Performing large allocations at once can be very beneficial on file systems with slow transfer rates. By default, the cache does not request any allocations for the file, but relies on the default action of the kernel/file system to provide the allocations.

Options

mem, sds
The cache can reside in either user memory (mem), the default, or user secondary data segment (sds).
nodiag, diag
The cache is to report cache usage statistics, diag, or not to report statistics, nodiag. The report will have the following format.

	EIE final stats for file /usr/tmp/bauerj/sym2/SCR300
 	Used shared eie cache 8
	29 sds cache pages of      276 blocks (69 sectors)
	maximum read ahead (pages) :       20
	advance reads used/started :     1619/    1906   84.94%
	read  hits/total           :   144324/  144402   99.95%
	write hits/total           :     8782/    8807   99.72%
		Data transferred   ( bytes )
  		program --> eie  --> syscall
 	  		   144293888 125157376
			   2365882368 2096037888
		program <-- eie  <-- syscall

bytes, mbytes, gbytes, words, mwords, gwords, blocks
selects which units are to be used for reporting data transferred values to $FF_IO_LOGFILE. For units other than bytes, the quantities are always rounded up to the next multiple of the units.
wb, nowb, hldwb
The eie cache has logic to asynchronously write out dirty cache pages. This logic is on by default. It may be completely disabled with nowb. The write behind logic may be delayed until a cache runs out of unused pages by specifying hldwb (hold write-behind).
save, scr
The save option indicates that the file is to be valid upon closing. The scr option indicates that the file need not be valid upon closing, allowing the cache to skip some flushing of data.
rls, norls
These options apply only to a shared cache. The default option, rls, has the cache release the memory or sds space that the cache pages reside in when the file count (the number of files currently opened and using the shared cache) goes to zero. The norls option has the cache remain allocated even when the file count goes to zero. The norls option is useful in preventing heap fragmentation if a shared cache is opened and closed multiple times.
nobpons, bpons ( bypass on no space )
If the eie cache is unable to allocate sufficient space for the cache pages, the bpons option will allow the open to continue without the cache. By default, nobpons, the open will fail.

Aliases

Alias Name    Alias Value
.summary      .diag
.ssd          .sds
.mr           .mem
.big_mem      .mem:184:40:4:1
.big_sds      .ssd:184:200:6:1

FF_IO_RECOVER_CMD

The eie layer has a built-in write error recovery mechanism. If the eie layer detects that the write of a cache page failed due to any of the following errors, it will attempt to recover from the write error:

errno       Description
ENOSPC      No space left on device
EQURSR      User file/inode quota limit reached
EQGRP       Group file/inode quota limit reached
EQACT       Account file/inode quota limit reached
EDISKLIM    Disk limit exceeded

It will first check if the environment variable FF_IO_RECOVER_CMD is set. If set, FF_IO_RECOVER_CMD is interpreted as a string that will be passed to the system(2) kernel call. Before the system(2) call is executed, the environment variable FF_IO_RECOVER_MSG will be set by the eie cache to indicate which file encountered the write error and what the error was. This gives $FF_IO_RECOVER_CMD some information about the failure. The program with the write error will be suspended until the system(2) command completes. Upon completion of the system(2) call, the eie cache will attempt to reissue the cache page write.

A simple example of using FF_IO_RECOVER_CMD follows:

setenv FF_IO_RECOVER_CMD $HOME/bin/ffio_recover.csh

#!/bin/csh
# $HOME/bin/ffio_recover.csh
#
echo $$ $FF_IO_RECOVER_MSG | mail $USER
sleep 3600
#end of $HOME/bin/ffio_recover.csh

If a write error is detected the user will receive a mail message similar to the following:

96301 program =nastbio : file =SCR300 :User file/inode quota limit reached

The first integer is the pid of ffio_recover.csh. The program encountering the write error is indicated, followed by the file and error message. This should be enough information for the user to log into the system and take the necessary action to allow the reissue of the write to succeed. When the user wants the program to resume and reissue the write, they must issue kill -9 96301 to kill ffio_recover.csh, which is sleeping for 3600 seconds. Once ffio_recover.csh is killed, the original program will resume execution and, if the problem has been corrected, the reissued write should succeed.


set layer

General Description

The set layer is a catch-all pseudo-layer that provides many functions during the open processing of a file. There is no actual set layer. The most significant function of the set layer involves the user striping of a file. This is accomplished by setting the cbits and cblks values for the resulting open(2) system call. Other functions include setting or clearing bits in the oflag word for the open(2) system call, and disabling the invocation of selected layers.

Options

append, override
The layers in the layer specification string are to be either appended to, or completely override, any incoming layers.

example:

Fortran sequential unformatted I/O uses the cos layer to provide record and block control. If such a file uses the following layer specification string from FF_IO_OPTS:

(set.append | sds | syscall )

the layers that will be opened are:

cos, sds, syscall

and the resulting file will have record or block control words.

If the layer specification string is:

( set.override | sds | syscall )

the layers that will be opened are:

sds, syscall

and the resulting file will not have record or block control words.

Numerics

  1. cblks
    The stripe factor: the number of blocks to be allocated on a given partition before cycling to the next partition.
  2. cbits
    A bit mask that indicates which partitions of the file system are to be used for striping.
  3. oflags_set
  4. oflags_clear
    The bits that are set in these numerics will either be set or cleared in the oflag argument of the ffopen during the open processing of the file. The setting or clearing of bits will occur between the 2 layers that the set layer is located between in the spec string. Examples of bits that are often set or cleared are O_RAW ( 0x40 ) and O_LDRAW ( 0x80000 ).
  5. skip

    The bits that are set in this numeric indicate which layers are to be skipped in the ffopen processing. The numeric is interpreted as a bit mask, bit 0 representing layer 0, bit 1 representing layer 1, etc.

    It is sometimes beneficial to request that certain layers not be invoked.

    Example:

    In UNICOS 70 and later, Fortran direct access unformatted I/O uses a cache layer by default. If the user intends to have the eie cache invoked for such a file it would most likely be of no use to have the default cache layer invoked. The UNICOS library cache can be skipped by specifying:

    set.nocache

Aliases

Alias Name    Alias Value
.raw          .oflags_set|=o_raw
.noraw        .oflags_clear|=o_raw
.ldraw        .oflags_set|=(o_ldraw|o_raw)
.noldraw      .oflags_clear|=o_ldraw
.stripe       .oflags_set|=o_place
o_raw         0x40
o_ldraw       0x80000
o_place       0x2000
nocachea      .skip|=cachea
nocache       .skip|=cache
nowa          .skip|=wa
noevent       .skip|=event
nosds         .skip|=sds
cachea        0x400000
cache         0x80000
wa            0x200000000
event         0x80000000
sds           0x800


event layer

General Description

The event layer monitors the I/O occurring between two other layers, much as procstat monitors I/O events between the program and the kernel. The event layer can be requested to gather total statistics only, or may also gather blow-by-blow events in a binary trace file which can be used to reproduce the I/O events.

Numerics

  1. file_num
    Indicates which binary trace file name is to be used. This may be any value from 0 to 9. The value of file_num refers to which name given in FF_IO_TRACE_FILE is to be used. FF_IO_TRACE_FILE may contain up to 10 file names, which are referenced from 0 to 9. If FF_IO_TRACE_FILE is not set or does not contain a name referenced by file_num, the file name ffio.events.file_num will be used. The exception to this rule is file_num=0, where the file name defaults to ffio.events.

Options

notrace, trace
If trace is specified, the event layer generates a binary trace file of all I/O activity between the event's parent layer and child layer.
rtc, cpc
Indicates which type of clock to use for the times reported by the event layer. rtc uses the real time clock. cpc uses the cpu clock.
diag, nodiag, summary, brief
Selects the level of output to be printed in $FF_IO_LOGFILE. nodiag requests that no activity be reported. diag requests that all event types be reported. summary requests that all event types that have at least one occurrence be reported. brief requests that only a one-line summary of the layer activities be reported.
bytes, mbytes, gbytes, words, mwords, gwords, blocks
Selects which units are to be used for reporting data transferred values to $FF_IO_LOGFILE. For units other than bytes, the quantities are always rounded up to the next multiple of the units.

Aliases

Alias NameAlias Value
.log.trace
.nolog.notrace

wa layer

General Description

The word addressable layer is an ffio implementation of the word addressable package. The ffopen is translated into an appropriate WOPEN, ffread into an appropriate READA, etc. The main intent in the creation of the wa layer was to allow developers and users an easy way to compare the I/O performance of other ffio layers with the I/O performance of a traditional and widely used Cray Research I/O package without having to modify their code.

Numerics

  1. blocks
    The number of blocks for buffering.
  2. dn
    The first 7 characters of the dataset name to use. Note that since dn is a numeric, this value must be the binary representation of the right 7 bytes of the string.
  3. dn_ext
    The 8th character of the dataset name to use. Note that since dn_ext is a numeric, this value must be the binary representation of the rightmost byte of the string.

Examples

Example 1 : Baseline case

The following C program generates a file by writing 1000 sequential records of 16384 bytes and then reads the file forward and backward.


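	#include <fcntl.h>
	#include <stdlib.h>	/* malloc, free */
	#include <unistd.h>
	#include <ffio.h>	/* ffopen, ffread, ffwrite, ffseek, ffclose */

	int main(void)
	{
		int fd, i ;
		char *data ;
		int nrec, bytes_per_record ;

		bytes_per_record = 16384 ;
		nrec = 1000 ;

		fd = ffopen("test.dat", O_RDWR|O_CREAT|O_TRUNC, 0600 ) ;

		/* one record buffer, reused for every transfer */
		data = (char *)malloc( bytes_per_record ) ;
		if( data == NULL ) return 1 ;

		/* write nrec sequential records */
		for(i=0;i<nrec;i++) ffwrite(fd, data, bytes_per_record);

		/* rewind and read the file forward */
		ffseek( fd, 0, SEEK_SET ) ;
		for(i=0;i<nrec;i++) ffread(fd, data, bytes_per_record );

		/* read the file backward, one record at a time */
		for(i=0;i<nrec;i++){
			ffseek( fd, (nrec-1-i)*bytes_per_record, SEEK_SET );
			ffread(fd, data, bytes_per_record ) ;
		}

		ffclose( fd ) ;
		free( data ) ;
		return 0 ;
	}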


The following lines generate the executable. The first two commands use segldr on UNICOS; the last command is the IRIX link line.


	cc -c example.c
	segldr -o example example.o libeag_ffio.a \
		-D"hardref=_evt_ffvect;hardref=_eie_ffvect"
	cc -o example example.c -lffio -leag_ffio


The following script will run the program with only the system layer:


	#!/bin/csh
	setenv FF_IO_LOGFILE example.log
	setenv FF_IO_OPEN_DIAGS true
	setenv FF_IO_OPTS " *.dat ( syscall ) "
	#
	ja
	./example
	ja -cls > example.ja

Two files of interest, example.log and example.ja, are generated by the script.


	% cat example.log
	***********************************************************************
	FF_IO version 7.1BF1 Apr  6 1994  C90 UNICOS 80
	program=./example
	Thu Apr  7 11:24:25 1994

	FF_IO_DEFAULTS  =(NULL)
	FF_IO_OPTS      =*.dat ( syscall )
	FF_IO_LOGFILE    =example.log
	FF_IO_OPEN_DIAGS =true
	FF_IO_TRACE_FILE =(NULL)
	FF_IO_RECOVER_CMD=(NULL)

	
	ffopen(/tmp/jtmp.000186a/.assign)    ft:0    gt:246=KER
	/tmp/jtmp.000186a/.assign : will not use ffio
	

	ffopen(file.dat)    ft:0    gt:0
	             opening layer syscall
	file.dat : using layers : syscall 

The diagnostics in example.log indicate that the ffopen of the first file opened (temporary file for the assign command) will not use ffio. The ffopen of file test.dat matched the template *.dat, resulting in the invocation of the system layer for the file.


	
	% cat example.ja
	Operating System                 : sn4025 hot 8.1.0bw d81.21 CRAY C90
	Report Starts                    : 04/07/94 11:24:24
	Report Ends                      : 04/07/94 11:24:42
	Elapsed Time                     :           18      Seconds
	User CPU Time                    :            0.0262 Seconds
	System CPU Time                  :            0.8264 Seconds
	I/O Wait Time (Locked)           :           17.6853 Seconds
	I/O Wait Time (Unlocked)         :            0.0694 Seconds
	Data Transferred                 :            5.8594 MWords
	Maximum memory used              :            0.1953 MWords
	Logical I/O Requests             :         3003
	Physical I/O Requests            :         3013

The statistics in the ja output indicate the 3003 logical I/O requests transferred 5.8594 Mwords of data with 17.6853 seconds of locked I/O wait time.

Example 2: User striping

User striping allows a user to request that a file be allocated across multiple partitions of a file system. Typically, a file system is made up of many partitions with each partition residing on an independent disk and I/O channel. This will allow multiple requests to a file to occur simultaneously, increasing the effective transfer rate to the file. The first step with user striping is to determine the file system configuration. This is achieved with the df command, for which a sample output follows.



hot% df -p /usr/tmp
/usr/tmp     (/dev/dsk/usr_tmp ):   2891665 sectors       0 trks 2993492 I-nodes
                           total:   2992300 sectors      (0 trks)   2993600 I-nodes

        Big file threshold:            32768 bytes
        Big file allocation minimum:      24 blocks

        Allocation Strategy:    round robin files
                                round robin all user data

        Primary partitions allocation unit:         16K byte blocks

        part   start     total         free (%)          frags (%)      device
        ----  --------  --------  -----------------  ----------------  --------
           0         0    478768    471128 ( 98.4%)      1 (  0.000%)  utmp1230
           1    478768    478768    470800 ( 98.3%)      1 (  0.000%)  utmp2030
           2    957536    478768    470800 ( 98.3%)      1 (  0.000%)  utmp2130
           3   1436304    478768    470800 ( 98.3%)      1 (  0.000%)  utmp2230
           4   1915072    478768    396944 ( 82.9%)      1 (  0.000%)  utmp2330
           5   2393840    478768    466620 ( 97.5%)      1 (  0.000%)  utmp3030
           6   2872608    478768    456348 ( 95.3%)      1 (  0.000%)  utmp3330
           7   3351376    478768    456440 ( 95.3%)      1 (  0.000%)  utmp2031
           8   3830144    478768    456440 ( 95.3%)      1 (  0.000%)  utmp2131
           9   4308912    478768    456532 ( 95.4%)      1 (  0.000%)  utmp2231
          10   4787680    478768    456532 ( 95.4%)      1 (  0.000%)  utmp2331
          11   5266448    478768    456532 ( 95.4%)      1 (  0.000%)  utmp3031
          12   5745216    478768    456532 ( 95.4%)      1 (  0.000%)  utmp3331
          13   6223984    478768    456532 ( 95.4%)      1 (  0.000%)  utmp2032
          14   6702752    478768    456532 ( 95.4%)      1 (  0.000%)  utmp2132
          15   7181520    478768    470984 ( 98.4%)      1 (  0.000%)  utmp2232
          16   7660288    478768    471168 ( 98.4%)      1 (  0.000%)  utmp2332
          17   8139056    478768    471168 ( 98.4%)      1 (  0.000%)  utmp3032
          18   8617824    478768    471124 ( 98.4%)      1 (  0.000%)  utmp3332
          19   9096592    478768    470936 ( 98.4%)      1 (  0.000%)  utmp2033
          20   9575360    478768    471132 ( 98.4%)      1 (  0.000%)  utmp2133
          21  10054128    478768    471124 ( 98.4%)      1 (  0.000%)  utmp2233
          22  10532896    478768    471168 ( 98.4%)      1 (  0.000%)  utmp2333
          23  11011664    478768    471172 ( 98.4%)      1 (  0.000%)  utmp3033
          24  11490432    478768    471172 ( 98.4%)      1 (  0.000%)  utmp3343



From the df output it is observed that the file system /usr/tmp has 25 partitions (0-24), each residing on an independent disk/channel (the device for each partition is unique). When selecting a cblks value for user striping, a good first guess is a multiple of the big file allocation minimum from the df output. cblks is the number of blocks that the kernel will allocate to a given partition before rotating to the next partition specified in cbits. The selection of cbits can be a bit complicated. Typically, one should avoid partition 0, since it is heavily used by the kernel to store inode information for the file system. A partition with little free space should be avoided as well. Once the user has determined the partitions to be used, the cbits value needs to be calculated. The cbits value is a mask indicating which partitions of a file system are to be used for user striping the file. The rightmost bit of the cbits word represents partition 0, the second bit from the right represents partition 1, and so on. An example showing the computation of cbits for the selection of partitions 1, 2, 4, 6, 7, and 8 follows.


		set.cbits=0x1d6.cblks=92
    
       0x    1    d     6  <-- hexadecimal value of 4-bit quantities
          0001 1101  0110
          |||| ||||  ||||_ bit for partition 0
          |||| ||||  |||__ bit for partition 1
          |||| ||||  ||___ bit for partition 2
          |||| ||||  |____ bit for partition 3
          |||| ||||
          |||| ||||_______ bit for partition 4
          |||| |||________ bit for partition 5
          |||| ||_________ bit for partition 6
          |||| |__________ bit for partition 7
          ||||
          ||||____________ bit for partition 8
          |||_____________ bit for partition 9
          ||______________ bit for partition 10
          |_______________ bit for partition 11
       

Like any ffio numeric value, cbits also can be specified in base 8 (octal), base 10 (decimal), or base 16 (hexadecimal).

For the above example, the following are all equivalent since they all represent the same integer value and bit pattern.

		set.cbits=0x1d6
		set.cbits=0o726
		set.cbits=470
       

It may be more convenient to use an octal representation of cbits since the fck command returns the cbits value in octal format.
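
The mask arithmetic above can be checked with a few lines of Python. This is only a sketch for illustration and is not part of eag_ffio; the function name cbits_from_partitions is hypothetical. It sets one bit per selected partition and prints the result in the three bases discussed above.

```python
def cbits_from_partitions(partitions):
    """Return a cbits mask with bit n set for each partition n listed."""
    mask = 0
    for part in partitions:
        if part < 0:
            raise ValueError("partition numbers must be non-negative")
        mask |= 1 << part          # the rightmost bit represents partition 0
    return mask


# Partitions 1, 2, 4, 6, 7, and 8 from the example above.
mask = cbits_from_partitions([1, 2, 4, 6, 7, 8])

# The same integer written in hexadecimal, octal, and decimal.
print("set.cbits=0x%x" % mask)
print("set.cbits=0o%o" % mask)
print("set.cbits=%d" % mask)
```

All three printed values describe the same bit pattern, so any of them can be used in the set layer specification.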

We can run the program in example 1 with user striping, using a cbits value of 0xfffe, which will request that the file be allocated across partitions 1 through 15, and a cblks value of 92. The following script will run the program with user striping and the syscall layer:



	#!/bin/csh
	setenv FF_IO_LOGFILE example.log
	setenv FF_IO_OPEN_DIAGS true
	setenv FF_IO_OPTS " *.dat ( set.cblks=92.cbits=0xfffe | syscall ) "
	#
	ja
	./example
	ja -cls > example.ja

Using the fck command we can verify that the file was indeed user striped. The fck output indicates that cblks is 92 and cbits is 0177776 (which is equivalent to 0xfffe). It also can be observed that the file was striped as predicted, starting with 92 blocks on slice (partition) 1, then 92 blocks on slice 2, etc.


hot% fck -ilbp file.dat


File: file.dat       Inode:     33       size:  16384000
dev:       34/50     rdev:      0/0      links: 1
blocks:     4052     cblks:     92       cbits: 0177776
mode:   100600       perm:     600       type:  regular  
UID:       210       GID:        0
acid:          210   gen: 387311198

inode changed: Thu Apr 21 10:06:50 1994
last modified: Thu Apr 21 10:06:50 1994
last accessed: Thu Apr 21 10:06:50 1994

Item  Start blk Count  Total  Slc Log. Dev Phy. Dev Iopth Unit Cyl   Trk Sectors
----- --------- ----- ------- --- -------- -------- ----- ---- ----- --- -------
data  *********    92      92  1   usr_tmp utmp2030  2030    0    41   0  14-22 
                                           utmp2030  2030    0    41   1   0-13 
data  *********    92     184  2   usr_tmp utmp2130  2130    0    41   0  13-22 
                                           utmp2130  2130    0    41   1   0-12 
data  *********    92     276  3   usr_tmp utmp2230  2230    0    41   0  14-22 
                                           utmp2230  2230    0    41   1   0-13 
data  *********    92     368  4   usr_tmp utmp2330  2330    0   442   1   8-22 
                                           utmp2330  2330    0   443   0   0-7  
data  *********    92     460  5   usr_tmp utmp3030  3030    0    64   0   1-22 
                                           utmp3030  3030    0    64   1       0
data  *********    92     552  6   usr_tmp utmp3330  3330    0   119   1  16-22 
                                           utmp3330  3330    0   120   0   0-15 
data  *********    92     644  7   usr_tmp utmp2032  2032    0   119   1  16-22 
                                           utmp2032  2032    0   120   0   0-15 
addr  *********     1     645  0   usr_tmp utmp1234  1234    0    41   0      21
data  *********    92     737  8   usr_tmp utmp2132  2132    0   119   1  16-22 
                                           utmp2132  2132    0   120   0   0-15 
data  *********    92     829  1   usr_tmp utmp2030  2030    0    41   1  14-22 
                                           utmp2030  2030    0    42   0   0-13 
data  *********    92     921  2   usr_tmp utmp2130  2130    0    41   1  13-22 
                                           utmp2130  2130    0    42   0   0-12 
data  *********    92    1013  3   usr_tmp utmp2230  2230    0    41   1  14-22 
                                           utmp2230  2230    0    42   0   0-13 
data  *********    92    1105  4   usr_tmp utmp2330  2330    0   443   0   8-22 
                                           utmp2330  2330    0   443   1   0-7  
data  *********    92    1197  5   usr_tmp utmp3030  3030    0    64   1   1-22 
                                           utmp3030  3030    0    65   0       0
data  *********    92    1289  6   usr_tmp utmp3330  3330    0   120   0  16-22 
                                           utmp3330  3330    0   120   1   0-15 
data  *********    92    1381  7   usr_tmp utmp2032  2032    0   120   0  16-22 
                                           utmp2032  2032    0   120   1   0-15 
data  *********    92    1473  8   usr_tmp utmp2132  2132    0   120   0  16-22 
                                           utmp2132  2132    0   120   1   0-15 
data  *********    92    1565  9   usr_tmp utmp2232  2232    0   119   1  16-22 
                                           utmp2232  2232    0   120   0   0-15 
data  *********    92    1657 10   usr_tmp utmp2332  2332    0   119   1  16-22 
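
The octal cbits value reported by fck can be decoded back into a partition list to confirm that it matches what was requested. The following Python sketch is illustrative only; partitions_from_cbits is a hypothetical helper, not an eag_ffio or UNICOS routine.

```python
def partitions_from_cbits(cbits):
    """Return the partition numbers whose bits are set in a cbits mask."""
    return [bit for bit in range(cbits.bit_length()) if (cbits >> bit) & 1]


fck_cbits = int("0177776", 8)      # cbits field as printed by fck (octal)
print(hex(fck_cbits))              # same value in hexadecimal
print(partitions_from_cbits(fck_cbits))
```

The decoded list contains partitions 1 through 15, with partition 0 excluded, as requested by cbits=0xfffe.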

The ja output indicates that the I/O performance is very similar to the example run without user striping. This results from the lack of any asynchronous I/O requests from the program to the kernel. For this example it makes no difference whether the synchronous requests are satisfied by 1 channel or 15 channels. Since each request is synchronous, there is no possibility for multiple requests to be satisfied concurrently, which is what would take advantage of the multiple channels.


Operating System                 : sn4025 hot 8.1.0cd u81.2 CRAY C90
Report Starts                    : 04/21/94 10:06:32
Report Ends                      : 04/21/94 10:06:49
Elapsed Time                     :           17      Seconds
User CPU Time                    :            0.0277 Seconds
System CPU Time                  :            0.7761 Seconds
I/O Wait Time (Locked)           :           17.0975 Seconds
I/O Wait Time (Unlocked)         :            0.1331 Seconds
Data Transferred                 :            5.8594 MWords
Maximum memory used              :            0.1563 MWords
Logical I/O Requests             :         3003
Physical I/O Requests            :         3056
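
To make the comparison concrete, the two ja reports can be reduced to an effective transfer rate. The Python sketch below is an illustration, not output from any Cray tool. It takes a word to be 8 bytes and an Mword to be 1048576 words, which is consistent with the byte counts reported elsewhere in these examples, and it divides the data transferred by the locked I/O wait time.

```python
WORD_BYTES = 8               # 64-bit words
MEG = 1024 * 1024            # Mwords and Mbytes are taken as multiples of 2**20


def effective_rate(mwords, locked_wait_seconds):
    """Return Mbytes transferred per second of locked I/O wait time."""
    return mwords * WORD_BYTES / locked_wait_seconds


# (Data Transferred in MWords, I/O Wait Time (Locked) in seconds) from ja
runs = {
    "example 1 (syscall only)": (5.8594, 17.6853),
    "example 2 (user striping)": (5.8594, 17.0975),
}

rates = {name: effective_rate(*values) for name, values in runs.items()}
for name, rate in rates.items():
    print("%-26s %.2f Mbytes/s" % (name, rate))
```

The two rates differ by only a few percent, which is the point made above: striping alone does not help a program that issues only synchronous requests.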

Example 3: Using eie and user striping

We now can run the program with FF_IO_OPTS set as follows:


	setenv FF_IO_OPTS "*.dat ( eie.mem.diag:92:20:5:0 | set:92:0xfffe )"

Again we have the two files of interest, example.log and example.ja.

%cat example.log
***********************************************************************
FF_IO version 7.1BF1 Apr  6 1994  C90 UNICOS 80
program=./example
Thu Apr  7 11:24:43 1994

FF_IO_DEFAULTS  = (NULL)
FF_IO_OPTS      = *.dat ( eie.mem.diag:92:20:5:0 | set:92:0xfffe )
FF_IO_LOGFILE    =example.log
FF_IO_OPEN_DIAGS =true
FF_IO_TRACE_FILE =(NULL)
FF_IO_RECOVER_CMD=(NULL)


ffopen(/tmp/jtmp.000186a/.assign)    ft:0    gt:246=KER
     checking templates :*.dat
      != *.dat
/tmp/jtmp.000186a/.assign : will not use ffio


ffopen(file.dat)    ft:0    gt:0
     checking templates :*.dat
      == *.dat
       requested layers :set|eie
             opening layer
eie.mem.diag.save.nobpons.wb.rls.listio.bytes:92:20:5:0::
             opening layer system
file.dat : using layers : eie syscall 


eie_close EIE final stats for file         file.dat
eie_close  Used private cache
eie_close  20 mem pages of 92 blocks (23 sectors), max_lead = 5 pages
eie_close  advance reads used/started :      108/     113   95.58%
eie_close  read  hits/total           :     1998/    2000   99.90%
eie_close  write hits/total           :      998/    1000   99.80%
eie_close  Data transferred (    bytes )  program --> eie  --> syscall 
eie_close                                    16384000  16384000
eie_close                                    32768000  25427968
eie_close                                 program <-- eie  <-- syscall 

Again, the file file.dat matched the template *.dat and the layers eie and set were opened for the file. When the file was closed, the diag option to eie generated the above output, informing the user of the activity of the cache. The advance reads used/started line reflects the percentage of preread pages that the cache anticipated correctly. The read and write hits lines indicate the percentage of incoming requests that were satisfied by the cache without the cache having to request any data from its child layer. The "Data transferred" lines indicate the number of bytes read and written by the cache's parent, and the number of bytes read and written by the cache from its child.
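
The percentages in the eie statistics are simple ratios of the two counts printed on each line. The Python sketch below recomputes them from the counts in the log above; it is an illustration only, and the dictionary layout is an assumption made for this example.

```python
def percent(used, total):
    """Return used/total expressed as a percentage."""
    return 100.0 * used / total


# Counts copied from the eie_close lines in example.log
stats = {
    "advance reads used/started": (108, 113),
    "read  hits/total": (1998, 2000),
    "write hits/total": (998, 1000),
}

for label, (used, total) in stats.items():
    print("%-27s: %8d/%8d %7.2f%%" % (label, used, total, percent(used, total)))
```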


%cat example.ja
Operating System                 : sn4025 hot 8.1.0bw d81.21 CRAY C90
Report Starts                    : 04/07/94 11:24:43
Report Ends                      : 04/07/94 11:24:45
Elapsed Time                     :            2      Seconds
User CPU Time                    :            0.0621 Seconds
System CPU Time                  :            0.0715 Seconds
I/O Wait Time (Locked)           :            2.4317 Seconds
I/O Wait Time (Unlocked)         :            0.1833 Seconds
Data Transferred                 :            4.9854 MWords
Maximum memory used              :            1.0859 MWords
Logical I/O Requests             :          126
Physical I/O Requests            :          125

The ja statistics now report that 126 logical I/O requests transferred 4.9854 Mwords of data with 2.4317 seconds of locked I/O wait time. The differences in the ja statistics reflect the effects of the eie cache. There are only 126 logical I/O requests, versus the original 3003, since the cache is buffering the program's 16384-byte requests into cache page requests that are 92 blocks (376832 bytes) in size. The Data Transferred is lower since data was used multiple times out of the cache, meaning the program did not have to go to the kernel for the data that was reused. The I/O wait time is much lower since the cache prefetched the bulk of the data with asynchronous requests, which were completed by the time the program actually needed the data.

There are several numbers that can be examined for a sanity check. In the example run without the eie cache, ja reported 5.8594 Mwords of Data Transferred. Since the program issued requests to the eie cache, rather than straight to the kernel, eie should report the same amount of data being requested by the program. Adding the 16384000 bytes written and 32768000 bytes read by the program (as reported by the eie cache) totals 49152000 bytes = 46.875 Mbytes = 5.8594 Mwords, which matches the non-cached Data Transferred. It also should be noted that the "Maximum memory used" increased from 0.1953 Mw in the non-eie example to 1.0859 Mw in the eie example. The increase was caused by the additional memory used by the eie cache pages (92 blocks * 512 words per block * 20 pages / 1048576 = 0.8984 Mw).
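
The sanity check above can be reproduced with a short calculation. The Python sketch below is illustrative only; it uses the 4096-byte block size and 8-byte word size implied by the figures in this example.

```python
BLOCK_BYTES = 4096           # 92 blocks = 376832 bytes in the eie output
WORD_BYTES = 8
MEG = 1024 * 1024

# Bytes written and read by the program, as reported by the eie cache
written, read = 16384000, 32768000
total_bytes = written + read
total_mbytes = total_bytes / MEG
total_mwords = total_bytes / WORD_BYTES / MEG
print(total_bytes, total_mbytes, round(total_mwords, 4))

# Memory occupied by 20 cache pages of 92 blocks each, in Mwords
page_blocks, pages = 92, 20
cache_mwords = page_blocks * (BLOCK_BYTES // WORD_BYTES) * pages / MEG
observed_increase = 1.0859 - 0.1953
print(round(cache_mwords, 4), round(observed_increase, 4))
```

The computed cache size and the observed growth in "Maximum memory used" agree to within about 0.01 Mwords.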

The fck output for this example is identical to that of example 2, since the user striping parameters are identical (cbits=0xfffe and cblks=92). Unlike example 2, user striping does provide a benefit in this example, since the eie cache is issuing asynchronous requests to the kernel to preread data into the cache. Many of these cache page prereads may be satisfied in parallel since the file is laid out on the file system with consecutive pages residing on independent channels. Note that the cache page size is the same as the striping factor, cblks, which places one full cache page per channel before cycling back to the first partition. This means each logical I/O request results in one physical I/O request, which is reflected in the ja output. Alternate strategies may be implemented which stripe each cache page over multiple channels (using a cache page size that is a multiple of the stripe factor). With such a layout, each logical I/O request to read a cache page requires multiple physical I/O requests.
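
The relationship between cache page size and stripe factor can be sketched with an idealized round-robin model. The Python below is an assumption-laden illustration, not a description of the kernel allocator: it supposes that consecutive groups of cblks blocks rotate strictly through the partitions named in cbits, which the actual fck listing does not follow exactly. The function name partitions_for_page is hypothetical.

```python
def partitions_for_page(page_index, page_blocks, cblks, partitions):
    """Return the partitions holding one cache page under ideal round robin."""
    first_block = page_index * page_blocks
    last_block = first_block + page_blocks - 1
    first_stripe = first_block // cblks
    last_stripe = last_block // cblks
    return [partitions[stripe % len(partitions)]
            for stripe in range(first_stripe, last_stripe + 1)]


parts = list(range(1, 16))                      # partitions 1-15, cbits=0xfffe

# Page size equal to cblks: each page sits on a single partition.
print(partitions_for_page(0, 92, 92, parts))    # [1]
print(partitions_for_page(1, 92, 92, parts))    # [2]

# Page size of twice cblks: each page spans two partitions.
print(partitions_for_page(0, 184, 92, parts))   # [1, 2]
```

In this model the number of partitions returned for a page corresponds to the number of physical I/O requests needed to read that page.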

Example 4: Using the event layer

This example is identical to example 3 with the exception that the event layer will be used both before and after the eie cache to monitor the I/O events coming into the cache from the program and the I/O events issued to the syscall layer by the eie cache.

The following script is used to run the program:


#!/bin/csh
setenv FF_IO_LOGFILE example.log
setenv FF_IO_OPEN_DIAGS true
setenv FF_IO_DEFAULTS " event.summary ,eie.mem.diag:92:20:5:0 ,set:92:0xfffe "
setenv FF_IO_OPTS "*.dat ( set | event | eie | event | syscall )"
#
ja
./example
ja -cls > example.ja

The ja output for this example is nearly identical to that of example 3, since only the event layer was added, which uses very little CPU time and issues only a few logical I/O requests.

The event layer before the eie cache will output a summary similar to the following.

 
evt_close(file.dat) program<-->eie (  49152000 bytes)/(    2.47 s)=19899595.87 bytes/s

    open flags=0x0000400000002342=RAW+RDWR+CREAT+TRUNC+PLACE
    sector size =4096(bytes)
    cblks =92  cbits =0x000000000000fffe
    current file size =16384000 bytes   high water file size =16384000 bytes

    function       times  ill    wait      bytes       bytes   min     max     avg     all
                  called  formed time  requested   delivered request request request  hidden
       open            1         0.05
       seek         1001
       write        1000      0  1.45   16384000    16384000   16384   16384   16384
       read         2000      0  0.97   32768000    32768000   16384   16384   16384
       close           1         0.00
       extends      1000

The first line of the event layer output reports the file being closed and its layer position. The string "program<-->eie" indicates that this event summary is for the event layer between the program and the eie layer. This event layer reports statistics that would be expected from the program being run. There were 1000 writes of 16384 bytes each, 2000 reads of 16384 bytes each, and 1001 seeks. The event layer below the cache reports the I/O requests that the eie cache is making to the kernel via the syscall layer. It is observed from the event layer output that the eie cache is using listio requests rather than reads and writes. The 1000 program writes have been reduced to 44 asynchronous writes to the kernel, 43 of which were completed by the time the asynchronous write was recalled. This is the write behind feature of the eie cache. There were 2 synchronous reads of cache pages issued by the cache before the sequential access was detected, and 66 asynchronous reads then followed. 38 of the asynchronous reads were completed by the kernel before the data was actually needed by the eie cache to satisfy a program request for data. This is the benefit of the read ahead feature of the cache. The recall wait time under the fcntl heading is the amount of time spent waiting for the kernel to complete the asynchronous reads and writes.


evt_close(file.dat  )  eie <-->syscall  (  41811968 bytes)/(    2.42 s)=17277672.64 bytes/s

    open flags=0x0000400000002342=RAW+RDWR+CREAT+TRUNC+PLACE
    sector size =4096(bytes)
    cblks =92  cbits =0x000000000000fffe
    current file size =16384000 bytes   high water file size =16384000 bytes

    function       times  ill    wait      bytes       bytes   min     max     avg     all
                  called  formed time  requested   delivered request request request  hidden
       open            1         0.05
       listio        112         1.58
          seek        27
          writea      44       0        16384000    16384000  180224  376832  372363      43
          read         2       0          753664      753664  376832  376832  376832
          reada       66       0        24674304    24674304  180224  376832  373853      38
       fcntl
          recall     110         0.79
          other        3         0.00
       flush           1         0.00
       close           1         0.00
       extends        44
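
The figures in the lower event summary can be cross-checked against each other and against the eie statistics from example 3. The Python sketch below is illustrative; the values are copied from the listing above, and the hidden count for the synchronous read, which the listing leaves blank, is entered as 0 here as an assumption.

```python
# (times called, bytes delivered, requests fully hidden) from the
# eie <-->syscall event summary
below_cache = {
    "writea": (44, 16384000, 43),
    "read": (2, 753664, 0),
    "reada": (66, 24674304, 38),
}

# Total bytes moved between eie and syscall; compare with the evt_close line.
total_bytes = sum(nbytes for _, nbytes, _ in below_cache.values())
print(total_bytes)

# Bytes read from the child layer; compare with the eie "Data transferred" lines.
read_bytes = below_cache["read"][1] + below_cache["reada"][1]
print(read_bytes)

# Average request size and fraction of requests hidden, per function
for name, (calls, nbytes, hidden) in below_cache.items():
    print("%-7s avg=%d hidden=%.0f%%" % (name, nbytes // calls,
                                         100.0 * hidden / calls))
```

The totals agree with the 41811968 bytes on the evt_close line and the 25427968 bytes that eie reported reading from the syscall layer.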


Original author John Bauer. by Kevin Thomas.