File loading techniques
Good planning and correct technique are essential when loading files. The process of creating a file is complex, and failure to observe good practice can be both expensive and time consuming.
This page recommends a number of steps that you can follow to avoid problems while loading files. Techniques are also described for refining file parameter settings and space estimates based on the results of experiments made with sample data from the file being loaded. These experiments generally are not necessary if you know the exact format of the data in advance.
Verifying data formats and load programs
Rocket Software recommends that you take the following steps to verify that the input data is properly formatted and that the load program is functioning properly:
- Check the format of the input data, if possible, using the IBM IEBPTPCH utility or a sample file load program with the P statement.
- Run a sample load to create 100 to 1000 Model 204 records.
If records are being loaded with a file load program, check the counters printed at the end of the program.
- After loading the records, issue the DISPLAY FIELD command:
DISPLAY FIELD (DEFN) ALL
And run a SOUL request of the form:
BEGIN GET.RECS: FIND ALL RECORDS END FIND FOR.LOOP: FOR 10 RECORDS IN GET.RECS SKIP 1 LINE PRINT ALL INFORMATION END FOR END
This request helps to verify that field names are properly defined and that the data actually loaded into the Model 204 file looks right. A Host Language Interface program can be substituted to print the records.
- Complete the load for the sample 100 to 1000 records by running the last four job steps of the deferred update sequence.
- Check NUMERIC RANGE retrieval fields for invalid input data. For example, if
DATEhave the NUMERIC RANGE retrieval option, this SOUL request prints those values that are not valid numbers:
BEGIN FIND.AGE: FIND ALL RECORDS FOR WHICH AGE++ = NON NUMERICAL END FIND AGE.LOOP: FOR EACH RECORD IN FIND.AGE PRINT AGE SKIP 1 LINE END FOR FIND.DATE: FIND ALL RECORDS FOR WHICH DATE++ = NON NUMERICAL END FIND DATE.LOOP: FOR EACH RECORD IN FIND.DATE PRINT DATE END FOR END
- Check fields that have both the BINARY and FRV attributes for noncompressible values with this request:
BEGIN FOR.LOOP: FOR EACH VALUE OF fieldname PRINT VALUE IN FOR.LOOP END FOR END
Estimating load times and working storage
There is no exact formula for calculating how long it takes to load a file. The most reliable method for arriving at an estimate is to extrapolate from the CPU and elapsed time used by a sample load. The larger the sample, the better the estimate. For very large files, a sample that fills one segment (49,152 records) is recommended.
Estimating the load time of a deferred update data set
The size of the deferred update data set can be estimated from the number of deferred update records produced by the sample load. If the deferred update data set is on tape, more than one volume can be specified in the JCL.
Estimating working storage requirements
The size required for the sort work data sets can be estimated after the sample load, or it can be determined more accurately after each main load step from the actual number of deferred update records.
Loading large files
For purposes of this discussion, a large file is one that contains more than one segment of records (that is, more than 49,152 records). Load files one segment at a time. This ensures the most efficient use of Table D space and provides convenient recovery points. Load files containing less than one segment all at once.
Loading files with ORDERED fields
For very large files that have ORDERED fields, retrievals are most efficient for the ORDERED fields if the file is loaded multiple segments at a time. There is no advantage to loading more than 16 segments at a time.
Take the following steps for each segment:
- Load the data into Tables A and B (Phase one).
- Dump the file (see File dumping and restoring). The first step of the File Load sequence is the most expensive. Dumping after this step provides a backup in case of failures in the later index steps.
- Run all the sort and index steps (Phase two).
- Dump the file again. This preserves all the work done so far in case of an error in the main load step for the next segment.
Keep the output data set of each sort until the segment has been completely loaded and dumped. If Table D fills up during an index step, the file can be restored as of the end of the last main load step. After Table D is increased, the index step can be rerun without the sort. If Table C fills up, the entire File Load must be rerun with a larger value for CSIZE.
Refining the file parameter settings
After sample records have been loaded, a number of file parameters can be reevaluated based on actual data. A few parameters can be viewed to determine whether any incorrect estimates have been made. These parameters are grouped by file table and are described in the following sections.For more detailed information about setting the file parameters, see File size calculation in detail.
Loading Large Object data
A Model 204 file that contains Large Object data can be reorganized using the standard PAI and FLOD procedure. This is identical to previous releases of Model 204, and no change to your existing unload or load code is required.
To display complete Large Object data, you must use a PAI statement, not a PAI INTO statement.
The output from the PAI statement for a field that is not Large Object data looks like this:
FIELDNAME = FIELDVALUE
The output from the PAI statement for a Large Object field looks like this:
FIELDNAME = <descriptor> LOB data line 1 LOB data line 2 ... LOB data line n
When you store a Large Object value in Table E, a Large Object descriptor is stored in Table B. The descriptor contains the length, reserve, and a pointer to the Large Object data in the Table E page. The descriptor is an internal Model 204 structure. Usually it is of no concern to the File Manager or FLOD programmer, as the descriptor is built by the PAI statement that unloaded the data.
However, if loading Large Object data from an external source you may need to build the Large Object data descriptor be stored in Table B.
The actual Large Object data is on subsequent records; as many as are required to hold the Large Object value. This depends on the OUTMRL string of the output data set to which the PAI output is directed.
When a FLOD executes a statement to load a field value, it checks whether the field is defined as Large Object, and if so the Large Object fieldname=value pair is created and stored in Table B.
Partial file loads
In some applications, only a portion of the records is available when the file is initially loaded. In some cases, records are added later in large batches; in other cases, records are added Online. When such files are created, set most parameters to the values they would have if the file were to be fully loaded. BSIZE and DSIZE might be set lower so that disk storage is not tied up until actually needed.
Table A parameters
If the size of Table A (
ATRPG + FVFPG + MVFPG) is too small, the value of ARETRIES is nonzero. The ARETRIES parameter (Table A page retries) indicates when one or more sections of Table A is filling up.
ARETRIES is a counter that is incremented each time Model 204 fails in an attempt to store an item on a page and attempts to store the item on another page. Without reevaluating the input data, you cannot determine which section is filling up. Each retry indicates a slight degradation in update or retrieval efficiency for field names or for coded or FRV fields.
If the ATRPG parameter is greater than 1, page retries may indicate inefficiency of all field name processing. If ARETRIES is very high (near ASIZE) and new field names or values are anticipated in the file, increase the Table A size parameters to avoid an eventual table full condition.
Table B parameters
The BRECPPG parameter specifies the number of Table B records per page and the BRESERVE parameter specifies the Table B reserved space per page. You can view the value of these parameters with the TABLEB command.
Viewing records and free space with the TABLEB command
The TABLEB command displays information about the records and free space in Table B. The TABLEB output always includes the following items:
- Average free space per page
- Average free record slots per page
- Number of pages processed
- Value of BRECPPG (number of Table B records per page)
- Value of BRESERVE parameter (Table B reserved space per page)
The format of TABLEB is:
TABLEB [LIST | OVERFLOW | MASTER | RECORD LENGTH | PAGE page_number | PAGES page_number1 TO page_number2]
- LIST produces one line for each page in Table B. If no other option is specified, TABLEB scans all of Table B, starting with the lowest active page and ending with the highest one. For each page, the page number, amount of free space, and number of free slots are shown. This detailed listing is followed by the list of averages and totals specified above.
If the file being scanned is sorted, each printed detail line includes an indication of the type of page that the line describes. The OVERFLOW and MASTER options list only pages from the OVERFLOW or MASTER area of a sorted file.
- PAGE and PAGES allow you to set the range of Table B pages to scan. If any of the pages specified are inactive, an error message is issued and the TABLEB command is ignored.
- RECORD LENGTH (which can be abbreviated as RECLEN) calculates the average length of a logical record (a record that includes all of its extensions) in Table B.
This length is exact unless DELETE RECORDS or IFDSET is used in the file. In this case, RECLEN is too long, because space from the deleted records is never released.
RECLEN is valid only if all of Table B is scanned (that is, only if you do not specify the PAGE or PAGES options).
Checking Table B parameters
The two basic goals in loading Table B properly are:
|Wasting record numbers||Number of free slots per page shown by the TABLEB command is not zero.||
Record numbers are being wasted.
|Creating extension records||If, after loading Table B, the EXTNADD value is not zero||Extension records have been created.|
The following section discusses how to prevent these two conditions.
Preventing wasted record numbers
You can recover wasted record numbers in a file by setting BRECPPG to a smaller value or, less often, BRESERVE.
If, according to the TABLEB output, space is being wasted on each Table B page and no record numbers have been skipped, then BRECPPG might be set too low.
Preventing extension records
If the value of EXTNADD is too high (that is, if you have too many extension records) reload the file with a smaller BRECPPG or a larger BRESERVE.
Note: If a few records within the file are much longer than the average record, it might be most efficient to allow the longest records to create extension records. For more information about sizing BRESERVE to avoid extension records, see Computing BRESERVE (reserved Table B space).
Sizing Table B
This section describes the BSIZE and BHIGHPG parameters. For more information about sizing Table B, see Sizing Table B.
Using BHIGHPG to size BSIZE
To check your estimate of BSIZE, look at the BHIGHPG parameter. In ordinary and sorted files, BHIGHPG is the number of the last page used, one less than the number of pages used. If this number is significantly different from the estimate of space required for the sample, then adjust BSIZE accordingly.
Checking for overflow records in sorted files
In sorted files, examine OVFLADD (overflow records) and EOVFLADD (extra overflow records). The values of both of these parameters should be zero after the initial load. If they are not, either the last master area has been filled or, more likely, the input data was not properly sorted. The former condition occurs only if BHIGHPG is one less than BSIZE. For more information about sorted files, see Defining fields manually.
Checking for spill records in hash key files
Check SPILLADD (spills) in hash key files. Spill records are retrieved by their hash keys more slowly than other records. If the number of spills is not zero, raise BRECPPG, BSIZE, or both. Only raise BRECPPG if the TABLEB command shows pages that have no unused record slots but lots of available space.
Table C parameters
To monitor Table C, use the CRETRIES parameter and the TABLEC command, as described in the following sections.
Monitoring Table C with the CRETRIES parameter
As in Table A, a page retry statistic (the CRETRIES parameter) is provided. Each retry indicates a slight degradation in indexed retrieval speeds. Perhaps more significantly, page retries are a signal that Table C is becoming crowded. Although these figures vary from file to file, experiments have produced results similar to those in Monitoring Table C with the CRETRIES parameter (from a file with
As shown in the figure above, page retries usually are not significant when utilization is below 70 percent. At approximately 72 percent, the number of retries begins to grow more rapidly as utilization increases (new KEY values are added). The point at which retries sharply begin to increase is 80 percent; at 90 percent the slope further increases. A rough guideline for users concerned with access performance is to aim for 75 percent utilization. If space is the only constraint, you should be able to achieve 85 percent utilization. Remember that this figure varies.
Utilization of Table C
Table C never can be 100 percent utilized. A Table C full condition usually is reached when utilization exceeds 85 percent. Utilization of up to 95 percent has been achieved, but the precise point varies at which a Table C full condition results.
Table C space requirements
New KEY values require only two slots in Table C for small files. If the file size multiplier, discussed in Computing the file size multiplier (N), is greater than 1, KEY values might require more than two slots. Therefore, at a rate dependent upon the level and nature of update activity, the number of slots used increases.
Use the TABLEC command to see what percentage of Table C space actually is used.The TABLEC command displays the following information about Table C usage:
- Number of slots available in Table C
- Number of slots used
- Percentage of Table C used
- Total number of property entries
- Number of pages with spills
- Number of primary property entries with overflow
- Percentage of property entries with overflow
OPEN CENSUS TABLEC M204.0638: NUMBER OF SLOTS = 102400 M204.0639: SLOTS USED = 76800 M204.0640: PERCENTAGE OF TABLE C USED = 75 M204.1833: THE TOTAL NUMBER OF PROPERTY ENTRIES IS 7496 M204.1874: THE NUMBER OF PAGES WITH SPILLS IS 33 M204.1834: 2500 PRIMARY PROPERTY ENTRIES HAVE OVERFLOW M204.1871: THE PERCENTAGE OF PROPERTY ENTRIES WITH OVERFLOW IS 33
Detecting overflows with the TABLEC command
The TABLEC command reports on property entry overflows vs. number of pages with spills. As described in, Sizing Table C, Table C contains property entries and segment entries.
A field name = value pair stored in Table C normally consists of one property entry for the pair, and as many segment entries as there are segments having records with that field value. If there is no space available to store an additional segment entry on the page where the property entry resides, the segment entry is created on the next page that has sufficient space.
When this occurs, the property entry page is said to have spilled. Model 204 creates a duplicate property entry on the page spilled onto along with the segment entry placed there. The original property entry is updated to point to this new property entry as an overflow entry.
The effects of this overflow are:
- Additional space is required for the overflow property entry.
- Additional I/O is involved in retrieving the entries during FIND processing.
Decreasing the number of overflows
Because property entry distribution in Table C is entirely dependent on the data values and the hashing algorithm, avoiding overflows might be impossible. However, increasing the size of Table C, which requires recreating the file, might provide a greater dispersal of the Table C entries and, therefore, fewer overflows. As Table C fills and overflows increase, you might have to choose between disk space requirements (not recreating the file) and FIND processing performance (recreating the file).
The number of pages with spills, reported by TABLEC, indicates how uniformly the Table C space is being filled. If the percentage of property entries having overflow is high, and the number of pages with spills is also high, then the file is apparently filling uniformly and might become full if substantial additions to Table C are processed. In this case, you might need to re-create the file and increase Table C.
If the percentage of property entries having overflow is high, but the number of pages with spills is low, then a "hot spot" or "cluster" exists due to many of the data values hashing to the same page(s). In this case, the file might not be in danger of filling, but performance might be further degraded by the cluster. Experimenting with alternative Table C sizes might alleviate the cluster, but some clusters are probably unavoidable.
In the previous sample TABLEC output, both the percentage of overflows and the number of page spills are relatively high. If substantial additional updating to Table C is anticipated, the percentage of Table C used climbs, and any performance problems caused by too many overflows might be compounded by excessive page retries. If the concern for performance outweighs the need for disk space, re-create the file with a larger Table C.
In the following sample output, the percentage of overflows is relatively high but the number of page spills is low and the values are clustered. What performance degradation exists might or might not be improved by increasing Table C:
NUMBER OF SLOTS = 102400 SLOTS USED = 76800 PERCENTAGE OF TABLE C USED = 75 THE TOTAL NUMBER OF PROPERTIES IS 7496 800 PRIMARY PROPERTY ENTRIES HAVE OVERFLOW THE PERCENTAGE OF PROPERTY ENTRIES WITH OVERFLOW IS 10 THE NUMBER OF PAGES WITH SPILLS IS 2
Table D parameters
Only one parameter can be viewed to check the Table D size estimate. DPGSUSED shows the number of Table D pages used and can be compared directly to estimates of DSIZE for the sample loaded. The amount of Table D used can be accurately determined only after a complete segment has been loaded.
Table D usage does not increase linearly within a segment.
The estimate for the Ordered Index B-tree node space (see Calculating Ordered Index space) can be checked by viewing the OINODES parameter. OINODES shows the number of Table D pages used for the Ordered Index B-tree. The exact number of pages necessary for the Ordered Index B-tree can be accurately determined only after each of the values of all the ORDERED fields has been loaded.