Fast/Unload DATESTAT analysis: Difference between revisions
(Automatically generated page update) |
(Automatically generated page update) |
||
Line 70: | Line 70: | ||
</nowiki></p> | </nowiki></p> | ||
For the meaning of the components of these formats, see | For the meaning of the components of these formats, see | ||
[[#dtfmt|Datetime formats]]. | [[Fast/Unload datetime processing considerations#dtfmt|Datetime formats]]. | ||
<div id="dtrep"></div> | <div id="dtrep"></div> |
Revision as of 19:40, 15 January 2015
If the DATESTAT statement is present in the FUNIN dataset, Fast/Unload will determine which fields in the file contain date values. The determination of whether a field contains a date value is done in two passes; the first pass examines 1000 evenly-distributed records in the file, and the second pass analyzes all records. At the end of each pass, certain fields are retained for date analysis and reporting. A field is retained if:
- for pass one, the field is not found in the 1000 records
- at least one value of the field is found in the pass which
conforms to a date format, except:
* a field is not a date if all dates were all numeric, there were more than 5 distinct non-date values, and less than 50% of the instances are dates. This is to attempt to avoid situations in which a field such as zip code could be treated as a 5 digit (YYDDD) date.
Note that Blob fields are not candidates for DATESTAT analysis.
The analysis keeps tracks of various totals that are used in reporting, and also keeps some of the values found in the field. Up to 36 date formats, 20 nondate values, and 120 year samples will be kept. When these overflow, the least-recently-found year and nondate samples are discarded, and the most-recently-found date format sample is discarded. Also, only up to 22 characters of nondate values are kept, so two different values which are the same in the first 22 characters will be considered to be the same value.
The date formats searched for consist of the following list, plus each one of the formats in this list, followed by a blank and time in the form HH:MI:SS, if the date format contains a character other than M, D, or Y, or followed immediately by time in the form HHMMSS otherwise.
MON DAY YY MON DAY YYYY DAY MON YY DAY MON YYYY DAY MONTH YY DAY MONTH YYYY DAY MON, YY DAY MON, YYYY DAY MONTH, YY DAY MONTH, YYYY DDMMYY DDIMMIYY DDMMYYYY DDIMMIYYYY MONTH DAY YY MONTH DAY YYYY MMDDYY MMIDDIYY MMDDYYYY MMIDDIYYYY YYDDD YY MON DAY YY MONTH DAY YYMMDD YYIMMIDD YYYYDDD YYYY MON DAY YYYY MONTH DAY YYYYMMDD YYYYIMMIDD
For the meaning of the components of these formats, see Datetime formats.
DATESTAT reporting
This section describes the reports created by the DATESTAT SUMMARY and DATESTAT DETAIL statements. Both of these statements indicate something about the "quality" of the date data in the field. The purposes of "date field quality" are:
- If you are running DATESTAT SUMMARY, and you have some values other than pure, you may want to do further investigation of the date fields.
- To indicate how much work might be involved to resolve the various values stored in the field. One tool to resolve the values is to run DATESTAT DETAIL; doing some ad-hoc work with User Language is another approach. The "worse" the quality of a field, the more work is likely to be required to resolve questions about the field values.
The quality is expressed as pure, good, fair, or poor; it is an attempt to measure the possible level of effort required to correct data on the file. The terms have the following meanings:
Pure | means that there are only date values, and all with a single date format. |
---|---|
Poor | means either there are more than 10 distinct non-date values and more than .01% of the field occurrences are non-date values, or the percent of date values which have uncommon formats, times the number of uncommon formats, is greater than .01%, or there are more than 20 date formats. If there are more than 20 date formats, or if the ratio of occurrences of the most common date format to occurrences of the next most common date format is less than 10 to 1, then Fast/Unload prints the string "Common date format not found". |
Fair | means either there are more than 5 distinct non-date values and more than .001% of the field occurrences are non-date values, or the percent of date values which have uncommon formats, times the number of uncommon formats, is greater than .001%. |
Good | is anything else. |
DATESTAT SUMMARY
DATESTAT SUMMARY creates a report with 1-3 lines for each date field, in the following form:
- ftag format (qual) span field...n: name
- tot occurrences of field format count
- nsmp occurrences of nondate value (len len): sample
Where:
- This line is always present; the components are:
- ftag is either "Common date format not found", if there are more than 20 date formats, or "Common format:" otherwise.
- format is the most commonly occurring date format, if ftag is "Common format:", or blank otherwise.
- qual is either "pure", "good", "fair", or "poor".
- span is either:
CENTSPAN: YYYY
or No YY occurrences
The former gives a recommended CENTSPAN, if the field has any 2-digit years, where YYYY is the oldest 2-digit year found; the latter occurs if the field does not have any 2-digit years. (2-digit years are interpreted using a CENTSPAN of 1900, or, if running DATESTAT after 1999, a CENTSPAN of -99.) - n is a sequential numbering of the fields; this can be correlated to an FSTATS report.
- name is the name of the field.
- This line is always printed if the qual
is not pure.
The components are:
- tot is the total number of occurrences of the field in the file
- format count is either n different formats found, if n is more than 1, or blank otherwise.
- This line is always printed if there are any nondate values for the field.
The line has one of the two following forms:
- nsmp occurrences of nondate value (len): truncated_value trunc_flag
- dist distinct nondate values
Where nsmp is the number of occurrences of a nondate value, if it appears to be exactly one nondate value in the field, and len is its length and truncated_value is the first 22 characters of its value. Trunc_flag is (first 22 bytes) if the length is greater than 22. In fact, some of the nondate values in the field may differ, if their first 22 characters are the same.
If there are 2 or more nondate values in the field, then the second form of this line is presented, where dist is the number of different nondate values that Fast Unload has kept as samples (up to the maximum of 20).
Note that for the recommended CENTSPAN, Fast/Unload assumes that all 2-digit years occur in the 1900s, or, if running DATESTAT after 1999, 99 years before the date of the run up to and including the date of the run.
DATESTAT DETAIL
DATEST DETAIL creates a report with 1 page for each date field, which includes:
- The number of field occurrences.
- The number of date occurrences, with the minimum and maximum date value, for both 4 digit ("YYYY") and 2 digit ("YY") years.
- A sample of the discovered date formats.
- A sample of the year values occurring in the field.
- A sample of the non-date values occurring in the field.
The sampling rules are described in Fast/Unload DATESTAT analysis; some values
may be discarded.
An asterisk (*
) is printed after a sample if
the sample occurrence count is incomplete.
Note that for the recommended CENTSPAN and the min/max values of the DETAIL report, Fast/Unload assumes that all 2-digit years occur in the 1900s, or, if running DATESTAT after 1999, 99 years before the date of the run up to and including the date of the run.