Fast/Unload DATESTAT analysis: Difference between revisions
(Automatically generated page update) |
m (minor formatting) |
||
(10 intermediate revisions by 2 users not shown) | |||
Line 1: | Line 1: | ||
< | ==DATESTAT sampling== | ||
If the <var>[[Fast/Unload Extraction Language (FUEL)#datestat|DATESTAT]]</var> statement is present in the <code>FUNIN</code> data set, | |||
<var class="product">Fast/Unload</var> will determine which fields in the file contain date | <var class="product">Fast/Unload</var> will determine which fields in the file contain date | ||
values. | values. | ||
The determination of whether a field contains a date value is done | The determination of whether a field contains a date value is done in two passes: | ||
in two passes | <ul> | ||
examines 1000 evenly-distributed records in the file | <li>The first pass examines 1000 evenly-distributed records in the file.</li> | ||
pass analyzes all records. | |||
<li>The second pass analyzes all records.</li> | |||
</ul> | |||
At the end of each pass, certain fields are retained for date | At the end of each pass, certain fields are retained for date | ||
analysis and reporting. | analysis and reporting. A field is retained if: | ||
A field is retained if: | <ul> | ||
< | <li>For pass one, the field is not found in the 1000 records. </li> | ||
<li> | |||
<li> | <li>At least one field value that conforms to a date format is found in the pass being performed, except: | ||
<p class="note">A field is not a date if all dates were all numeric, there were more than 5 distinct non-date values, and fewer than 50% of the instances of the field are dates. This is an attempt to avoid situations in which a field such as zip code could be treated as a 5-digit (YYDDD) date.</p></li> | |||
< | |||
<li>The field is not a <var>Blob</var> field. Such fields are not candidates for <var>DATESTAT</var> analysis.</li> | |||
</ul> | |||
< | <p> | ||
< | |||
< | |||
The analysis keeps tracks of various totals that are used in | The analysis keeps tracks of various totals that are used in | ||
reporting, and also keeps some of the values found in the field. | reporting, and also keeps some of the values found in the field. | ||
Up to 36 date formats, 20 | Up to 36 date formats, 20 non-date values, and 120 year-samples | ||
will be kept. | will be kept. | ||
If these overflow, the least-recently-found year and non-date | |||
samples are discarded, and the most-recently-found date format | samples are discarded, and the most-recently-found date format | ||
sample is discarded. | sample is discarded. </p> | ||
<p class="note"><b>Note:</b> Since only as many as 22 characters of non-date values are kept, | |||
two different values | two different values that are the same in the first 22 characters | ||
are considered to be the same value.</p> | |||
<p>< | <p> | ||
The date formats searched for | The date formats that are searched are listed below, in somewhat arbitrary groupings for visual distinction. Each item in the list represents two formats: </p> | ||
<ul> | |||
<li>Format one is the terms that precede the square brackets.</li> | |||
<li>Format two is format one plus the immediately following square-bracketed contents.</li> | |||
</ul> | |||
<p class="code" | <p class="code">MON DAY YY[ HH:MI:SS] | ||
MON DAY YYYY | MON DAY YYYY[ HH:MI:SS] | ||
DAY MON YY | DAY MON YY[ HH:MI:SS] | ||
DAY MON YYYY | DAY MON YYYY[ HH:MI:SS] | ||
DAY MONTH YY | DAY MONTH YY[ HH:MI:SS] | ||
DAY MONTH YYYY | DAY MONTH YYYY[ HH:MI:SS] | ||
DAY MON, YY | DAY MON, YY[ HH:MI:SS] | ||
DAY MON, YYYY | DAY MON, YYYY[ HH:MI:SS] | ||
DAY MONTH, YY | DAY MONTH, YY[ HH:MI:SS] | ||
DAY MONTH, YYYY | DAY MONTH, YYYY[ HH:MI:SS] | ||
DDMMYY | |||
DDIMMIYY | DDMMYY[HHMISS] | ||
DDMMYYYY | DDIMMIYY[HHMISS] | ||
DDIMMIYYYY | DDMMYYYY[HHMISS] | ||
MONTH DAY YY | DDIMMIYYYY[HHMISS] | ||
MONTH DAY YYYY | |||
MMDDYY | MONTH DAY YY[ HH:MI:SS] | ||
MMIDDIYY | MONTH DAY YYYY[ HH:MI:SS] | ||
MMDDYYYY | MMDDYY[HHMISS] | ||
MMIDDIYYYY | MMIDDIYY[HHMISS] | ||
YYDDD | MMDDYYYY[HHMISS] | ||
YY MON DAY | MMIDDIYYYY[HHMISS] | ||
YY MONTH DAY | YYDDD[HHMISS] | ||
YYMMDD | |||
YYIMMIDD | YY MON DAY[ HH:MI:SS] | ||
YYYYDDD | YY MONTH DAY[ HH:MI:SS] | ||
YYYY MON DAY | |||
YYYY MONTH DAY | YYMMDD[HHMISS] | ||
YYYYMMDD | YYIMMIDD[HHMISS] | ||
YYYYIMMIDD | YYYYDDD[HHMISS] | ||
YYYY MON DAY[ HH:MI:SS] | |||
YYYY MONTH DAY[ HH:MI:SS] | |||
YYYYMMDD[HHMISS] | |||
YYYYIMMIDD[HHMISS] | |||
</p> | |||
For the meaning of the components of these formats, see | For the meaning of the components of these formats, see | ||
[[Fast/Unload datetime processing considerations#dtfmt|Datetime formats]]. | [[Fast/Unload datetime processing considerations#dtfmt|Datetime formats]]. | ||
< | ==<b id="dtrep"></b>DATESTAT reporting== | ||
This section describes the reports created by the <var>DATESTAT SUMMARY</var> | |||
and <var>DATESTAT DETAIL</var> statements. | |||
This section describes the reports created by the DATESTAT SUMMARY | |||
and DATESTAT DETAIL statements. | |||
Both of these statements indicate something about the "quality" of the | Both of these statements indicate something about the "quality" of the | ||
date data in the field. | date data in the field. | ||
The purposes of "date field quality" are: | <p> | ||
< | The purposes of "date field quality" are:</p> | ||
<li>If you are running DATESTAT SUMMARY, and you have some values other | <ul> | ||
<li>If you are running <var>DATESTAT SUMMARY</var>, and you have some values other | |||
than <b>pure</b>, you may want to do further investigation of the | than <b>pure</b>, you may want to do further investigation of the | ||
date fields. | date fields.</li> | ||
<li>To indicate how much work might be involved to resolve the various | <li>To indicate how much work might be involved to resolve the various | ||
values stored in the field. | values stored in the field. | ||
One tool to resolve the values is to run DATESTAT DETAIL | One tool to resolve the values is to run <var>DATESTAT DETAIL</var>. Doing some | ||
ad-hoc work with | ad-hoc work with [[SOUL]] is another approach. | ||
The "worse" the quality of a field, the more work is likely to be | The "worse" the quality of a field, the more work is likely to be | ||
required to resolve questions about the field values. | required to resolve questions about the field values.</li> | ||
</ | </ul> | ||
The quality is expressed as <b>pure</b>, | The quality is expressed as <b>pure</b>, <b>good</b>, <b>fair</b>, or <b>poor</b>. It is | ||
<b>good</b>, <b>fair</b>, or <b>poor</b> | |||
an attempt to measure the possible level of effort required to correct | an attempt to measure the possible level of effort required to correct | ||
data on the file. | data on the file. | ||
The terms have the following meanings: | The terms have the following meanings: | ||
<table> | <table class="thJustBold"> | ||
<tr><th>Pure</th><td> | <tr><th>Pure</th> | ||
<tr><th>Poor</th><td> | <td>There are only date values, and all with a single date format.</td></tr> | ||
<tr><th>Fair</th><td> | |||
<tr><th>Good</th><td> | <tr><th>Poor</th> | ||
<td>Either of these: | |||
<ul> | |||
<li>There are more than 10 distinct non-date values and more than .01% of the field occurrences are non-date values.</li> | |||
<li>The percentage of date values that have uncommon formats, times the number of uncommon formats, is greater than .01%, or there are more than 20 date formats. </li> | |||
</ul> | |||
<p> | |||
If there are more than 20 date formats, or if the ratio of occurrences of the most common date format to occurrences of the next most common date format is less than 10 to 1, then <var class="product">Fast/Unload</var> prints the string <code>Common date format not found.</code></p></td></tr> | |||
<tr><th>Fair</th> | |||
<td>Either of these: | |||
<ul> | |||
<li>There are more than 5 distinct non-date values, and more than .001% of the field occurrences are non-date values.</li> | |||
<li>The percentage of date values that have uncommon formats, times the number of uncommon formats, is greater than .001%.</li> | |||
</ul></td></tr> | |||
<tr><th>Good</th> | |||
<td>Anything else.</td></tr> | |||
</table> | </table> | ||
==DATESTAT SUMMARY== | |||
DATESTAT SUMMARY creates a report with | ===DATESTAT SUMMARY=== | ||
<var>DATESTAT SUMMARY</var> creates a report with one to three lines for each date field, in | |||
the following form: | the following form: | ||
<ol> | <ol> | ||
<li><i>ftag format</i> (<i>qual</i>) <i>span</i> | <li><i>ftag format</i> (<i>qual</i>) <i>span</i> | ||
field...<i>n</i>: <i>name</i> | field...<i>n</i>: <i>name</i> </li> | ||
<li><i>tot</i> occurrences of field <i>format count</i> | |||
<li><i>nsmp</i> occurrences of | <li><i>tot</i> occurrences of field <i>format count</i> </li> | ||
<i>sample</i> | |||
<li><i>nsmp</i> occurrences of non-date value (len <i>len</i>): | |||
<i>sample</i> </li> | |||
</ol> | </ol> | ||
<p></p> | <p> | ||
Where: </p> | |||
<ol> | <ol> | ||
<li>This line is always present; the components are: | <li>This line is always present; the components are: | ||
<ul> | <ul> | ||
<li><i>ftag</i> | <li><i>ftag</i> is: "Common date format not found," if there are more than | ||
is | 20 date formats, or it is "Common format:" otherwise. </li> | ||
20 date formats, or "Common format:" otherwise. | |||
<li><i>format</i> is the most commonly occurring date format, | <li><i>format</i> is the most commonly occurring date format, | ||
if <i>ftag</i> is "Common format:", or blank otherwise. | if <i>ftag</i> is "Common format:", or it is blank otherwise. </li> | ||
<li><i>qual</i> is | |||
<li><i>span</i> is either: | <li><i>qual</i> is "pure," "good," "fair," or "poor." </li> | ||
< | |||
<li><i>span</i> is either of these: | |||
< | <ul> | ||
<li><code>CENTSPAN: <i>YYYY</i></code></li> | |||
<li>No <i>YY</i> occurrences</li> | |||
<li><i> | |||
<li> | |||
</ul> | </ul> | ||
<li>This line is always printed if the <i>qual</i> | <p> | ||
is not <b>pure</b>. | The former gives a recommended <var>CENTSPAN</var>, | ||
if the field has any two-digit years, where <i>YYYY</i> | |||
is the oldest two-digit year found. | |||
The latter occurs if the field does not have any two-digit years. | |||
Two-digit years are interpreted using a <var>CENTSPAN</var> of <code>1900</code>, or, if running <var>DATESTAT</var> after 1999, a <var>CENTSPAN</var> of <code>-99</code>. </p></li> | |||
<li><i>n</i> is a sequential numbering of the fields; this can be correlated to | |||
an <var>FSTATS</var> report. </li> | |||
<li><i>name</i> is the name of the field. </li> | |||
</ul> </li> | |||
<li>This line is always printed if the <i>qual</i> is not <b>pure</b>. | |||
The components are: | The components are: | ||
<ul> | <ul> | ||
<li><i>tot</i> | <li><i>tot</i> is the total number of occurrences of the field in the file. </li> | ||
is the total number of occurrences of the field in the file | |||
<li><i>format count</i> | <li><i>format count</i> is: <i>n</i> <b>different formats found</b> (if | ||
is | <i>n</i> is more than 1), or it is blank otherwise. </li> | ||
<i>n</i> is more than 1, or blank otherwise. | </ul> </li> | ||
</ul> | |||
<li>This line is always printed if there are any | <li>This line is always printed if there are any non-date values for the field. | ||
The line has one of the two following forms: | The line has one of the two following forms: | ||
< | <ul> | ||
<li><i>nsmp</i> occurrences of | <li><i>nsmp</i> occurrences of non-date value (<i>len</i>): | ||
<i>truncated_value</i> <i>trunc_flag</i> | <i>truncated_value</i> <i>trunc_flag</i> </li> | ||
<li><i>dist</i> distinct | |||
</ | <li><i>dist</i> distinct non-date values </li> | ||
< | </ul> | ||
Where <i>nsmp</i> is the number of occurrences of a | <p> | ||
to be exactly one | Where <i>nsmp</i> is the number of occurrences of a non-date value, if it appears | ||
to be exactly one non-date value in the field; <i>len</i> is its length; and | |||
<i>truncated_value</i> is the first 22 characters of its value. | <i>truncated_value</i> is the first 22 characters of its value. | ||
<i> | <i>trunc_flag</i> is the first 22 bytes of its value, if the length is greater than 22. | ||
In fact, some of the non-date values in the field may differ, if their first 22 characters are | |||
In fact, some of the | the same.</p> | ||
<p> | |||
the same. | If there are two or more non-date values in the field, the second form of this line is presented, where <i>dist</i> is the number of different non-date values that Fast Unload has kept as samples (up to the maximum of 20).</p> </li> | ||
<p>< | |||
If there are | |||
second form of this line is presented, where <i>dist</i> | |||
is the number of different | |||
</ol> | </ol> | ||
<p></ | <p class="note"><b>Note:</b> For the recommended <var>CENTSPAN</var>, | ||
<var class="product">Fast/Unload</var> assumes that all two-digit years occur in the 1900s, or, if running <var>DATESTAT</var> after 1999, 99 years before | |||
<var class="product">Fast/Unload</var> assumes that all | the date of the run up to and including the date of the run. </p> | ||
DATESTAT after 1999, 99 years before | |||
the date of the run up to and including the date of the run. | ===DATESTAT DETAIL=== | ||
==DATESTAT DETAIL== | <var>DATESTAT DETAIL</var> creates a report with one page for each date field, which | ||
includes: | includes: | ||
<ol> | <ol> | ||
<li>The number of field occurrences. | <li>The number of field occurrences. </li> | ||
<li>The number of date occurrences, with the minimum and maximum date | <li>The number of date occurrences, with the minimum and maximum date | ||
value, for both | value, for both four-digit ("YYYY") and two-digit ("YY") years. </li> | ||
<li>A sample of the discovered date formats. | |||
<li>A sample of the year values occurring in the field. | <li>A sample of the discovered date formats. </li> | ||
<li>A sample of the non-date values occurring in the field. | |||
<li>A sample of the year values occurring in the field. </li> | |||
<li>A sample of the non-date values occurring in the field. </li> | |||
</ol> | </ol> | ||
< | <p> | ||
The sampling rules are described in [[ | The sampling rules are described in [[#DATESTAT sampling|DATESTAT sampling]]; some values | ||
may be discarded. | may be discarded. An asterisk (<tt>*</tt>) is printed after a sample if | ||
An asterisk (< | the sample occurrence count is incomplete.</p> | ||
the sample occurrence count is incomplete. | <blockquote class="note"> | ||
<p>< | <b>Note:</b> For the recommended <var>CENTSPAN</var> and the minimum/maximum values of the <var>DETAIL</var> report, <var class="product">Fast/Unload</var> assumes that all two-digit years occur either: | ||
Note | <ul> | ||
CENTSPAN and the | <li>In the 1900s</li> | ||
<var class="product">Fast/Unload</var> assumes that all | <li>99 years before the date of the run (up to and including the date of the run), <b><i>if</i></b> running <var>DATESTAT</var> after 1999 </li> | ||
</ul> | |||
the date of the run up to and including the date of the run | </blockquote> | ||
==See also== | ==See also== | ||
{{Template:Fast/Unload topic list}} |
Latest revision as of 22:45, 4 August 2016
DATESTAT sampling
If the DATESTAT statement is present in the FUNIN
data set,
Fast/Unload will determine which fields in the file contain date
values.
The determination of whether a field contains a date value is done in two passes:
- The first pass examines 1000 evenly-distributed records in the file.
- The second pass analyzes all records.
At the end of each pass, certain fields are retained for date analysis and reporting. A field is retained if:
- For pass one, the field is not found in the 1000 records.
- At least one field value that conforms to a date format is found in the pass being performed, except:
A field is not a date if all dates were all numeric, there were more than 5 distinct non-date values, and fewer than 50% of the instances of the field are dates. This is an attempt to avoid situations in which a field such as zip code could be treated as a 5-digit (YYDDD) date.
- The field is not a Blob field. Such fields are not candidates for DATESTAT analysis.
The analysis keeps tracks of various totals that are used in reporting, and also keeps some of the values found in the field. Up to 36 date formats, 20 non-date values, and 120 year-samples will be kept. If these overflow, the least-recently-found year and non-date samples are discarded, and the most-recently-found date format sample is discarded.
Note: Since only as many as 22 characters of non-date values are kept, two different values that are the same in the first 22 characters are considered to be the same value.
The date formats that are searched are listed below, in somewhat arbitrary groupings for visual distinction. Each item in the list represents two formats:
- Format one is the terms that precede the square brackets.
- Format two is format one plus the immediately following square-bracketed contents.
MON DAY YY[ HH:MI:SS] MON DAY YYYY[ HH:MI:SS] DAY MON YY[ HH:MI:SS] DAY MON YYYY[ HH:MI:SS] DAY MONTH YY[ HH:MI:SS] DAY MONTH YYYY[ HH:MI:SS] DAY MON, YY[ HH:MI:SS] DAY MON, YYYY[ HH:MI:SS] DAY MONTH, YY[ HH:MI:SS] DAY MONTH, YYYY[ HH:MI:SS] DDMMYY[HHMISS] DDIMMIYY[HHMISS] DDMMYYYY[HHMISS] DDIMMIYYYY[HHMISS] MONTH DAY YY[ HH:MI:SS] MONTH DAY YYYY[ HH:MI:SS] MMDDYY[HHMISS] MMIDDIYY[HHMISS] MMDDYYYY[HHMISS] MMIDDIYYYY[HHMISS] YYDDD[HHMISS] YY MON DAY[ HH:MI:SS] YY MONTH DAY[ HH:MI:SS] YYMMDD[HHMISS] YYIMMIDD[HHMISS] YYYYDDD[HHMISS] YYYY MON DAY[ HH:MI:SS] YYYY MONTH DAY[ HH:MI:SS] YYYYMMDD[HHMISS] YYYYIMMIDD[HHMISS]
For the meaning of the components of these formats, see Datetime formats.
DATESTAT reporting
This section describes the reports created by the DATESTAT SUMMARY and DATESTAT DETAIL statements. Both of these statements indicate something about the "quality" of the date data in the field.
The purposes of "date field quality" are:
- If you are running DATESTAT SUMMARY, and you have some values other than pure, you may want to do further investigation of the date fields.
- To indicate how much work might be involved to resolve the various values stored in the field. One tool to resolve the values is to run DATESTAT DETAIL. Doing some ad-hoc work with SOUL is another approach. The "worse" the quality of a field, the more work is likely to be required to resolve questions about the field values.
The quality is expressed as pure, good, fair, or poor. It is an attempt to measure the possible level of effort required to correct data on the file. The terms have the following meanings:
Pure | There are only date values, and all with a single date format. |
---|---|
Poor | Either of these:
If there are more than 20 date formats, or if the ratio of occurrences of the most common date format to occurrences of the next most common date format is less than 10 to 1, then Fast/Unload prints the string |
Fair | Either of these:
|
Good | Anything else. |
DATESTAT SUMMARY
DATESTAT SUMMARY creates a report with one to three lines for each date field, in the following form:
- ftag format (qual) span field...n: name
- tot occurrences of field format count
- nsmp occurrences of non-date value (len len): sample
Where:
- This line is always present; the components are:
- ftag is: "Common date format not found," if there are more than 20 date formats, or it is "Common format:" otherwise.
- format is the most commonly occurring date format, if ftag is "Common format:", or it is blank otherwise.
- qual is "pure," "good," "fair," or "poor."
- span is either of these:
CENTSPAN: YYYY
- No YY occurrences
The former gives a recommended CENTSPAN, if the field has any two-digit years, where YYYY is the oldest two-digit year found. The latter occurs if the field does not have any two-digit years. Two-digit years are interpreted using a CENTSPAN of
1900
, or, if running DATESTAT after 1999, a CENTSPAN of-99
. - n is a sequential numbering of the fields; this can be correlated to an FSTATS report.
- name is the name of the field.
- This line is always printed if the qual is not pure.
The components are:
- tot is the total number of occurrences of the field in the file.
- format count is: n different formats found (if n is more than 1), or it is blank otherwise.
- This line is always printed if there are any non-date values for the field.
The line has one of the two following forms:
- nsmp occurrences of non-date value (len): truncated_value trunc_flag
- dist distinct non-date values
Where nsmp is the number of occurrences of a non-date value, if it appears to be exactly one non-date value in the field; len is its length; and truncated_value is the first 22 characters of its value. trunc_flag is the first 22 bytes of its value, if the length is greater than 22. In fact, some of the non-date values in the field may differ, if their first 22 characters are the same.
If there are two or more non-date values in the field, the second form of this line is presented, where dist is the number of different non-date values that Fast Unload has kept as samples (up to the maximum of 20).
Note: For the recommended CENTSPAN, Fast/Unload assumes that all two-digit years occur in the 1900s, or, if running DATESTAT after 1999, 99 years before the date of the run up to and including the date of the run.
DATESTAT DETAIL
DATESTAT DETAIL creates a report with one page for each date field, which includes:
- The number of field occurrences.
- The number of date occurrences, with the minimum and maximum date value, for both four-digit ("YYYY") and two-digit ("YY") years.
- A sample of the discovered date formats.
- A sample of the year values occurring in the field.
- A sample of the non-date values occurring in the field.
The sampling rules are described in DATESTAT sampling; some values may be discarded. An asterisk (*) is printed after a sample if the sample occurrence count is incomplete.
Note: For the recommended CENTSPAN and the minimum/maximum values of the DETAIL report, Fast/Unload assumes that all two-digit years occur either:
- In the 1900s
- 99 years before the date of the run (up to and including the date of the run), if running DATESTAT after 1999
See also
- Fast/Unload overview
- Fast/Unload invocation
- Fast/Unload program parameters
- Fast/Unload Extraction Language (FUEL)
- Fast/Unload standard #functions
- Fast/Unload BLOB/CLOB processing considerations
- Fast/Unload datetime processing considerations
- Fast/Unload DATESTAT analysis
- Fast/Unload job statistics
- Fast/Unload SOUL Interface
- Fast/Unload with an external sort package
- Fast/Unload with DBCS data
- Fast/Unload customer-written assembler #function packages
- Fast/Unload user exits or filters
- Fast/Unload with Model 204 file groups
- Fast/Unload with Model 204 fieldgroups
- Fast/Unload with the Sir2000 Field Migration Facility
- Fast/Unload floating point arithmetic and numeric conversion
- Fast/Unload program return codes
- Fast/Unload installation
- Fast/Unload customization of defaults
- Fast/Unload SMF record format
- Fast/Unload release notes
- Fast/Unload messages