Skip to content

Label schema

This document describes the label schema implemented in mast_transfer_tools.labels and supporting code.

Mismatches between the implementation and this document should be considered bugs.

keyword description YAML type required?
dataset identifier for dataset string yes
delivery_id identifier for delivery within dataset string | integer yes
time information on temporal boundaries of delivery mapping yes
time/observation_start_date start date of observations date (normalized to YYYY-MM-DD for string comparison; time of day not allowed) no
time/observation_end_date end date of observations date (normalized to YYYY-MM-DD for string comparison; time of day not allowed) no
time/delivery_start_date date on which overall delivery process was initiated (web form interaction, not file upload) date (normalized to YYYY-MM-DD for string comparison; time of day not allowed) yes
contacts contact information for associated personnel mapping no (although it will make automated notifications more difficult)
contacts/provider email address for provider-side point-of-contact sequence of strings (valid emails) no
contacts/archive email address for archive-side point-of-contact sequence of strings (valid emails) no
filetypes structure for defining filetypes mapping no (although we’ll be doing very little validation if no filetypes are defined)
filetypes/$NAME individual filetype definition mapping at least one
filetypes/$NAME/standard File format. Data-level validation is currently only supported for ASDF, FITS, and Parquet files. This field is recommended for all files, however. For ASDF / FITS / Parquet files, this field should contain, respectively, “asdf”, “fits”, or “parquet”. For other file formats, this field may contain either a file extension; ordinary-language format name / description; or MIME type. Specificity is recommended (e.g., “pdf” clearly denotes a Adobe Acrobat file, but “dat” does not clearly denote any particular file format). This field is case-insensitive. Note: the validator will perform no data-level validation if this field is not defined. string required when objects is defined; otherwise recommended.
filetypes/$NAME/filename Regex pattern(s) which indicate what files are covered by this filetype. Patterns are matched against pathnames relative to the top-level directory of the file set (normally this is the directory containing the label file) and must match the entire pathname. Regex patterns that begin with (?!) (which normally would make the entire pattern never match) are interpreted as exclusions, e.g. [‘.*\.fits’, ‘(?!).*/not-this-one\.fits’] means this filetype covers all files whose name ends with .fits except for not-this-one.fits. string (valid Python-style regex) or list of strings yes
filetypes/$NAME/ignore If true, files matching the ‘filename’ pattern(s) are ignored; they are not processed for validation and they are not uploaded either. When true, standard, objects, and validation_options should all be omitted. boolean no (default false)
filetypes/$NAME/objects data objects in file we would like to perform validation on, or at least know about. sequence of mappings as described below no (but if you want to do anything but basic format validation, you need this)
filetypes/$NAME/objects/$ELEMENT structure giving information on an individual data object. For FITS files, this represents an HDU; for ASDF files, it represents a node; for Parquet files, it represents the (single) table. mapping For FITS, must contain definitions for all HDUs in physical order. For ASDF, may contain definitions for any non-empty subset of nodes. For Parquet, must contain exactly one object, representing the table as a whole.
filetypes/$NAME/objects/$ELEMENT/objtype identifier for object type. In FITS files, this shall be “primary” for the primary HDU, and the XTENSION of most extension HDUs: “bintable” or “image”. Bintable extension HDUs containing tile-compressed images, however, shall be specified as “compimage”, as their logical representation differs substantially from a standard binary table. ASCII table extensions are not currently supported. For ASDF, this value shall be the type name of the in-memory Python representation of the node. It may be either (1) a fully-qualified type name (e.g. numpy.ndarray) (2) the simple name of the type (e.g. ndarray), or (3) a recognized alias (e.g. “ndarray” for asdf.tags.core.NDArrayType). Note that using the simple name of a type can result in ambiguity, notably between pyarrow.lib.Table and astropy.table.table.Table. This field is case-insensitive. For Parquet, this field may only have the value “table” (but need not be defined). string yes, for FITS and ASDF; optional for Parquet
filetypes/$NAME/objects/$ELEMENT/name object name. For FITS HDUs, it may be defined as either the EXTNAME of the HDU (case-insensitive), or “primary” for the primary HDU. For non-primary HDUs with no EXTNAME, it shall not be defined. For ASDF nodes, it shall be defined either as (1) the node name or (2) the fully-qualified path to the node expressed as a sequence of strings, integers, or booleans. Unless ‘repeated’ is true, the node must either (1) be unique within the file or (2) be at the top level of the tree. Note that ASDF node paths may legally use boolean keys (true/false). These are permitted for fully-qualified paths. This field may be defined for objects in Parquet files but shall be ignored by validators. string for FITS files; string integer for ASDF files; or for fully-qualified ASDF paths, sequence of string
filetypes/$NAME/objects/$ELEMENT/name_regex interpret ‘name’ field (or each element of ‘name’, for fully-qualified ASDF paths) as regex? Note: does not permit wildcard numerical indexing on nested lists of ASDF nodes. boolean no (default false)
filetypes/$NAME/objects/$ELEMENT/repeated may an arbitrary number of similarly-formatted objects matching ‘name’ be present in the file? (must be contiguous, if FITS). Requires ‘name’ to be defined, and implies name_regex \= true. NOTE: ‘extra’ undefined objects are normally valid in ASDF files (to avoid the need to express the entire schema in the label). However, in order to avoid ambiguity, if an object is defined and its name specification might match multiple similar items in a valid example of this filetype, ‘repeated’ must be true. boolean no (default false)
filetypes/$NAME/objects/$ELEMENT/optional is it ok if this object isn’t present in a particular file? boolean no (default false)
filetypes/$NAME/objects/$ELEMENT/dtype scalar or array data type string (enum of valid data type codes) no (allowed only for scalars and arrays)
filetypes/$NAME/objects/$ELEMENT/ndim number of dimensions of array integer no (allowed only for arrays)
filetypes/$NAME/objects/$ELEMENT/value If the ‘value_regex’ field of this YAML mapping is absent or false, this gives the value of the in-memory Python object (under standard Python equality) after interpreting this field as a Python literal. If the ‘value_regex’ field of this YAML mapping is true, a regular expression to be matched against the string representation of the in-memory Python object. Only valid for nodes of an ASDF tree that are of simple Python types, as implied by the valid YAML types of this field. string | integer | float | boolean | (sequence of string | integer | boolean | float) | (mapping of string | integer | boolean | null to string | integer | float | boolean | null) if ‘value_regex’ is false; string if ‘value_regex’ is true no (allowed only for ASDF)
filetypes/$NAME/objects/$ELEMENT/value_regex Should ‘value’ as described above be interpreted as a regex to be matched against the string representation of this object? boolean no (default false)
filetypes/$NAME/objects/$ELEMENT/metadata Constraints on object metadata. For Parquet tables, elements of this mapping refer to user-defined key-value pairs in the file footer. For FITS HDUs, they refer to records in the HDU’s header (which must have unique keywords). This field is not supported for ASDF nodes (most ASDF metadata are distinct nodes, so should be defined as distinct data objects). mapping whose keys are metadata keys/keywords and whose values are mappings as described below no (and, if nonempty, allowed only for FITS and Parquet)
filetypes/$NAME/objects/$ELEMENT/metadata/$KEY mapping describing value associated with $KEY, as defined below mapping no
filetypes/$NAME/objects/$ELEMENT/metadata/$KEY/value Similar to filetypes/$NAME/objects/$ELEMENT/value, but supports a smaller set of types and refers to metadata fields rather than ASDF nodes. Notes: 1. Unlike FITS header values, Parquet user-defined metadata values have no type system. As such, if value_regex is False, validators shall attempt to coerce them to the Python type corresponding to the YAML type of this field (with special coercion rules for boolean/bool and complex types as described below). If coercion fails, the comparison will also fail. 2. If value_regex \= false, validators shall interpret the FITS integer, complex integer, complex floating-point, floating-point, logical, and character string types as corresponding Python types (respectively: int, complex, complex, float, bool, str). Validators shall treat equality comparison as invalid if the value of this field does not have the corresponding YAML type (see below for special rules for complex). 3. Validators shall coerce YAML datetimes to strs containing their full ISO-8601 representation. Label writers may prevent this by strictly specifying values as YAML strings (wrapping them in single or double quotes). Validators shall not provide special coercion/reformatting for the FITS datetime pseudo-type. 4. YAML has no complex number type. When validating FITS files, validators shall coerce YAML strings of the format (n+mj) or (n-mj), e.g. “(1+2j)” to complex. Validators shall not coerce Parquet metadata values to complex. Label writers who wish to perform direct equality comparison to Parquet “complex” metadata values should use string comparison to whatever format is used in the file type. 5. If this field is boolean, validators shall interpret the following (case-insensitive) Parquet metadata values as True/False: ‘true’, ‘t’, ‘1’, ‘yes’, ‘on’ / ‘false’, ‘f’, ‘0’, ‘no’, ‘off’. string | integer | float | boolean; or, if value_regex is true, string no (but if not defined, just checks for presence of the metadata key)
filetypes/$NAME/objects/$ELEMENT/metadata/$KEY/value_regex Same as filetypes/$NAME/objects/$ELEMENT/value_regex, but for metadata fields rather than ASDF nodes boolean no (default false)
filetypes/$NAME/objects/$ELEMENT/metadata/$KEY/index For keywords that appear multiple times in a FITS header, specifies which of these (0-indexed) to check. Validation will fail if a keyword is present multiple times and this value is not defined. Not permitted for Parquet (per Parquet standard, user-defined metadata keys must be unique). integer no (and permitted only for FITS)
filetypes/$NAME/objects/$ELEMENT/schema Schema definition – data types and field / column names – for table or struct-like objects. If this field is defined, all fields / columns of the object must be described. Note that this should not be used for FITS BINTABLE extensions containing compressed images. Use object-level “ndim” and “dtype” values corresponding to their decompressed representation instead. sequence of mappings no (if nonempty, allowed only for tables or struct-like objects)
filetypes/$NAME/objects/$ELEMENT/schema/$ELEMENT extended column description mapping no
filetypes/$NAME/objects/$ELEMENT/schema/$ELEMENT/name name of column (optionally as regex) string yes
filetypes/$NAME/objects/$ELEMENT/schema/$ELEMENT/dtype data type of column string (enum of valid data type codes) yes
filetypes/$NAME/objects/$ELEMENT/schema/$ELEMENT/name_regex treat this column’s name field as regex? boolean no (default false)
filetypes/$NAME/objects/$ELEMENT/schema/$ELEMENT/repeated may an arbitrary number of columns matching this specification be present in the schema? (must be contiguous). implies name_regex \= true boolean no (default false)
filetypes/$NAME/objects/$ELEMENT/schema/$ELEMENT/ndim dimensionality of column (for things like embedded arrays in FITS binary table HDUs). 0 means each row contains a scalar, 1 means each row contains a 1D array, etc. Note: ndim > 0 is not supported for list types in Parquet files. use dtype ‘O’ and ndim 0. integer no (default 0)
filetypes/$NAME/validation_options structure giving filetype-specific validation instructions mapping no
filetypes/$NAME/validation_options/skip specific validation steps to skip sequence of strings; elements must be names of validation steps known to the pipeline. [‘all’] will skip all (meta)data-level validation for this filetype. no (default empty sequence, i.e. do everything we can do with the information we have)
filetypes/$NAME/validation_options/object_check_hook Name of module containing a custom check_file() function for additional file validation. string no
delivery_meta structure giving metadata about the validator and the label itself, including global processing directives mapping yes
delivery_meta/schema_version version of schema under which this label was created string (semver) yes
delivery_meta/global_validation_options processing directives applicable to all filetypes mapping no
delivery_meta/global_validation_options/skip like filetypes/$NAME/validation_options/skip, but for all filetypes sequence of strings (if it contains ‘all’, skips all (meta)data-level validation) no (default empty list, i.e. do everything we can do with the information we have)
delivery_meta/global_validation_options/missing_filetypes_ok If false (default), a validator shall halt and report an error condition if the provider-uploaded index does not contain at least one example of each filetype defined in the label. boolean no (default false)
delivery_meta/global_validation_options/no_assigned_filetype_ok If false (default), a validator shall halt and report an error condition if a provider-uploaded index contains one or more files that do not match any filetype defined in the label. boolean no (default false)
custom_metadata This is a special area for user-defined metadata. All entries in this mapping are legal as long as they are valid YAML. mapping no

Additional notes

Python Types

Unless otherwise specified in a particular context, validators shall interpret YAML types as the following Python types:

  • date: datetime.date
  • when compared to str, convert the date to YYYY-MM-DD form (ISO 8601 “extended” date representation) first
  • integer: int
  • float: float
  • string: str
  • except: if compared to complex and in the format (n+mj) or (n-mj) (e.g. (1+2j)), shall be cast to complex
  • null: NoneType (i.e., the singleton None)
  • boolean: bool
  • sequence: list
  • mapping: dict

Unsupported Features

  • nonstandard types
  • unordered sets
  • complex mappings

Data Type Codes for Arraylike and Tablelike Objects

Note: not all codes permissible according to this label schema are valid according to all supported externally-defined standards. It is recommended but not required that label validators attempt to identify standards-invalid data types and treat labels that specify them as invalid. This is because labels containing such codes describe files that cannot legally exist, so data-level validation against such labels will always fail. Some (non-exhaustive) examples are given below.

  • f8: 8-byte floating-point / “double-precision”
  • f4: 4-byte floating-point / “single-precision”
  • f2: 2-byte floating-point / “half-float”
  • not supported by the FITS standard
  • i8: 8-byte signed integer
  • i4: 4-byte signed integer
  • i2: 2-byte signed integer
  • i1: 1-byte signed integer / “signed byte”
  • not supported by the FITS standard
  • u8: 8-byte unsigned integer
  • u4: 4-byte unsigned integer
  • u2: 2-byte unsigned integer
  • u1: 1-byte unsigned integer / “unsigned byte”
  • c16: 16-byte complex
  • c8: 8-byte complex
  • Vn: catchall for n-byte fixed-width binary data. Must specify ‘n’, e.g. ‘V5’ for a 5-byte field. Includes all fixed-width types not interpretable as another specified type, and, specifically:
  • FITS table character array fields (‘A’ TFORM)
  • Parquet FIXED_LEN_BYTE_ARRAY
  • numpy fixed-width string and (scalar) void types
  • b1: boolean / logical
  • supported by the FITS standard only in tables
  • O: catchall for variable-length data or pointer to variable-length data, including:
  • serialized or autogenerated Python objects of non-fixed-width types
  • Parquet BYTE_ARRAY
  • variable-length array descriptors in FITS binary tables (‘P’ and ‘Q’ TFORMs)
  • M8[precision]: Specialized timestamp types, including Parquet datetime logical types and serialized numpy/pandas datetime types.
  • presently we only support 64-bit timestamps. Timestamps require a precision tag: Y(ears), M(onths), W(eeks), D(ays), h(ours), m(inutes), s(econds), or s(econds) with a standard SI fractional scale prefix: ms (milliseconds), us (microseconds), ns, ps, fs, as. Any of these can have a numeric prefix; for example, if the actual clock granularity of some data recorder was 250 microseconds, that could be put in an 'M8[250us]' numpy array.
  • This does not include things like character arrays used to hold ISO timestamp strings. Use O or Vn as appropriate for those types.
  • m8[precision]: Specialized timedelta types.
  • Notes on timestamp precision tags also apply to timedeltas.

Unsupported Type Descriptions

The schema permits the following data types, but does not support specific description of them:

  • Parquet logical types other than those mentioned above. These include lists, dictionaries, structs, durations, variable-length strings, embedded JSON documents, geography, decimals, UUIDs, maps, and many more. Fields containing such logical types shall use the code corresponding to the logical type’s physical type:
  • BYTE_ARRAY: O
  • FIXED_LEN_BYTE_ARRAY: Vn (where n is the fixed length)
  • FLOAT: f4
  • DOUBLE: f8
  • INT32: i4
  • INT64: i8
  • BOOLEAN: b1
  • Python types without fixed byte widths, including int, float, list, str, datetime.datetime, pandas Categorical and nillable integers, etc. Fields of such types shall be described as ‘O’.
  • Bit fields in FITS binary tables (TFORM ‘X’). Fields of such types shall be described as ‘Vn’, where n is the total number of bytes in the bit field (which must be an integer per FITS standard).

The schema also does not support explicit per-column description of the following type-adjacent features (although label-writers may, when they appear in supported metadata, specify them as metadata values if desired):

  • physical units (e.g. as properties of Columns of Astropy Tables)
  • byte order. LSB and MSB realizations of data types shall be treated as identical by validators, and assumed to follow the byte order required by the externally-defined standard where relevant (e.g. MSB for FITS).
  • display format (e.g. C format specifiers embedded in metadata or the FITS TDISPn keyword).

Unsupported Data Formats

  • All floating-point data types must be compliant with IEEE 754. Legacy floating-point formats (e.g. VAX, IBM) are not supported.
  • All signed integer data types must be implemented in standard two’s-complement fashion. Other styles of signed integer are not supported.
  • Parquet table columns of the deprecated INT96 physical type are supported only when assigned a datetime logical type.
  • FITS text table HDUs are not supported.