Configuration and customization
Configuration values
Parameter store
Network config
These values are used during transfer by the validation pipeline, upload init
lambda, and upload client. They must be defined as key-value pairs in a JSON
object in the AWS Parameter Store parameter given by
mast_transfer_tools.config.NETWORK_CONFIG_PARAMETER. The expected format of
the decoded dict is defined in mast_transfer_tools.types.PipelineNetworkConfig.
Values are:
- AVAILABILITY_ZONE_ID: Short AWS AZ ID for directory buckets, e.g. 'use1-az4'
- BUCKET_STEM: 'namespace' prefix for bucket names, e.g. 'my-fast-deployment-buckets-99'
- CONFIG_BUCKET: name of shared config bucket, e.g. 'fast-config-bucket-99999'
- INIT_LAMBDA_ARN: ARN of init lambda, e.g. 'arn:aws:lambda:us-east-1:123456789012:function:fast-upload-init-lambda'
- LOCK_STALENESS_THRESHOLD: Time, in seconds, after which a non-updated lock file will be treated as 'stale', e.g. 3600
- TASK_CONFIG_PREFIX: prefix in config bucket under which task configurations are stored, e.g. '/tasks'
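As a rough illustration of the expected parameter contents, the sketch below decodes a JSON string and checks that it carries the six keys listed above. The function name parse_network_config and the expected Python types are assumptions inferred from the examples in this section; the authoritative definition is mast_transfer_tools.types.PipelineNetworkConfig, and fetching the string from Parameter Store is left out.

```python
import json

# Keys documented for the network config parameter. The Python types are
# assumptions inferred from the example values, not taken from the library.
EXPECTED_KEYS = {
    "AVAILABILITY_ZONE_ID": str,
    "BUCKET_STEM": str,
    "CONFIG_BUCKET": str,
    "INIT_LAMBDA_ARN": str,
    "LOCK_STALENESS_THRESHOLD": (int, float),
    "TASK_CONFIG_PREFIX": str,
}


def parse_network_config(raw: str) -> dict:
    """Decode a JSON parameter value and check it has the documented keys."""
    config = json.loads(raw)
    missing = sorted(set(EXPECTED_KEYS) - set(config))
    if missing:
        raise ValueError(f"missing network config keys: {missing}")
    for key, expected_type in EXPECTED_KEYS.items():
        if not isinstance(config[key], expected_type):
            raise TypeError(f"{key} has unexpected type {type(config[key]).__name__}")
    return config


# Example parameter value built from the sample values in this section.
example_value = json.dumps(
    {
        "AVAILABILITY_ZONE_ID": "use1-az4",
        "BUCKET_STEM": "my-fast-deployment-buckets-99",
        "CONFIG_BUCKET": "fast-config-bucket-99999",
        "INIT_LAMBDA_ARN": "arn:aws:lambda:us-east-1:123456789012:function:fast-upload-init-lambda",
        "LOCK_STALENESS_THRESHOLD": 3600,
        "TASK_CONFIG_PREFIX": "/tasks",
    }
)
network_config = parse_network_config(example_value)
```

Running a check like this before writing the parameter can catch a misspelled or missing key before the pipeline tries to read it.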
Resource tags
These values are used only by the upload init lambda to assign tags to the
validation pipeline task. They must be defined as key-value pairs in a JSON
object in the AWS Parameter Store parameter given by
mast_transfer_tools.config.RESOURCE_TAG_PARAMETER. They must be valid
key-value pairs for AWS resource tags.
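A minimal pre-flight check for a resource tag parameter might look like the sketch below. It assumes the general AWS tagging limits (keys up to 128 characters, values up to 256 characters, and the reserved "aws:" key prefix); individual services may apply stricter rules, and the function name tag_problems is hypothetical.

```python
import json

# General AWS tag limits, stated here as assumptions; services may be stricter.
MAX_KEY_LENGTH = 128
MAX_VALUE_LENGTH = 256


def tag_problems(raw: str) -> list[str]:
    """Return human-readable problems found in a JSON object of resource tags."""
    tags = json.loads(raw)
    if not isinstance(tags, dict):
        return ["tag parameter must decode to a JSON object"]
    problems = []
    for key, value in tags.items():
        if not isinstance(value, str):
            problems.append(f"{key!r}: value must be a string")
            continue
        if not 1 <= len(key) <= MAX_KEY_LENGTH:
            problems.append(f"{key!r}: key length must be 1-{MAX_KEY_LENGTH}")
        if len(value) > MAX_VALUE_LENGTH:
            problems.append(f"{key!r}: value longer than {MAX_VALUE_LENGTH}")
        if key.lower().startswith("aws:"):
            problems.append(f"{key!r}: the 'aws:' prefix is reserved")
    return problems


# Hypothetical tag set; an empty list means no problems were found.
good_tags = json.dumps({"project": "fast", "environment": "test"})
assert tag_problems(good_tags) == []
```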
In-library configuration
These values are defined in mast_transfer_tools.config.
- LABEL_PREFIX: prefix in config bucket under which "official" versions of labels may be found
- MAX_TRANSFER_FAILURES: Total number of failures possible during transfer before upload client shuts itself down
- NETWORK_CONFIG_PARAMETER: AWS Parameter Store URL for network config values (see above)
- RESOURCE_TAG_PARAMETER: AWS Parameter Store URL for task resource tags (see above)
- VAL_PIPE_SETTINGS: A dict giving runtime thresholds for the validation pipeline. Values are:
  - keepalive_threshold (float): number of seconds after which the pipeline will, if it has written no other messages to the log object, write a keepalive message
  - n_val_threads (int): number of threads used by the validation server
  - transfer_timeout (float): number of seconds after which, if no new messages have been written by the upload client, the validation pipeline will shut down
  - missing_timeout (float): number of seconds after which, if no new messages have been written by the upload client, the validation pipeline will log a missing event but not yet shut down
  - loop_rate (float): seconds to delay between iterations of primary update loop
- COGCONFIG: a mast_transfer_tools.types.CognitoConfiguration object giving domain, client_id, redirect_uri, region, user_pool_id, and identity_pool_id for upload client Cognito transactions.
- LAMBDA_CLIENT_CONFIG: a botocore.config.Config object used for upload client Lambda calls. Legal values are any legal values for Config, but high read timeout (for awaiting pipeline launch) and restricting max attempts to 1 (to prevent spurious multiple invocations) are strongly recommended.
- VAL_PIPE_SQS_QUEUE_URL: URL for SQS queue used by the validation pipeline to send reports.
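To make the shape of VAL_PIPE_SETTINGS concrete, here is a sketch of such a dict together with a small sanity check. The numeric values are illustrative only and are not the library's defaults. The ordering check (missing_timeout below transfer_timeout) is an inference from the descriptions above, since a missing event is meant to be logged before shutdown; settings_problems is a hypothetical helper.

```python
# Illustrative values only; these are NOT the library's defaults.
VAL_PIPE_SETTINGS = {
    "keepalive_threshold": 60.0,   # seconds of log silence before a keepalive
    "n_val_threads": 4,            # validation server threads
    "transfer_timeout": 1800.0,    # seconds of client silence before shutdown
    "missing_timeout": 600.0,      # seconds of client silence before a 'missing' event
    "loop_rate": 0.5,              # seconds between update-loop iterations
}


def settings_problems(settings: dict) -> list[str]:
    """Return a list of problems found in a VAL_PIPE_SETTINGS-style dict."""
    expected = ("keepalive_threshold", "n_val_threads", "transfer_timeout",
                "missing_timeout", "loop_rate")
    problems = [f"missing key: {key}" for key in expected if key not in settings]
    if problems:
        return problems
    if not isinstance(settings["n_val_threads"], int) or settings["n_val_threads"] < 1:
        problems.append("n_val_threads must be a positive integer")
    for key in expected:
        if key != "n_val_threads" and settings[key] <= 0:
            problems.append(f"{key} must be positive")
    # Assumption: the 'missing' event should fire before the shutdown timeout.
    if settings["missing_timeout"] >= settings["transfer_timeout"]:
        problems.append("missing_timeout should be less than transfer_timeout")
    return problems


assert settings_problems(VAL_PIPE_SETTINGS) == []
```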
Validation pipeline configuration
The validation pipeline is intended to run as an ECS task on Fargate. The
task and the container it runs are "black boxes" from the perspective of the
library as a whole, but the default assumption is that the container will
include the fast-upload library and all its dependencies, and will run a
server.core.ValidationSession on launch.
Task customization
Task definitions, clusters, VPCs, and security groups must be created at AWS
level. However, which of these to use may be specified at runtime via
YAML-formatted text objects in the configuration bucket under TASK_CONFIG_PREFIX.
A default task definition ("default-task-config.yaml") must always exist.
Dataset-specific task definitions ("$DATASET-task-config.yaml") may also be defined.
Values in those files override values in default-task-config.yaml. (Delivery-specific
task definitions are not supported.)
Valid parameters in task configuration objects are:
- cluster: short name or full ARN of cluster to run task on
- family: family / revision or full ARN of task definition
- subnet_id: VPC subnet id
- sg_id: Security group id
Users are most likely to want to override family. An alternate task definition
could, for instance, specify a different container (perhaps including modules
supporting custom script hooks) or additional memory for the task (for data-level
validation on datasets with very large files).
Building and customizing the validation container
The library includes a Dockerfile (Dockerfile.valpipe) describing a minimal configuration for a working validation pipeline container, along with minimal entrypoint (valpipe_entrypoint.sh) and handler (pipe_entry.py) scripts for the validation pipeline itself.
For some datasets, it may be desirable to add additional software to the container, modify the entrypoint script, etc. Example use cases include adding libraries for a dataset's ASDF schema or modules for custom script hooks.
Specific build steps depend on the task definition, but a typical command-line workflow from repository root might look like:
docker build -t aws-mast-fast-valpipe:latest -f Dockerfile.valpipe .
docker tag aws-mast-fast-valpipe:latest 999999999999.dkr.ecr.us-east-1.amazonaws.com/aws-mast-fast-valpipe:latest
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 999999999999.dkr.ecr.us-east-1.amazonaws.com
docker push 999999999999.dkr.ecr.us-east-1.amazonaws.com/aws-mast-fast-valpipe:latest
Here, aws-mast-fast-valpipe is the associated ECR repository name, 999999999999 is the ID
of the AWS account that owns the ECR repository, and us-east-1 is the region in which the ECR repository lives.
Customizing data-level validation
By default, the validator performs data-level validation on files whose standards support
data validation (ASDF, FITS, and Parquet). This always includes simple validation of
conformance to standard. If objects is defined and any of the following properties are
given for a DataObject, the validator also checks that property (note that not all properties
are legal/relevant for all standards and object types; see the label schema definition for a full description):
- name
- objtype
- ndim
- dtype
- value
- schema
- metadata
Adding custom check hooks
Custom validation behaviors may be added without modifying core library code by defining
the object_check_hook parameter of a filetype. The value of object_check_hook should
be the fully-qualified name of a Python module containing a check_file() function.
check_file() is expected to have the signature
(data: pyarrow.parquet.ParquetFile | asdf.AsdfFile | astropy.io.fits.HDUList, spec: mast_transfer_tools.labels.Filetype) -> failures: dict.
The specific type of data is determined by the file standard of the filetype. failures should be empty if the
custom check passes, and should contain one or more key/value pairs describing failures if not. The specific
format of failures is up to the checker, but all keys and values should be transparently YAML-serializable,
and string/string pairs are recommended. They will be included in failure messages under the key
hook:HOOK_MODULE_NAME.
If the module specified in object_check_hook cannot be imported in the validation environment,
or if it does not have a check_file() function, the validator will treat it as an error. Note that
in order for local and remote validation to behave identically, users must ensure that data providers
and the validation pipeline are using the same version of all such modules.
This check is in addition to the basic standard check and any checks triggered by properties defined in objects.
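A hook module following the contract above might look like the sketch below. It assumes a filetype whose standard is FITS, so that data arrives as an HDUList; the module name, the REQUIRED_KEYWORDS list, and the failure key are all hypothetical. The spec argument is accepted but unused here.

```python
"""Hypothetical hook module, e.g. importable as my_hooks.fits_keywords."""

# Hypothetical policy: header keywords the primary HDU must carry.
REQUIRED_KEYWORDS = ("TELESCOP", "INSTRUME")


def check_file(data, spec) -> dict:
    """Return an empty dict on success, or string/string pairs describing failures.

    `data` is assumed to be an HDUList (FITS filetype); `spec` is the Filetype.
    """
    failures = {}
    primary_header = data[0].header
    missing = [key for key in REQUIRED_KEYWORDS if key not in primary_header]
    if missing:
        # Plain strings keep the result transparently YAML-serializable.
        failures["missing_keywords"] = ", ".join(missing)
    return failures
```

The filetype's object_check_hook would then be set to the fully-qualified module name (my_hooks.fits_keywords in this hypothetical), and the module must be importable both locally and in the validation container.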
Skipping validation steps
It is possible to skip various parts of data-level validation by populating the skip field of a
filetype's validation_options. skip is a list of strings. Meaningful values are:
- names of individual built-in property checks (name, objtype, ndim, dtype, value, schema)
- 'standard' skips basic standard validation; however, this is only meaningful if no other checks that require data-level validation are performed. This is because the validator must be able to interpret the file as a valid representative of its standard to perform any other checks.
- 'hook' skips the custom object_check_hook
- 'all' skips all data-level validation
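The interaction between these values can be summarized with a small model. This is only an illustration of the rules listed above, not the validator's actual implementation; planned_checks and its arguments are hypothetical.

```python
# Built-in property checks that may be named individually in skip.
PROPERTY_CHECKS = ("name", "objtype", "ndim", "dtype", "value", "schema")


def planned_checks(skip, defined_properties, has_hook=False):
    """Model which data-level checks would run for a given skip list.

    skip: the list of strings from validation_options
    defined_properties: properties given for the filetype's DataObjects
    has_hook: whether object_check_hook is defined for the filetype
    """
    skip = set(skip)
    if "all" in skip:
        return []
    checks = [
        prop for prop in PROPERTY_CHECKS
        if prop in defined_properties and prop not in skip
    ]
    if has_hook and "hook" not in skip:
        checks.append("hook")
    # Skipping 'standard' only takes effect when no other check needs the
    # file to be readable as a valid member of its standard.
    if checks or "standard" not in skip:
        checks.insert(0, "standard")
    return checks


# 'standard' is still performed because the ndim check depends on it.
assert planned_checks(["standard"], ["ndim"]) == ["standard", "ndim"]
# With nothing else to check, skipping 'standard' skips data-level validation.
assert planned_checks(["standard"], []) == []
```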