State
FileIndex
module-attribute
FileIndex: TypeAlias = pd.DataFrame
Alias for a DataFrame created from a file index and a dataset label. It has the following columns:
- path: path to file from bucket / tree root.
- filename: name of the file
- type: file type (if any) as inferred from filename, based on definitions in label
- state: general state of the file; legal values are the members of FileState.
- n_fail: number of failed upload attempts as reported by client
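As a rough illustration of the column layout listed above, the following sketch builds a small frame by hand. The rows and the `type` values are invented for the example; in practice the frame comes from a file index and a dataset label, not from literals.

```python
import pandas as pd

# Hypothetical rows; only the column names are taken from the list above.
index = pd.DataFrame(
    {
        "path": ["raw/a.tsv", "raw/b.tsv"],
        "filename": ["a.tsv", "b.tsv"],
        "type": ["tsv", "tsv"],
        "state": ["waiting", "queued"],
        "n_fail": [0, 1],
    }
)

# Select the paths of files that are in the bucket and pending validation.
queued_paths = index.loc[index["state"] == "queued", "path"].tolist()
```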
FileState
module-attribute
FileState = Literal['waiting', 'validated', 'queued', 'invalid', 'error', 'missing']
Codes for validation states of a file.
Meanings are:
- waiting: Not in bucket, client does not claim to have uploaded it
- validated: In bucket, passed validation
- queued: In bucket, pending validation
- invalid: In bucket, failed validation
- missing: Not in bucket, client claims to have uploaded it
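The meanings above can be read as a small decision table. The sketch below encodes it in a hypothetical helper, `classify_file`, which is not part of this module; its three inputs are assumptions made for illustration. The 'error' code is left out because the list above does not describe it.

```python
from typing import Literal, Optional

FileState = Literal["waiting", "validated", "queued", "invalid", "error", "missing"]


def classify_file(in_bucket: bool, claimed: bool, passed: Optional[bool]) -> FileState:
    """Hypothetical helper mapping three observations to a state code.

    in_bucket: the file is present in the bucket.
    claimed:   the client claims to have uploaded it.
    passed:    validation result, or None if validation has not run yet.
    """
    if not in_bucket:
        # Not in bucket: the client's claim decides between the two codes.
        return "missing" if claimed else "waiting"
    if passed is None:
        return "queued"
    return "validated" if passed else "invalid"
```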
ValidationState
ValidationState(index: FileIndex, label: Label, reader: S3TSVReader, transfer_timeout: float = 600, missing_timeout: float = 240, max_failures: int = 10)
The validation pipeline's model of the current state of the transfer.
Attributes are described inline at the bottom of the class definition.
Should typically only be used by fast.validation.ValidationSession.
Design note
This class is primarily 'informational'. It doesn't need to be a bare namespace (parsing and munging methods are acceptable), but it should not sprout methods that execute pipeline tasks.
client_absent
class-attribute
instance-attribute
client_absent: bool = False
True if the client application hasn't told us it stopped, but has been silent for longer than transfer_timeout.
client_missing
class-attribute
instance-attribute
client_missing: bool = False
True if the client application hasn't told us it stopped, but has been silent for longer than missing_timeout.
client_on
property
client_on: bool | None
True if we believe the client application is running, False if not, None if we don't think we know. This is directly derived from client_absent and client_status.
client_status
class-attribute
instance-attribute
client_status: ClientStatus = 'unchecked'
Our belief about the general condition of the client application, based strictly on what it has and has not reported. Our timeouts are a backup for cases in which the client stops working but is unable to log this fact (e.g. logging bug, power outage, OS-level kill). This attribute, together with client_absent and session_invalid, tells us whether we should shut down the validation pipeline.
done
property
done: bool
Do we appear to have finished the transfer and validation process, successfully or otherwise?
extra_files
instance-attribute
extra_files: list[str]
Files in the bucket but not in the index, which the client does not claim to have uploaded during this session. This condition might indicate a prior erroneous upload or an incomplete index, and these should ideally be managed by the client.
fileframe
class-attribute
instance-attribute
fileframe: FileIndex | None = None
Parsed file index as produced by fast.validation.parse_index_file().
last_log
instance-attribute
last_log: DataFrame
Last parsed log chunk, or an empty DataFrame if none has been read yet.
last_time
instance-attribute
last_time: float
Timestamp (as UNIX epoch time) of last client message, or, if there aren't any yet, of this object's initialization.
log
class-attribute
instance-attribute
log: DataFrame = None
Parsed log read so far, or an empty DataFrame if none has been read yet.
log_tail
instance-attribute
log_tail: list[str]
Container for streaming chunks from the client log.
logbuf
instance-attribute
logbuf: StringIO
Buffer of concatenated, unparsed text read from the log.
missing_timeout
instance-attribute
missing_timeout: float
How many seconds we will allow to elapse between messages from the client before deciding something funny might be going on; we will log it and prepare cleanup tasks but not fully shut down until we hit transfer_timeout.
n_completed
class-attribute
instance-attribute
n_completed: int = 0
Number of files that have completed transfer (not necessarily passed validation)
n_failures
class-attribute
instance-attribute
n_failures: int = 0
Number of files that have failed an upload attempt or failed to pass validation
reader
instance-attribute
reader: S3TSVReader
Object responsible for reading and parsing the client log.
session_invalid
class-attribute
instance-attribute
session_invalid: bool = False
Have we encountered enough errors that we are going to ask the client to shut down, and refuse to validate any more files?
should_continue
property
should_continue: bool
Does it look like we should keep working or not?
transfer_complete
property
transfer_complete: bool
Has the client completed its transfer (valid or not)?
transfer_timeout
instance-attribute
transfer_timeout: float
How many seconds we will allow to elapse between messages from the client before deciding that they've quit without telling us, and shut ourselves down after completing any pending validation tasks.
wrong_files
instance-attribute
wrong_files: list[str]
Files not in index that client claims to have uploaded (or tried to upload) during this session. (Something has gone wrong!)
check_timeout
check_timeout() -> bool
Update our knowledge of the client's missing/absent status and its last message time.
Returns:
- bool: True if the client is absent, False if not.
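The two thresholds described under missing_timeout and transfer_timeout amount to a two-tier silence check. A minimal sketch follows, assuming only that both flags come from comparing elapsed time since last_time against the thresholds. The function name `check_silence` and its `now` parameter are hypothetical, and the actual method may do more (for example, refreshing last_time from the log).

```python
import time
from typing import Optional, Tuple


def check_silence(
    last_time: float,
    missing_timeout: float = 240.0,
    transfer_timeout: float = 600.0,
    now: Optional[float] = None,
) -> Tuple[bool, bool]:
    """Hypothetical helper returning (client_missing, client_absent).

    last_time is the UNIX epoch time of the last client message.
    now can be injected for testing; it defaults to the current time.
    """
    current = time.time() if now is None else now
    silent_for = current - last_time
    # "Missing" is the softer warning; "absent" triggers shutdown.
    return silent_for > missing_timeout, silent_for > transfer_timeout
```

With the default thresholds, 300 seconds of silence would mark the client as missing but not yet absent.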
stop
stop() -> None
Stop the object, which is to say: stop its reader.
update
update() -> tuple[bool, bool, bool, list[str]]
Update our knowledge of the progress of the transfer by examining the client log and checking our timeout thresholds.
Returns:
- any_updates (bool): True if there are any messages other than keepalive entries; False if there aren't
- stopped_running (bool): True if report or timeout indicates that client has stopped running for whatever reason (or is in the process of it), False if not
- any_problems (bool): True if there are any error messages or impermissible transfers, False if there aren't
- valid_transfers (list[str]): List of keys that appear to represent permissible transfers
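One way a caller might consume the four-element return value is sketched below. `ScriptedState` is a hypothetical stand-in that replays canned tuples so the example runs without a bucket or client log, and `poll_until_stopped` is likewise invented for illustration. Neither reflects how fast.validation.ValidationSession actually drives this class.

```python
class ScriptedState:
    """Hypothetical stand-in that replays canned update() results."""

    def __init__(self, results):
        self._results = iter(results)

    def update(self):
        return next(self._results)


def poll_until_stopped(state, max_polls=100):
    """Call update() until the client is reported as stopped.

    Returns the accumulated valid transfer keys and whether any
    poll reported a problem.
    """
    valid_keys = []
    saw_problems = False
    for _ in range(max_polls):
        any_updates, stopped_running, any_problems, valid_transfers = state.update()
        valid_keys.extend(valid_transfers)
        saw_problems = saw_problems or any_problems
        if stopped_running:
            break
    return valid_keys, saw_problems


state = ScriptedState(
    [
        (False, False, False, []),          # keepalive only
        (True, False, False, ["a.tsv"]),    # one permissible transfer
        (True, True, True, ["b.tsv"]),      # client stopped, with a problem
    ]
)
keys, problems = poll_until_stopped(state)
```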