
State

FileIndex module-attribute

FileIndex: TypeAlias = pd.DataFrame

Alias for a DataFrame created from a file index and a dataset label. It has the following columns:

  • path: path to file from bucket / tree root.
  • filename: name of the file
  • type: file type (if any) as inferred from filename, based on definitions in label
  • state: general state of the file; legal values are the members of FileState.
  • n_fail: number of failed upload attempts as reported by client
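As a rough illustration of the column layout above, the following sketch builds a small frame by hand. The row values (paths, types, counts) are invented for the example; in practice the frame is produced from a file index and a dataset label, not constructed manually.

```python
import pandas as pd

# Hypothetical rows for illustration only; real values come from the
# file index and the dataset label.
index = pd.DataFrame(
    {
        "path": ["raw/a.dat", "raw/b.dat"],
        "filename": ["a.dat", "b.dat"],
        "type": ["data", None],          # None: no type inferred
        "state": ["waiting", "queued"],  # must be members of FileState
        "n_fail": [0, 2],
    }
)

# Select the paths of files that are in the bucket and pending validation.
pending = index.loc[index["state"] == "queued", "path"].tolist()
```

Because FileIndex is only an alias for pd.DataFrame, ordinary pandas selection works on it directly.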

FileState module-attribute

FileState = Literal['waiting', 'validated', 'queued', 'invalid', 'error', 'missing']

Codes for validation states of a file.

Meanings are:

  • waiting: Not in bucket, client does not claim to have uploaded it
  • validated: In bucket, passed validation
  • queued: In bucket, pending validation
  • invalid: In bucket, failed validation
  • missing: Not in bucket, client claims to have uploaded it
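The meanings above amount to a small decision table over whether the file is in the bucket, whether the client claims to have uploaded it, and the validation outcome. A minimal sketch of that table follows; the `classify` helper and its parameters are hypothetical and not part of this module, and it never returns 'error' because the list above does not define that state.

```python
from typing import Literal, Optional

FileState = Literal["waiting", "validated", "queued", "invalid", "error", "missing"]


def classify(in_bucket: bool, claimed: bool, passed: Optional[bool] = None) -> FileState:
    """Hypothetical helper mapping observations to a FileState.

    `passed` is None while validation is pending, otherwise the result.
    """
    if not in_bucket:
        # Not in bucket: the state depends only on what the client claims.
        return "missing" if claimed else "waiting"
    if passed is None:
        return "queued"
    return "validated" if passed else "invalid"
```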

ValidationState

ValidationState(index: FileIndex, label: Label, reader: S3TSVReader, transfer_timeout: float = 600, missing_timeout: float = 240, max_failures: int = 10)

The validation pipeline's model of the current state of the transfer.

Attributes are described inline at the bottom of the class definition.

Should typically only be used by fast.validation.ValidationSession.

Design note

This class is primarily 'informational'. It doesn't need to be a bare namespace (parsing and munging methods are acceptable), but it should not sprout methods that execute pipeline tasks.

client_absent class-attribute instance-attribute

client_absent: bool = False

True if the client application didn't tell us it stopped, but it has been silent for longer than transfer_timeout seconds.

client_missing class-attribute instance-attribute

client_missing: bool = False

True if the client application didn't tell us it stopped, but it has been silent for longer than missing_timeout seconds.

client_on property

client_on: bool | None

True if we believe the client application is running, False if we believe it is not, None if we cannot tell. This is derived directly from client_absent and client_status.

client_status class-attribute instance-attribute

client_status: ClientStatus = 'unchecked'

Our belief about the general condition of the client application, based strictly on what it has and has not reported. Our timeouts are a backup for cases in which the client stops working but is unable to log this fact (e.g. logging bug, power outage, OS-level kill). This attribute, together with client_absent and session_invalid, tells us whether we should shut down the validation pipeline.

done property

done: bool

Do we appear to have finished the transfer and validation process, successfully or otherwise?

extra_files instance-attribute

extra_files: list[str]

Files not in index but in bucket, and client does not claim to have uploaded them during this session. This condition might indicate a prior erroneous upload or an incomplete index, and these should ideally be managed by the client.

fileframe class-attribute instance-attribute

fileframe: FileIndex | None = None

Parsed file index as produced by fast.validation.parse_index_file()

last_log instance-attribute

last_log: DataFrame

Last parsed log chunk, or an empty DataFrame if none has been read yet

last_time instance-attribute

last_time: float

Timestamp (as UNIX epoch time) of last client message, or, if there aren't any yet, of this object's initialization.

log class-attribute instance-attribute

log: DataFrame = None

Parsed log read so far, or an empty DataFrame if none has been read yet

log_tail instance-attribute

log_tail: list[str]

Container for streaming chunks from the client log.

logbuf instance-attribute

logbuf: StringIO

Buffer of concatenated, unparsed text read from the log

missing_timeout instance-attribute

missing_timeout: float

How many seconds we will allow to elapse between messages from the client before deciding something funny might be going on; we will log it and prepare cleanup tasks but not fully shut down until we hit transfer_timeout.

n_completed class-attribute instance-attribute

n_completed: int = 0

Number of files that have completed transfer (not necessarily passed validation)

n_failures class-attribute instance-attribute

n_failures: int = 0

Number of files that have failed an upload attempt or failed to pass validation

reader instance-attribute

reader: S3TSVReader

Object responsible for reading and parsing the client log.

session_invalid class-attribute instance-attribute

session_invalid: bool = False

Have we encountered enough errors that we are going to ask the client to shut down, and refuse to validate any more files?

should_continue property

should_continue: bool

Does it look like we should keep working or not?

transfer_complete property

transfer_complete: bool

Has the client completed its transfer (valid or not)?

transfer_timeout instance-attribute

transfer_timeout: float

How many seconds we will allow to elapse between messages from the client before deciding that it has quit without telling us, at which point we shut ourselves down after completing any pending validation tasks.

wrong_files instance-attribute

wrong_files: list[str]

Files not in index that client claims to have uploaded (or tried to upload) during this session. (Something has gone wrong!)

check_timeout

check_timeout() -> bool

Update our knowledge of the client's missing/absent status and its last message time.

Returns:

  • bool

    True if the client is absent, False if not.
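The two thresholds form a two-tier check: exceeding missing_timeout marks the client as missing, and exceeding the longer transfer_timeout marks it as absent. The sketch below shows that logic in isolation, assuming both flags are computed from the time elapsed since last_time. The `timeout_flags` function and its `now` parameter are hypothetical; the actual method may differ in detail.

```python
import time
from typing import Optional, Tuple


def timeout_flags(
    last_time: float,
    missing_timeout: float = 240.0,
    transfer_timeout: float = 600.0,
    now: Optional[float] = None,
) -> Tuple[bool, bool]:
    """Hypothetical helper: return (client_missing, client_absent).

    `last_time` is a UNIX epoch timestamp of the last client message.
    `now` can be injected for testing; it defaults to the current time.
    """
    now = time.time() if now is None else now
    elapsed = now - last_time
    # "Longer than" the threshold is read here as a strict comparison.
    client_missing = elapsed > missing_timeout
    client_absent = elapsed > transfer_timeout
    return client_missing, client_absent
```

With the default thresholds, a client silent for 300 seconds would be flagged as missing but not yet absent.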

stop

stop() -> None

Stop the object, which is to say: stop its reader.

update

update() -> tuple[bool, bool, bool, list[str]]

Update our knowledge of the progress of the transfer by examining the client log and checking our timeout thresholds.

Returns:

  • any_updates ( bool ) –

    True if there are any messages other than keepalive entries; False if there aren't

  • stopped_running ( bool ) –

    True if report or timeout indicates that client has stopped running for whatever reason (or is in the process of it), False if not

  • any_problems ( bool ) –

    True if there are any error messages or impermissible transfers, False if there aren't

  • valid_transfers ( list[str] ) –

    List of keys that appear to represent permissible transfers
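A caller would typically unpack the four return values on each poll. The sketch below shows one possible polling loop. `FakeState` is a hypothetical stand-in that mimics only the documented return shape of update(), and `drain` is an illustrative helper, not part of this module. A real loop would also wait between polls and hand the keys to validation.

```python
from typing import Iterable, List, Tuple

UpdateResult = Tuple[bool, bool, bool, List[str]]


class FakeState:
    """Hypothetical stand-in that replays canned update() results."""

    def __init__(self, results: Iterable[UpdateResult]) -> None:
        self._results = iter(results)

    def update(self) -> UpdateResult:
        return next(self._results)


def drain(state) -> Tuple[List[str], bool]:
    """Poll until the client is reported as stopped.

    Returns the collected transfer keys and whether any problem was seen.
    """
    keys: List[str] = []
    problems = False
    while True:
        any_updates, stopped_running, any_problems, valid_transfers = state.update()
        keys.extend(valid_transfers)
        problems = problems or any_problems
        if stopped_running:
            return keys, problems
```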