utils

add_missing_categories_and_flatten(grouped_gsf_features, groupby_fields, groupby_type_)

Add missing categories and flatten the grouped GSF features.

Parameters:

  grouped_gsf_features (DataFrame): The grouped GSF features. Required.
  groupby_fields (list): The fields to group by. Required.
  groupby_type_ (str): The type of grouping. Required.

Returns:

  dict[str, int | float | np.float64]: The flattened GSF features.

Source code in src/data/utils.py, lines 268-324
def add_missing_categories_and_flatten(
    grouped_gsf_features: pd.DataFrame,
    groupby_fields: list[
        float | int | str | np.int64 | None | np.float64 | pd._libs.missing.NAType
    ],
    groupby_type_: str,
) -> dict[str, int | float | np.float64]:
    """
    Add missing categories and flatten the grouped GSF features.

    Args:
        grouped_gsf_features (pd.DataFrame): The grouped GSF features.
        groupby_fields (list): The fields to group by.
        groupby_type_ (str): The type of grouping.

    Returns:
        dict[str, int | float | np.float64]: The flattened GSF features.
    """
    new_index = (
        grouped_gsf_features.index.union(
            pd.Index(groupby_fields),
        )
        .drop_duplicates()
        .dropna()
    )
    if len(groupby_fields) < len(new_index):
        logger.warning(
            f'Missing categories: {new_index.difference(groupby_fields)} in {
                groupby_type_
            }!',
        )
    grouped_gsf_features = grouped_gsf_features.reindex(
        new_index,
        fill_value=0,
    )
    grouped_df_reset = grouped_gsf_features.reset_index()

    melted_ = grouped_df_reset.melt(
        # Use the first column as the id_vars
        id_vars=grouped_df_reset.columns[0],
        var_name='variable',  # Name of the new variable column
        value_name='value',  # Name of the new value column
    )
    # Prefix the feature name with groupby_type_ so feature names are unique per grouping type
    melted_['feature_name'] = (
        groupby_type_ + '_' + melted_['index'].astype(str) + '_' + melted_['variable']
    )
    res_df = (
        melted_[['feature_name', 'value']]
        .set_index(
            'feature_name',
        )
        .sort_index()
    )
    # create a dict {feature_name: value}
    res_dict = res_df.to_dict()['value']
    return res_dict
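
A minimal usage sketch (assuming the function can be imported from src.data.utils; the category codes and the grouped means below are invented for illustration). Starting from per-category means of two GSF-style features, the function reindexes over the full category list, fills missing categories with 0, and flattens the result into a {feature_name: value} dict:

import pandas as pd

from src.data.utils import add_missing_categories_and_flatten  # import path assumed

# Hypothetical per-category means of two features, grouped by ptb_pos;
# only categories 1 (NOUN) and 2 (VERB) were observed in this trial.
grouped = pd.DataFrame(
    {'IA_DWELL_TIME': [210.0, 180.5], 'IA_FIXATION_COUNT': [1.4, 1.1]},
    index=pd.Index([1, 2], name='ptb_pos'),
)

flat = add_missing_categories_and_flatten(
    grouped_gsf_features=grouped,
    groupby_fields=[0, 1, 2, 3, 4],  # full expected category set
    groupby_type_='ptb_pos',
)

# Keys are '<groupby_type_>_<category>_<feature>'; unobserved categories get 0.
print(flat['ptb_pos_1_IA_DWELL_TIME'])  # 210.0
print(flat['ptb_pos_0_IA_DWELL_TIME'])  # 0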

add_missing_features(et_data, trial_groupby_columns, mode)

Add and transform features in the given DataFrame.

This function adds and transforms several features in the DataFrame. It also creates new features based on existing ones.

Parameters:

  et_data (DataFrame): The input DataFrame. Required. It should have the following columns:
    - ptb_pos
    - is_content_word
    - word_length
    - NEXT_FIX_INTEREST_AREA_INDEX
    - CURRENT_FIX_INTEREST_AREA_INDEX
    - IA_REGRESSION_IN_COUNT
    - IA_REGRESSION_OUT_FULL_COUNT
    - IA_FIXATION_COUNT
  trial_groupby_columns (list): A list of column names to group by when calculating sums. Required.
  mode (DataType): The data type; the fixation-specific features below are only added when mode is DataType.FIXATIONS. Required.

Returns:

  pd.DataFrame: The DataFrame with added and transformed features.

The function creates the following new features:

  • ptb_pos: Transformed from categorical to numerical using a mapping dictionary.
  • is_content_word: Converted to integer type.
  • is_reg: Whether the next fixation interest area index is less than the current one.
  • is_progressive: Whether the next fixation interest area index is greater than the current one.
  • is_reg_sum: The sum of is_reg for each group defined by trial_groupby_columns.
  • is_progressive_sum: The sum of is_progressive for each group defined by trial_groupby_columns.
  • IA_REGRESSION_IN_COUNT_sum: The sum of IA_REGRESSION_IN_COUNT for each group defined by trial_groupby_columns.
  • normalized_outgoing_regression_count: The ratio of IA_REGRESSION_OUT_FULL_COUNT to is_reg_sum.
  • normalized_outgoing_progressive_count: The ratio of the difference between IA_FIXATION_COUNT and IA_REGRESSION_OUT_FULL_COUNT to is_progressive_sum.
  • normalized_incoming_regression_count: The ratio of IA_REGRESSION_IN_COUNT to IA_REGRESSION_IN_COUNT_sum.

The following features are used for Syntactic Clusters with Universal Dependencies PoS and Information Clusters [Berzak et al. 2017]:

  • LengthCategory: The length category of the word based on the word_length column.
  • LengthCategory_normalized_IA_DWELL_TIME: IA_DWELL_TIME normalized by the mean IA_DWELL_TIME of the LengthCategory group.
  • universal_pos_normalized_IA_DWELL_TIME: IA_DWELL_TIME normalized by the mean IA_DWELL_TIME of the universal_pos group.
  • LengthCategory_normalized_IA_FIRST_FIXATION_DURATION: IA_FIRST_FIXATION_DURATION normalized by the mean IA_FIRST_FIXATION_DURATION of the LengthCategory group.
  • universal_pos_normalized_IA_FIRST_FIXATION_DURATION: IA_FIRST_FIXATION_DURATION normalized by the mean IA_FIRST_FIXATION_DURATION of the universal_pos group.
Source code in src/data/utils.py, lines 139-265
def add_missing_features(
    et_data: pd.DataFrame,
    trial_groupby_columns: list[str],
    mode: DataType,
) -> pd.DataFrame:
    """
    Add and transform features in the given DataFrame.

    This function adds and transforms several features in the DataFrame. It also creates
    new features based on existing ones.

    Args:
        et_data (pd.DataFrame): The input DataFrame. It should have the following columns:
            - ptb_pos
            - is_content_word
            - NEXT_FIX_INTEREST_AREA_INDEX
            - CURRENT_FIX_INTEREST_AREA_INDEX
            - IA_REGRESSION_IN_COUNT
            - IA_REGRESSION_OUT_FULL_COUNT
            - IA_FIXATION_COUNT
            - word_length
        trial_groupby_columns (list): A list of column names to group by when calculating sums.
        mode (DataType): The data type; fixation-specific features are only added when
            mode == DataType.FIXATIONS.

    Returns:
        pd.DataFrame: The DataFrame with added and transformed features.
        The function creates the following new features:
            - ptb_pos: Transformed from categorical to numerical using a mapping dictionary.
            - is_content_word: Converted to integer type.
            - is_reg: Whether the next fixation interest area index is less than the current one.
            - is_progressive: Whether the next fixation IA index is greater than the current one.
            - is_reg_sum: The sum of is_reg for each group defined by trial_groupby_columns.
            - is_progressive_sum:
                The sum of is_progressive for each group defined by trial_groupby_columns.
            - IA_REGRESSION_IN_COUNT_sum:
                The sum of IA_REGRESSION_IN_COUNT for each group defined by trial_groupby_columns.
            - normalized_outgoing_regression_count:
                The ratio of IA_REGRESSION_OUT_FULL_COUNT to is_reg_sum.
            - normalized_outgoing_progressive_count:
                The ratio of the difference between IA_FIXATION_COUNT and
                IA_REGRESSION_OUT_FULL_COUNT to is_progressive_sum.
            - normalized_incoming_regression_count:
                The ratio of IA_REGRESSION_IN_COUNT to IA_REGRESSION_IN_COUNT_sum.
            # These are used for Syntactic Clusters with
            # Universal Dependencies PoS and Information Clusters [Berzak et al. 2017]
            - LengthCategory:
                The length category of the word based on the word_length column.
            - LengthCategory_normalized_IA_DWELL_TIME:
                IA_DWELL_TIME normalized by the mean IA_DWELL_TIME of the LengthCategory group.
            - universal_pos_normalized_IA_DWELL_TIME:
                IA_DWELL_TIME normalized by the mean IA_DWELL_TIME of the universal_pos group.
            - LengthCategory_normalized_IA_FIRST_FIXATION_DURATION:
                IA_FIRST_FIXATION_DURATION normalized by the mean IA_FIRST_FIXATION_DURATION of the
                LengthCategory group.
            - universal_pos_normalized_IA_FIRST_FIXATION_DURATION:
                IA_FIRST_FIXATION_DURATION normalized by the mean IA_FIRST_FIXATION_DURATION of the
                universal_pos group.
    """
    # Map ptb_pos values to numbers
    value_to_number = {'FUNC': 0, 'NOUN': 1, 'VERB': 2, 'ADJ': 3, 'UNKNOWN': 4}
    et_data['ptb_pos'] = et_data['ptb_pos'].map(value_to_number)

    # Convert is_content_word to integer
    et_data['is_content_word'] = et_data['is_content_word'].astype('Int64')

    # TODO Add reference to a paper for these bins?
    # Define the boundaries of the bins by word length
    bins = [0, 2, 5, 11, np.inf]  # 0-1, 2-4, 5-10, 11+
    # Define the labels for the bins
    labels = [0, 1, 2, 3]
    et_data['LengthCategory'] = pd.cut(
        et_data['word_length'],
        bins=bins,
        labels=labels,
        right=False,
    )

    if mode == DataType.FIXATIONS:
        # Add is_reg and is_progressive features
        et_data['is_reg'] = (
            et_data['NEXT_FIX_INTEREST_AREA_INDEX']
            < et_data['CURRENT_FIX_INTEREST_AREA_INDEX']
        )
        et_data['is_progressive'] = (
            et_data['NEXT_FIX_INTEREST_AREA_INDEX']
            > et_data['CURRENT_FIX_INTEREST_AREA_INDEX']
        )

        # Calculate sums for is_reg, is_progressive, and IA_REGRESSION_IN_COUNT
        grouped_sums = et_data.groupby(trial_groupby_columns)[
            ['is_reg', 'is_progressive', 'IA_REGRESSION_IN_COUNT']
        ].transform('sum')

        # Add sum features
        et_data['is_reg_sum'] = grouped_sums['is_reg']
        et_data['is_progressive_sum'] = grouped_sums['is_progressive']
        et_data['IA_REGRESSION_IN_COUNT_sum'] = grouped_sums['IA_REGRESSION_IN_COUNT']

        # Add normalized count features
        et_data['normalized_outgoing_regression_count'] = (
            et_data['IA_REGRESSION_OUT_FULL_COUNT'] / et_data['is_reg_sum']
        )
        et_data['normalized_outgoing_progressive_count'] = (
            et_data['IA_FIXATION_COUNT'] - et_data['IA_REGRESSION_OUT_FULL_COUNT']
        ) / et_data['is_progressive_sum']  # approximation
        et_data['normalized_incoming_regression_count'] = (
            et_data['IA_REGRESSION_IN_COUNT'] / et_data['IA_REGRESSION_IN_COUNT_sum']
        )
        et_data = et_data.replace([np.inf, -np.inf], 0)
        et_data.fillna(
            {
                'normalized_outgoing_regression_count': 0,
                'normalized_outgoing_progressive_count': 0,
                'normalized_incoming_regression_count': 0,
            },
            inplace=True,
        )

    et_data.fillna(
        {
            'LengthCategory_normalized_IA_DWELL_TIME': 0,
            'universal_pos_normalized_IA_DWELL_TIME': 0,
            'LengthCategory_normalized_IA_FIRST_FIXATION_DURATION': 0,
            'universal_pos_normalized_IA_FIRST_FIXATION_DURATION': 0,
        },
        inplace=True,
    )

    return et_data
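
As a small worked illustration of the word-length binning performed above (the word lengths are invented; the bins and labels mirror the ones in the function), pd.cut with right=False assigns lengths 0-1, 2-4, 5-10 and 11+ to categories 0-3:

import numpy as np
import pandas as pd

word_length = pd.Series([1, 3, 7, 12])          # toy word lengths
bins = [0, 2, 5, 11, np.inf]                    # [0, 2), [2, 5), [5, 11), [11, inf)
labels = [0, 1, 2, 3]

length_category = pd.cut(word_length, bins=bins, labels=labels, right=False)
print(length_category.tolist())                 # [0, 1, 2, 3]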

compute_fixation_trial_level_features(trial, groupby_mappings, processed_data_path)

Compute fixation trial-level features.

Parameters:

  trial (DataFrame): The trial data. Required.
  groupby_mappings (list[tuple]): The groupby mappings for categorical features. Required.
  processed_data_path (Path): The path to save the trial level feature names. Required.

Returns:

  dict: The computed features, keyed by feature name.

Source code in src/data/utils.py, lines 415-544
def compute_fixation_trial_level_features(
    trial: pd.DataFrame, groupby_mappings: list[tuple], processed_data_path: Path
) -> dict:
    """
    Compute fixation trial-level features.

    Args:
        trial (pd.DataFrame): The trial data.
        groupby_mappings (list[tuple]): The groupby mappings for categorical features.
        processed_data_path (Path): The path to save the trial level feature names.

    Returns:
        dict: The computed features.
    """

    RF = {}
    for column in numerical_fixation_trial_columns:
        if column in trial.columns:
            for aggregation_method in numerical_feature_aggregations:
                key = 'fix_feature_' + aggregation_method + '_' + column
                val = get_feature_from_list(
                    trial[column].replace('.', np.nan).astype(float),
                    aggregation_method,
                )
                RF.update({key: val})

    ####### David Eyes Only ######
    BEYELSTM = {}
    gaze_features = get_gaze_entropy_features(
        x_means=trial['CURRENT_FIX_X'].values,  # type: ignore
        y_means=trial['CURRENT_FIX_Y'].values,  # type: ignore
    )
    BEYELSTM.update(gaze_features)

    BEYELSTM['total_num_fixations'] = len(trial)
    BEYELSTM['total_num_words'] = (
        trial['TRIAL_IA_COUNT'].drop_duplicates().dropna().values[0]
    )

    ##### David #####
    # Creates
    # 'LengthCategory_normalized_IA_FIRST_FIXATION_DURATION',
    # 'LengthCategory_normalized_IA_DWELL_TIME',
    # 'universal_pos_normalized_IA_DWELL_TIME',
    # 'universal_pos_normalized_IA_FIRST_FIXATION_DURATION',
    for cluster_by in ['LengthCategory', 'universal_pos']:
        # FutureWarning: The default of observed=False is deprecated and will be changed to True
        # in a future version of pandas. Pass observed=False to retain current behavior or
        # observed=True to adopt the future default and silence this warning.
        try:
            grouped_means = trial.groupby(cluster_by, observed=False)[
                ['IA_DWELL_TIME', 'IA_FIRST_FIXATION_DURATION']
            ].transform('mean')
        except IndexError:
            # Fall back to all-NaN means so the normalized columns below become NaN
            # instead of failing on an unbound grouped_means.
            grouped_means = pd.DataFrame(
                np.nan,
                index=trial.index,
                columns=['IA_DWELL_TIME', 'IA_FIRST_FIXATION_DURATION'],
            )
        for et_measure in ['IA_DWELL_TIME', 'IA_FIRST_FIXATION_DURATION']:
            trial[f'{cluster_by}_normalized_{et_measure}'] = (
                trial[et_measure] / grouped_means[et_measure]
            )
    # No. values in each groupby type:
    # is_content_word 2 (in beyelstm originally 3)
    # ptb_pos 5 (in beyelstm originally 5)
    # entity_type 20 (in beyelstm originally 11)
    # universal_pos 17 (in beyelstm originally 16)
    for groupby_type_, groupby_fields in groupby_mappings:
        # TODO This shouldn't be hardcoded here
        if groupby_type_ == 'ptb_pos':
            value_to_number = {
                'FUNC': 0,
                'NOUN': 1,
                'VERB': 2,
                'ADJ': 3,
                'UNKNOWN': 4,
            }
            trial['ptb_pos'] = trial['ptb_pos'].map(value_to_number)
        grouped_gsf_features = trial.groupby(groupby_type_)[gsf_features].mean()
        melted_gsf_features = add_missing_categories_and_flatten(
            grouped_gsf_features=grouped_gsf_features,
            groupby_fields=groupby_fields,
            groupby_type_=groupby_type_,
        )
        for feature_name, feature_value in melted_gsf_features.items():
            BEYELSTM[feature_name] = feature_value

    SVM = {}
    # mean saccade duration -> mean "NEXT_SAC_DURATION"
    to_compute_features = [
        'NEXT_SAC_DURATION',
        'NEXT_SAC_AVG_VELOCITY',
        'NEXT_SAC_AMPLITUDE',
    ]
    for feature_to_compute in to_compute_features:
        SVM[feature_to_compute + '_mean'] = trial[feature_to_compute].mean()
        SVM[feature_to_compute + '_max'] = trial[feature_to_compute].max()

    # Diane
    LOGISTIC = {}
    LOGISTIC['CURRENT_FIX_DURATION_mean'] = trial['CURRENT_FIX_DURATION'].mean()

    # mean forward saccade length:
    # * "normalized_ID_plus_1" = "normalized_ID" of the next fixation (row)
    trial['normalized_ID_plus_1'] = trial['normalized_ID'].shift(-1)
    # * mean "NEXT_SAC_AMPLITUDE" where "normalized_ID_plus_1" > "normalized_ID"
    forward_saccade_length = trial[
        trial['normalized_ID_plus_1'] > trial['normalized_ID']
    ]['NEXT_SAC_AMPLITUDE'].mean()
    LOGISTIC['forward_saccade_length_mean'] = forward_saccade_length

    # regression rate - backward saccade rate
    # * using "normalized_ID_plus_1" = "normalized_ID" of the next fixation (row)
    # * regression rate - % of rows where "normalized_ID_plus_1" < "normalized_ID"
    regression_rate = (
        trial['normalized_ID_plus_1'] < trial['normalized_ID']
    ).sum() / len(trial)
    LOGISTIC['regression_rate'] = regression_rate

    features_dict = {
        'RF': RF,
        'BEYELSTM': BEYELSTM,
        'SVM': SVM,
        'LOGISTIC': LOGISTIC,
    }
    save_feature_names_if_do_not_exist(
        features_dict=features_dict,
        csv_path=processed_data_path / 'fixation_trial_level_feature_keys.csv',
        mode=DataType.FIXATIONS,
    )

    return RF | BEYELSTM | SVM | LOGISTIC
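
The forward-saccade-length and regression-rate logic of the LOGISTIC block above can be traced on a toy fixation sequence (a minimal sketch; all values are invented):

import numpy as np
import pandas as pd

# normalized_ID is the word index of each fixation; NEXT_SAC_AMPLITUDE is the
# amplitude of the saccade leaving it. The last fixation has no outgoing saccade.
trial = pd.DataFrame(
    {
        'normalized_ID': [1, 2, 4, 3, 5],
        'NEXT_SAC_AMPLITUDE': [1.0, 2.0, 1.5, 2.5, np.nan],
    }
)

# Same logic as in the function: compare each fixation's word index with the next one.
trial['normalized_ID_plus_1'] = trial['normalized_ID'].shift(-1)
forward_saccade_length = trial[
    trial['normalized_ID_plus_1'] > trial['normalized_ID']
]['NEXT_SAC_AMPLITUDE'].mean()
regression_rate = (
    trial['normalized_ID_plus_1'] < trial['normalized_ID']
).sum() / len(trial)

print(forward_saccade_length)  # (1.0 + 2.0 + 2.5) / 3 ≈ 1.83
print(regression_rate)         # 1 regression out of 5 fixations = 0.2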

compute_ia_trial_level_features(trial, processed_data_path)

Compute IA trial-level features.

Parameters:

  trial (DataFrame): The trial data. Required.
  processed_data_path (Path): The path to save the trial level feature names. Required.

Returns:

  dict: The computed features, keyed by feature name.

Source code in src/data/utils.py, lines 554-621
def compute_ia_trial_level_features(
    trial: pd.DataFrame, processed_data_path: Path
) -> dict:
    """
    Compute IA trial-level features.

    Args:
        trial (pd.DataFrame): The trial data.
        processed_data_path (Path): The path to save the trial level feature names.

    Returns:
        dict: The computed features.
    """

    RF = {}
    for column in numerical_ia_trial_columns:
        if column in trial.columns:
            for aggregation_method in numerical_feature_aggregations:
                RF.update(
                    {
                        'ia_feature_'
                        + aggregation_method
                        + '_'
                        + column: get_feature_from_list(
                            trial[column].astype(float), aggregation_method
                        )
                    }
                )

    SVM = {}
    SVM.update(
        {
            'skip_rate': trial['total_skip'].mean(),
            'num_of_fixations': trial['IA_FIXATION_COUNT'].sum(),
            'mean_TFD': trial['IA_DWELL_TIME'].mean(),
        }
    )

    # Diane
    # https://tmalsburg.github.io/MeziereEtAl2021MS.pdf
    # go-past time (i.e., the sum of fixations on a word up to when it
    # is exited to its right, including all regressions to the left of the word
    LOGISTIC = {}
    LOGISTIC.update(
        {
            'first_pass_skip_rate': trial['IA_SKIP'].mean(),
            'mean_FFD': trial['IA_FIRST_FIXATION_DURATION'].mean(),
            'mean_GD': trial['IA_FIRST_RUN_DWELL_TIME'].mean(),
            'mean_TFD': trial['IA_DWELL_TIME'].mean(),
            'mean_go_past_time': trial['IA_SELECTIVE_REGRESSION_PATH_DURATION'].mean(),
            'reading_speed': calc_reading_speed(trial),
        }
    )

    READING_SPEED = {'reading_speed': calc_reading_speed(trial)}

    features_dict = {
        'RF': RF,
        'SVM': SVM,
        'LOGISTIC': LOGISTIC,
        'READING_SPEED': READING_SPEED,
    }
    save_feature_names_if_do_not_exist(
        features_dict=features_dict,
        csv_path=processed_data_path / 'ia_trial_level_feature_keys.csv',
        mode=DataType.IA,
    )

    return RF | SVM | LOGISTIC | READING_SPEED
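
The LOGISTIC block above maps standard reading measures onto IA report columns (FFD: first fixation duration, GD: gaze duration, i.e. first-run dwell time, TFD: total fixation duration, go-past time: selective regression path duration). A minimal sketch with invented word-level values; calc_reading_speed and the RF aggregations are left out because they rely on helpers defined elsewhere in the module:

import pandas as pd

# Invented IA-report rows for a three-word trial.
trial = pd.DataFrame(
    {
        'IA_SKIP': [0, 1, 0],
        'IA_FIRST_FIXATION_DURATION': [200.0, 0.0, 180.0],
        'IA_FIRST_RUN_DWELL_TIME': [220.0, 0.0, 180.0],
        'IA_DWELL_TIME': [350.0, 0.0, 180.0],
        'IA_SELECTIVE_REGRESSION_PATH_DURATION': [220.0, 0.0, 260.0],
    }
)

logistic_features = {
    'first_pass_skip_rate': trial['IA_SKIP'].mean(),
    'mean_FFD': trial['IA_FIRST_FIXATION_DURATION'].mean(),
    'mean_GD': trial['IA_FIRST_RUN_DWELL_TIME'].mean(),
    'mean_TFD': trial['IA_DWELL_TIME'].mean(),
    'mean_go_past_time': trial['IA_SELECTIVE_REGRESSION_PATH_DURATION'].mean(),
}
print(logistic_features)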

compute_trial_level_features(raw_fixation_data, raw_ia_data, trial_groupby_columns, processed_data_path)

Compute trial-level features in parallel.

Parameters:

  raw_fixation_data (DataFrame | None): The raw fixation data. Required.
  raw_ia_data (DataFrame): The raw IA data. Required.
  trial_groupby_columns (list[str]): The columns to group by for trials. Required.
  processed_data_path (Path): The path to save the trial level feature names. Required.

Returns:

  pd.DataFrame: The computed trial-level features.

Source code in src/data/utils.py, lines 644-714
def compute_trial_level_features(
    raw_fixation_data: pd.DataFrame | None,
    raw_ia_data: pd.DataFrame,
    trial_groupby_columns: list[str],
    processed_data_path: Path,
) -> pd.DataFrame:
    """
    Compute trial-level features in parallel.

    Args:
        raw_fixation_data (pd.DataFrame | None): The raw fixation data.
        raw_ia_data (pd.DataFrame): The raw IA data.
        trial_groupby_columns (list[str]): The columns to group by for trials.
        processed_data_path (Path): The path to save the trial level feature names.

    Returns:
        pd.DataFrame: The computed trial-level features.
    """
    groupby_mappings = [
        (feature_name, list(raw_ia_data[feature_name].unique()))
        for feature_name in [
            'is_content_word',
            'ptb_pos',
            'entity_type',
            'universal_pos',
        ]
    ]
    logger.info(
        f'Computing trial level features for {raw_ia_data.shape[0]} trials with {groupby_mappings} groupby mappings'
    )
    ia_partial = partial(
        compute_ia_trial_level_features,
        processed_data_path=processed_data_path,
    )
    logger.info('This might take a couple of minutes, please be patient...')
    logger.info(
        f' Number of trial groups in ia: {len(raw_ia_data.groupby(trial_groupby_columns).groups)}'
    )
    ia_trial_features = raw_ia_data.groupby(trial_groupby_columns).apply(ia_partial)  # type: ignore
    ia_trial_features = pd.DataFrame(
        list(ia_trial_features), index=ia_trial_features.index
    ).fillna(0)

    if raw_fixation_data is not None:
        logger.info(
            f'Computing fixation trial level features for {raw_fixation_data.shape[0]} trials with {groupby_mappings} groupby mappings'
        )
        logger.info('This might take a couple of minutes, please be patient...')
        fixation_partial = partial(
            compute_fixation_trial_level_features,
            groupby_mappings=groupby_mappings,
            processed_data_path=processed_data_path,
        )
        logger.info(
            f'Number of trial groups in fix: {len(raw_fixation_data.groupby(trial_groupby_columns).groups)}'
        )
        logger.info('This might take a couple of minutes, please be patient...')
        fixation_trial_features = raw_fixation_data.groupby(
            trial_groupby_columns
        ).apply(fixation_partial)  # type: ignore
        fixation_trial_features = pd.DataFrame(
            list(fixation_trial_features), index=fixation_trial_features.index
        ).fillna(0)
        trial_level_features = pd.concat(
            [fixation_trial_features, ia_trial_features],
            axis=1,
        )
    else:
        trial_level_features = ia_trial_features

    return trial_level_features
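
compute_ia_trial_level_features and compute_fixation_trial_level_features each return a plain dict per trial, so the groupby-apply calls above produce a Series of dicts that is then expanded into a feature matrix. A minimal sketch of that pattern, with an invented stand-in for the per-trial function:

import pandas as pd

df = pd.DataFrame(
    {
        'participant': ['p1', 'p1', 'p2', 'p2'],
        'trial': [1, 1, 1, 1],
        'IA_DWELL_TIME': [200.0, 300.0, 150.0, 250.0],
    }
)

# Stand-in for the per-trial feature functions: any callable returning a dict per group.
def toy_trial_features(trial: pd.DataFrame) -> dict:
    return {'mean_TFD': trial['IA_DWELL_TIME'].mean(), 'n_words': len(trial)}

per_trial = df.groupby(['participant', 'trial']).apply(toy_trial_features)
# per_trial is a Series of dicts indexed by (participant, trial); expand it into
# a DataFrame with one column per feature, as compute_trial_level_features does.
trial_features = pd.DataFrame(list(per_trial), index=per_trial.index).fillna(0)
print(trial_features)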

get_feature_from_list(values, aggregation_function)

Creates a feature from a list of values (e.g. the mean or standard deviation of the values in the list).

Parameters:

  values (list[int | float | np.int32 | np.float64] | pd.Series): The list of values to aggregate. Required.
  aggregation_function (str): The name of the aggregation to apply: one of mean, std, median, skew, kurtosis, max, min. Required.

Returns:

  np.float64: The aggregated value, or np.nan if aggregation is not possible.

Source code in src/data/utils.py, lines 22-54
def get_feature_from_list(
    values: list[float | int | np.int32 | np.float64] | pd.Series,
    aggregation_function: str,
):
    """
    creates a feature for a list of values (e.g. mean or standard deviation of values in list)
    Args:
        values (list[int | float | np.int32 | np.float64]): list of values
        aggregation_function (str): name of function to be applied to list
    Returns:
        np.float64  | np.nan: aggregated value or np.nan if not possible
    """
    warnings.filterwarnings('ignore', category=RuntimeWarning)
    if np.sum(np.isnan(values)) == len(values):
        return np.nan
    if aggregation_function == 'mean':
        return np.nanmean(values)
    elif aggregation_function == 'std':
        return np.nanstd(values)
    elif aggregation_function == 'median':
        return np.nanmedian(values)
    elif aggregation_function == 'skew':
        not_nan_values = np.array(values)[~np.isnan(values)]
        return skew(not_nan_values)
    elif aggregation_function == 'kurtosis':
        not_nan_values = np.array(values)[~np.isnan(values)]
        return kurtosis(not_nan_values)
    elif aggregation_function == 'max':
        return np.nanmax(values)
    elif aggregation_function == 'min':
        return np.nanmin(values)
    else:
        return np.nan
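
A short usage sketch (assuming the function is imported from src.data.utils). NaNs are ignored by the nan-aware aggregations; an all-NaN input or an unknown aggregation name yields np.nan:

import numpy as np

from src.data.utils import get_feature_from_list  # import path assumed

values = [120.0, np.nan, 180.0, 240.0]

print(get_feature_from_list(values, 'mean'))             # 180.0 (the NaN is ignored)
print(get_feature_from_list(values, 'max'))              # 240.0
print(get_feature_from_list(values, 'not_a_function'))   # nan (unknown aggregation)
print(get_feature_from_list([np.nan, np.nan], 'mean'))   # nan (all values missing)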

get_gaze_entropy_features(x_means, y_means, x_dim=2560, y_dim=1440, patch_size=138)

Compute gaze entropy features.

Parameters:

  x_means (ndarray): The x-coordinates of fixations. Required.
  y_means (ndarray): The y-coordinates of fixations. Required.
  x_dim (int): The screen horizontal pixels. Defaults to 2560.
  y_dim (int): The screen vertical pixels. Defaults to 1440.
  patch_size (int): The size of patches to use. Defaults to 138.

Returns:

  dict[str, int | float | np.float64]: The gaze entropy features.

Source code in src/data/utils.py, lines 328-412
def get_gaze_entropy_features(
    x_means: np.ndarray,
    y_means: np.ndarray,
    x_dim: int = 2560,
    y_dim: int = 1440,
    patch_size: int = 138,
) -> dict[str, int | float | np.float64]:
    """
    Compute gaze entropy features.

    Args:
        x_means (np.ndarray): The x-coordinates of fixations.
        y_means (np.ndarray): The y-coordinates of fixations.
        x_dim (int, optional): The screen horizontal pixels. Defaults to 2560.
        y_dim (int, optional): The screen vertical pixels. Defaults to 1440.
        patch_size (int, optional): The size of patches to use. Defaults to 138.

    Returns:
        dict[str, int | float | np.float64]: The gaze entropy features.
    """

    # Gaze entropy measures detect alcohol-induced driver impairment - ScienceDirect
    # https://www.sciencedirect.com/science/article/abs/pii/S0376871619302789
    # computes the gaze entropy features
    # params:
    #    x_means: x-coordinates of fixations
    #    y_means: y coordinates of fixations
    #    x_dim: screen horizontal pixels
    #    y_dim: screen vertical pixels
    #    patch_size: size of patches to use
    # Based on https://github.com/aeye-lab/etra-reading-comprehension
    def calc_patch(patch_size: int, mean: np.float64 | np.int64) -> int:
        return int(np.floor(mean / patch_size))

    def entropy(value: float) -> float:
        return value * (np.log(value) / np.log(2))

    # dictionary of visited patches
    patch_dict = {}
    # dictionary for patch transitions
    trans_dict = defaultdict(list)
    pre = None
    for i in range(len(x_means)):
        x_mean = x_means[i]
        y_mean = y_means[i]
        patch_x = calc_patch(patch_size, x_mean)
        patch_y = calc_patch(patch_size, y_mean)
        cur_point = f'{str(patch_x)}_{str(patch_y)}'
        if cur_point not in patch_dict:
            patch_dict[cur_point] = 0
        patch_dict[cur_point] += 1
        if pre is not None:
            trans_dict[pre].append(cur_point)
        pre = cur_point

    # stationary gaze entropy
    # SGE
    sge = 0.0
    x_max = int(x_dim / patch_size)
    y_max = int(y_dim / patch_size)
    fix_number = len(x_means)
    for i in range(x_max):
        for j in range(y_max):
            cur_point = f'{str(i)}_{str(j)}'
            if cur_point in patch_dict:
                cur_prop = patch_dict[cur_point] / fix_number
                sge += entropy(cur_prop)
    sge = sge * -1

    # gaze transition entropy
    # GTE
    gte = 0.0
    for patch in trans_dict:
        cur_patch_prop = patch_dict[patch] / fix_number
        cur_destination_list = trans_dict[patch]
        (values, counts) = np.unique(cur_destination_list, return_counts=True)
        inner_sum = 0.0
        for i in range(len(values)):
            cur_count = counts[i]
            cur_prob = cur_count / np.sum(counts)
            cur_entropy = entropy(cur_prob)
            inner_sum += cur_entropy
        gte += cur_patch_prop * inner_sum
    gte *= -1
    return {'fixation_feature_SGE': sge, 'fixation_feature_GTE': gte}
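
A short usage sketch (assuming the function is imported from src.data.utils; the fixation coordinates are invented and the default 2560x1440 screen with 138-pixel patches is used). Each fixation is mapped to a grid cell via floor(coordinate / patch_size); SGE measures how spread out the visited cells are and GTE how unpredictable the cell-to-cell transitions are:

import numpy as np

from src.data.utils import get_gaze_entropy_features  # import path assumed

x = np.array([100.0, 260.0, 420.0, 430.0, 120.0])
y = np.array([200.0, 210.0, 220.0, 500.0, 200.0])

features = get_gaze_entropy_features(x_means=x, y_means=y)
print(features)  # {'fixation_feature_SGE': ..., 'fixation_feature_GTE': ...}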

load_fold_data(fold_index, base_path, folds_folder_name, data_type, regime_name, set_name)

Load data for a specific fold, data type, regime, and set.

This method reads a Feather file containing the data for the specified fold index, data type, regime name, and set name.

Parameters:

  fold_index (int): The index of the fold to load data for. Required.
  base_path (Path): The base path where the data is stored. Required.
  folds_folder_name (str): The name of the folder containing the folds. Required.
  data_type (DataType): The type of data to load (e.g., fixations or IA). Required.
  regime_name (SetNames): The name of the regime (e.g., validation, training, etc.). Required.
  set_name (SetNames): The name of the set (e.g., train, test, etc.). Required.

Returns:

  pd.DataFrame: A DataFrame containing the loaded data.

Note

The file path is currently hardcoded to 'data/OneStop/folds'. This should be replaced with a general path when a connection to the server is available.

Source code in src/data/utils.py, lines 57-95
def load_fold_data(
    fold_index: int,
    base_path: Path,
    folds_folder_name: str,
    data_type: DataType,
    regime_name: SetNames,
    set_name: SetNames,
) -> pd.DataFrame:
    """
    Load data for a specific fold, data type, regime, and set.

    This method reads a Feather file containing the data for the specified
    fold index, data type, regime name, and set name.

    Args:
        fold_index (int): The index of the fold to load data for.
        base_path (Path): The base path where the data is stored.
        folds_folder_name (str): The name of the folder containing the folds.
        data_type (DataType): The type of data to load (e.g., fixations or IA).
        regime_name (SetNames): The name of the regime (e.g., validation, training, etc.).
        set_name (SetNames): The name of the set (e.g., train, test, etc.).

    Returns:
        pd.DataFrame: A DataFrame containing the loaded data.

    Note:
        The file path is currently hardcoded to 'data/OneStop/folds'. This should
        be replaced with a general path when a connection to the server is available.
    """
    df = pd.read_feather(
        base_path
        / folds_folder_name
        / f'fold_{fold_index}'
        / f'{data_type}_{set_name}_{regime_name}.feather'
    )
    for should_be_bool in ['total_skip', 'start_of_line', 'end_of_line']:
        if should_be_bool in df.columns:
            df[should_be_bool] = df[should_be_bool].astype(bool)
    return df
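
A hedged usage sketch: the import paths, the 'data/OneStop'/'folds' layout (taken from the note above), and the SetNames members below are assumptions; only DataType.FIXATIONS and DataType.IA appear elsewhere in this module:

from pathlib import Path

from src.data.utils import load_fold_data     # import path assumed
from src.constants import DataType, SetNames  # module location assumed

# Reads data/OneStop/folds/fold_0/{data_type}_{set_name}_{regime_name}.feather
df = load_fold_data(
    fold_index=0,
    base_path=Path('data/OneStop'),
    folds_folder_name='folds',
    data_type=DataType.IA,
    regime_name=SetNames.TRAIN,  # hypothetical enum member
    set_name=SetNames.TRAIN,     # hypothetical enum member
)
print(df.shape)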

save_feature_names_if_do_not_exist(features_dict, csv_path, mode)

Save feature names to a CSV file if they do not already exist.

Source code in src/data/utils.py, lines 624-641
def save_feature_names_if_do_not_exist(
    features_dict, csv_path: Path, mode: DataType
) -> None:
    """
    Save feature names to a CSV file if they do not already exist.
    """
    global_field_name = f'{mode}_TRIAL_LEVEL_FEATURE_KEYS_SAVED'
    if global_field_name not in globals():
        feature_rows = []
        for feature_type, feature_dict in features_dict.items():
            for feature_name in feature_dict.keys():
                feature_rows.append(
                    {'feature_name': feature_name, 'feature_type': feature_type}
                )
        csv_path.parent.mkdir(parents=True, exist_ok=True)
        pd.DataFrame(feature_rows).to_csv(csv_path, index=False)
        logger.info(f'Saved feature names to {csv_path}')
        globals()[global_field_name] = True
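
A minimal sketch of the resulting CSV (import paths and the feature names are invented): one row per feature, recording the feature_name and the model-family feature_type it belongs to:

from pathlib import Path

from src.data.utils import save_feature_names_if_do_not_exist  # import path assumed
from src.constants import DataType                              # module location assumed

features_dict = {
    'RF': {'ia_feature_mean_IA_DWELL_TIME': 210.0},
    'LOGISTIC': {'mean_TFD': 210.0},
}
save_feature_names_if_do_not_exist(
    features_dict=features_dict,
    csv_path=Path('processed') / 'ia_trial_level_feature_keys.csv',
    mode=DataType.IA,
)
# processed/ia_trial_level_feature_keys.csv now contains:
# feature_name,feature_type
# ia_feature_mean_IA_DWELL_TIME,RF
# mean_TFD,LOGISTIC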