Hi both,
That's an interesting question. We discussed normalisation in this paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8648261/, which is relevant here. Indeed, normalising the stimulus features and/or the neural signal will change the magnitude of the TRF. Whatever normalisation approach is used (or not used), you would think that comparing TRFs between conditions is straightforward in principle. We can compare different TRFs within the same data (e.g., attended vs. unattended speech envelope), or TRFs fit on different portions of the data. The former is easier because the TRFs are built on the same neural data. The latter needs a bit more discussion.
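To make the point about normalisation concrete, here is a minimal sketch on simulated data (not the toolbox code). It assumes a plain least-squares TRF, a white-noise "stimulus", and a made-up Gaussian kernel; all variable names are hypothetical. Rescaling the stimulus by a factor divides the weights by that factor, and rescaling the EEG multiplies them.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 2000
lags = np.arange(0, 10)  # lags in samples (assumed, for illustration)

# Simulated stimulus feature (white noise stands in for e.g. an envelope).
stim = rng.standard_normal(n_samples)

# Lagged design matrix. np.roll wraps around at the edges, which is
# acceptable for a toy example only.
X = np.column_stack([np.roll(stim, lag) for lag in lags])

# Made-up "true" TRF and simulated single-channel EEG with additive noise.
true_w = np.exp(-0.5 * ((lags - 4) / 1.5) ** 2)
eeg = X @ true_w + rng.standard_normal(n_samples)


def fit_ols(design, response):
    """Ordinary least-squares TRF (no regularisation)."""
    return np.linalg.lstsq(design, response, rcond=None)[0]


w_raw = fit_ols(X, eeg)
w_stim_scaled = fit_ols(10.0 * X, eeg)  # stimulus multiplied by 10
w_eeg_scaled = fit_ols(X, 10.0 * eeg)   # EEG multiplied by 10

print("stimulus x10 -> weights /10:", np.allclose(w_stim_scaled, w_raw / 10.0))
print("EEG x10 -> weights x10:", np.allclose(w_eeg_scaled, 10.0 * w_raw))
```

Note that with a fixed ridge parameter the stimulus-scaling relationship is only approximate, because the regularisation term does not rescale with the data.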
First of all, the TRF weights are the regression weights used for predicting unseen EEG. Large weights (positive or negative) indicate latencies that are important for that prediction. As Jade mentioned, there are similarities with ERPs. For example, we would expect a "strong and robust TRF component" to have larger weights than a "baseline". That baseline can be calculated TRF-style with some type of randomisation (shuffling trials or feature values), but you could also compare the component (ERP-style) with the "pre-stimulus" lags, excluding the edge artifact, which can emerge for one or two lags at both ends of the lag window.
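The two baselines described above could be sketched roughly as follows. This is a toy simulation, not the toolbox implementation: the ridge solver, the white-noise stimulus, the made-up kernel, the 95th-percentile threshold, and all function names are assumptions for illustration. Shuffling feature values is only sensible here because the simulated stimulus has no temporal structure; for real stimuli, shuffling trials is the safer choice.

```python
import numpy as np


def lag_matrix(stim, lags):
    """Design matrix with X[t, j] = stim[t - lags[j]], zero-padded at the edges."""
    n = len(stim)
    X = np.zeros((n, len(lags)))
    for j, lag in enumerate(lags):
        if lag >= 0:
            X[lag:, j] = stim[:n - lag]
        else:
            X[:n + lag, j] = stim[-lag:]
    return X


def fit_trf(stim, eeg, lags, lam=1.0):
    """Ridge-regression TRF for one feature and one channel (lam is arbitrary here)."""
    X = lag_matrix(stim, lags)
    return np.linalg.solve(X.T @ X + lam * np.eye(len(lags)), X.T @ eeg)


rng = np.random.default_rng(0)
n_samples = 5000
lags = np.arange(-5, 21)  # negative lags play the role of "pre-stimulus"

# Simulated data: the made-up true TRF is non-zero only at lags 5 to 7.
stim = rng.standard_normal(n_samples)
true_w = np.zeros(len(lags))
true_w[lags == 5] = 0.5
true_w[lags == 6] = 1.0
true_w[lags == 7] = 0.5
eeg = lag_matrix(stim, lags) @ true_w + rng.standard_normal(n_samples)

w = fit_trf(stim, eeg, lags)

# TRF-style baseline: refit after shuffling the feature values and keep the
# largest absolute weight of each null TRF.
n_perm = 200
null_max = np.empty(n_perm)
for i in range(n_perm):
    w_null = fit_trf(rng.permutation(stim), eeg, lags)
    null_max[i] = np.abs(w_null).max()
threshold = np.percentile(null_max, 95)

# ERP-style baseline: pre-stimulus lags, dropping the outermost lag to stay
# clear of possible edge effects.
prestim = (lags < 0) & (lags > lags.min())
prestim_max = np.abs(w[prestim]).max()

peak_lag = lags[np.argmax(np.abs(w))]
print("peak lag:", peak_lag)
print("peak weight:", np.abs(w).max())
print("shuffle threshold:", threshold)
print("largest pre-stimulus weight:", prestim_max)
```

In this simulation the peak should sit well above both baselines, while the pre-stimulus weights should stay below the shuffle threshold.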
What could cause different magnitudes between conditions? The issues are somewhat similar to those with ERPs. You are comparing TRFs for two different portions of data. So, assuming that the stimulus features have the same statistics, the only difference is the neural data itself. If you are working with large portions of clean adult data, then there are probably no issues in performing that comparison. The "pre-stimulus" and other baselines will be similar for the two conditions, making it possible to compare the magnitude of the interesting parts of the TRFs. However, consider a small, noisy dataset. In that case, one condition might be noisier than the other, or both conditions might be very noisy. TRFs for different subjects could then have very different magnitudes, which means that their (non-normalised) average would primarily reflect the subjects with the largest magnitudes. Moreover, a TRF can have very large weights but a very low prediction correlation, in which case the average would not be particularly meaningful.
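A toy illustration of the averaging problem, with entirely made-up numbers: four subjects share one TRF shape, and a fifth subject has a different shape with a 20 times larger magnitude. Dividing each TRF by its own L2 norm before averaging is just one possible normalisation, chosen here as an assumption for the sketch.

```python
import numpy as np

lags = np.arange(12)  # lags in samples (hypothetical)


def bump(center, width=1.0):
    """Gaussian-shaped toy TRF component centred on a given lag."""
    return np.exp(-0.5 * ((lags - center) / width) ** 2)


# Four subjects with a component at lag 3, one subject with a component at
# lag 8 and a 20 times larger magnitude (e.g. a noisy recording).
trfs = np.stack([bump(3)] * 4 + [20.0 * bump(8)])

# Non-normalised average: dominated by the high-magnitude subject.
raw_mean = trfs.mean(axis=0)

# Average after scaling each subject's TRF to unit L2 norm.
norms = np.linalg.norm(trfs, axis=1, keepdims=True)
normalised_mean = (trfs / norms).mean(axis=0)

print("peak lag, raw average:", lags[np.argmax(raw_mean)])
print("peak lag, normalised average:", lags[np.argmax(normalised_mean)])
```

The raw average peaks at lag 8, the single outlier's component, whereas the normalised average peaks at lag 3, the component shared by four of the five subjects. Normalising does not tell you whether the large-magnitude TRF actually predicts well, so checking prediction correlations per subject is still needed.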
In brief, I think that the TRF weights are comparable in the ideal case with lots of clean data. However, you would first need to ensure that the baseline (e.g., pre-stimulus power or shuffled TRF) is similar for the two conditions. The important thing is that you explain in detail (e.g., in the paper) whatever you do. I would still look at TRF components within condition first, which is much easier and very informative.
Cheers,
Giovanni