You are absolutely right. There are many approaches for calculating the envelope, and they lead to slightly different results. Different studies may also use different approaches, so it is important to take that choice into account when comparing results. Both approaches you mentioned are valid, and there are others too. The Hilbert envelope is based on the Hilbert transform, while mTRFenvelope implements the approach from Lalor & Foxe 2010 (which is based on a rescaling of the power of the signal). Another approach I have used consists of computing the spectrogram and summing the envelopes across all frequency bands to get a single broadband envelope. This last method, for example, may better preserve higher audio frequencies, which may be less present in the Hilbert envelope. As a result, the TRF may differ a bit between envelopes, depending on how important higher sound frequencies are in the particular dataset.
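To make the difference between two of these approaches concrete, here is a minimal Python sketch (NumPy and SciPy), not the mTRF-Toolbox code. The function names, the band edges, and the test signal are all assumptions chosen for illustration, and the band-pass filterbank is only a simple stand-in for a spectrogram. It does not reproduce mTRFenvelope.

```python
import numpy as np
from scipy.signal import hilbert, butter, sosfiltfilt


def hilbert_envelope(x):
    """Broadband envelope: magnitude of the analytic signal."""
    return np.abs(hilbert(x))


def summed_band_envelope(x, fs, edges=(100, 400, 1000, 2500, 6000)):
    """Split x into adjacent bands, take each band's Hilbert envelope, and sum.

    The band edges (in Hz) are illustrative assumptions, not a standard.
    """
    env = np.zeros(len(x))
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Zero-phase Butterworth band-pass filter for this band
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        env += np.abs(hilbert(sosfiltfilt(sos, x)))
    return env


# Toy stimulus: a 500 Hz tone whose amplitude is modulated at 4 Hz
fs = 16000
t = np.arange(fs) / fs
modulator = 1.0 + 0.5 * np.sin(2 * np.pi * 4 * t)
audio = modulator * np.sin(2 * np.pi * 500 * t)

env_h = hilbert_envelope(audio)
env_s = summed_band_envelope(audio, fs)
```

On a simple tone like this, both envelopes should follow the slow modulation closely. On real speech they can diverge more, which is where the choice starts to matter. In practice the envelope would also be downsampled to the EEG sampling rate before fitting a TRF.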
So, there is no easy answer. It really depends on what you want the envelope to capture. Indeed, you could use more detailed representations, such as the spectrogram (e.g., Di Liberto et al., Curr Bio, 2015), but that involves multiple stimulus dimensions, which may be good when using forward models but can be problematic with backward models. I generally use the Hilbert envelope and, depending on the target of my investigation, I may also check at the end that my results hold when using other envelopes (which is usually the case).
In my view, it's not a matter of determining which procedure leads to the "best envelope". Instead, we should be thinking about which procedure leads to the envelope that is most interesting for our investigation (while keeping in mind that there exist multiple approaches).
I hope this helps!