Electrolaryngeal (EL) speech is artificial speech produced with an electrolarynx, which allows laryngectomees to communicate without vocal fold vibration. Compared with normal speech, EL speech lacks essential phonetic features and differs in temporal structure, resulting in degraded naturalness, speaker identity, and intelligibility. Sequence-to-sequence (seq2seq) voice conversion (VC) has emerged as a promising technique for overcoming the challenges of EL-speech-to-normal-speech conversion (EL2SP). Nonetheless, most VC studies on EL2SP focus on converting clean EL speech, overlooking real-world scenarios where EL speech is corrupted by background noise and reverberation. To address this, we propose a novel seq2seq VC-based training method. In contrast to approaches that rely on extra modules to handle such interference, our method requires only a single framework. First, we pretrain a normal-to-normal seq2seq VC model adapted from a text-to-speech model. Then, we fine-tune it in two stages in a many-to-one fashion, leveraging pseudo noisy-reverberant EL speech data generated from limited clean data. We evaluated several system designs of our method and analyzed their intermediate representations to understand their role in filtering out the interference. Comparative experiments demonstrated that our method significantly outperforms EL2SP baselines, handling both clean and noisy-reverberant EL speech within a single model, which sheds light on possible directions for further improvement.
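As a rough illustration of the pseudo-data generation step mentioned above, the sketch below simulates a noisy-reverberant recording from clean EL speech by convolving it with a room impulse response (RIR) and mixing in background noise at a target SNR. This is a minimal sketch of a standard augmentation recipe, assuming NumPy/SciPy; the function name and parameters are illustrative, not details from the paper.

```python
import numpy as np
from scipy.signal import fftconvolve

def make_noisy_reverberant(clean, rir, noise, target_snr_db=10.0):
    """Simulate a noisy-reverberant recording from a clean EL waveform.

    clean, rir, noise: 1-D float arrays at the same sampling rate.
    target_snr_db: desired signal-to-noise ratio of the mixture.
    (Illustrative helper, not the paper's actual pipeline.)
    """
    # Reverberation: convolve the clean signal with the RIR,
    # truncated back to the original length.
    reverberant = fftconvolve(clean, rir)[: len(clean)]

    # Tile or trim the noise to match the signal length.
    if len(noise) < len(reverberant):
        reps = int(np.ceil(len(reverberant) / len(noise)))
        noise = np.tile(noise, reps)
    noise = noise[: len(reverberant)]

    # Scale the noise so the mixture reaches the target SNR.
    signal_power = np.mean(reverberant ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(signal_power / (noise_power * 10 ** (target_snr_db / 10)))
    return reverberant + scale * noise
```

Applied over a range of RIRs, noise types, and SNRs, such a recipe can expand a limited set of clean EL recordings into the pseudo noisy-reverberant training data used for fine-tuning.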