I am trying to estimate the 3D pose of a person observed with a single camera and 5 worn IMUs (at the limb extremities and the upper back). The camera frames are converted to shape-based feature vectors, and each IMU provides a 4D quaternion representing its orientation.
I have recovered the 3D pose using each modality by learning a mapping from the input feature space to the output pose space. Now I wish to obtain better results by combining both modalities in some way through sensor fusion.
I have tried early fusion (concatenating the feature vectors of the two modalities and learning a single mapping) and late fusion (a weighted average of the two modalities' predicted poses). Both are very simple approaches and yielded only marginal improvements on average.
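For concreteness, the two baselines I tried can be sketched as follows. This is a minimal illustration with a plain least-squares regressor standing in for whatever learner is actually used, and all dimensions and data are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (assumptions, not from my actual setup):
# camera shape features, 5 IMUs x 4D quaternions, 3D pose of 15 joints.
n, d_cam, d_imu, d_pose = 200, 32, 5 * 4, 3 * 15

X_cam = rng.standard_normal((n, d_cam))   # camera shape features
X_imu = rng.standard_normal((n, d_imu))   # stacked IMU quaternions
Y = rng.standard_normal((n, d_pose))      # ground-truth 3D poses

def fit_linear(X, Y):
    """Least-squares fit: a stand-in for any learned input->pose mapping."""
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W

# Early fusion: append the feature vectors, learn one joint mapping.
X_early = np.concatenate([X_cam, X_imu], axis=1)
pose_early = X_early @ fit_linear(X_early, Y)

# Late fusion: learn one mapping per modality, weighted-average the outputs.
W_cam, W_imu = fit_linear(X_cam, Y), fit_linear(X_imu, Y)
alpha = 0.5  # fusion weight, a tunable hyperparameter
pose_late = alpha * (X_cam @ W_cam) + (1 - alpha) * (X_imu @ W_imu)

print(pose_early.shape, pose_late.shape)  # both (200, 45)
```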
What other approaches can I try to combine these two incommensurate data sources?
Is there any preprocessing on the features that should be done?
Note: My preference is to continue using a learning-based approach if possible (i.e., I do not want to explicitly model the physics, kinematics, etc.).