I don’t have much experience with audio processing, so my advice here will unfortunately be limited.
As I understand it, there are several preprocessing steps you can apply to reduce the dimensionality of the input signals. Examples include histograms, spectrograms, or MFCCs (mel-frequency cepstral coefficients). You would then do the classification on the processed features rather than on the raw audio.
I recommend checking out @tbekolay’s PhD thesis. He discusses these approaches in it.
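
To make the spectrogram idea a bit more concrete, here is a rough sketch using only NumPy. It is not taken from the thesis, and the function names (`log_spectrogram`, `pool_bands`), the frame length, the hop size, and the number of bands are all my own assumptions for illustration. It slices the signal into overlapping windowed frames, takes the log magnitude of each frame's FFT, and then averages neighbouring frequency bins into a few coarse bands so that each frame becomes a short feature vector.

```python
import numpy as np


def log_spectrogram(signal, frame_len=256, hop=128):
    """Return an (n_frames, frame_len // 2 + 1) array of log-magnitude spectra.

    frame_len and hop are illustrative defaults, not recommended values.
    """
    signal = np.asarray(signal, dtype=float)
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)  # taper each frame to reduce spectral leakage
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    return np.log(magnitude + 1e-10)  # small offset avoids log(0)


def pool_bands(spec, n_bands=16):
    """Average neighbouring frequency bins into n_bands coarse bands."""
    bands = np.array_split(spec, n_bands, axis=1)
    return np.stack([band.mean(axis=1) for band in bands], axis=1)


# Hypothetical test input: one second of a 1 kHz tone sampled at 8 kHz.
fs = 8000
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 1000 * t)

spec = log_spectrogram(tone)   # one row per frame, 129 frequency bins each
features = pool_bands(spec)    # one row per frame, 16 band values each
print(spec.shape, features.shape)
```

Each row of `features` could then be passed to whatever classifier you are using. MFCCs go a step further by applying a mel-spaced filterbank and a discrete cosine transform on top of a spectrum like this one, so an existing audio library is probably a better choice than hand-rolling those.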