Supervised learning with multiple inputs


How would you train a system that takes in two inputs, performs some unknown function on them, and produces an output?

As an example, imagine you have a system described like this:

which takes x1 and x2 and produces y. The dynamics are unknown, but we have trials of data for x1, x2, and y.

How would you construct an SNN to do supervised learning of such systems?
I was doing something like this:

and also this:

but neither of the models seems to work.
Is this the correct way to use SNN for supervised learning?

All the examples I have seen for supervised learning have only one input and one output, which makes it simple to train the connection.


Hi @corricvale,

Constructing an SNN for supervised learning is kind of a black art. There is no real “correct” way to approach things, and the architecture of your network depends a lot on the problem you are trying to solve and the dynamics you are trying to learn. For example, if you know that the inputs are independent of each other (e.g., changing one input always has the same effect on the output, regardless of what the other input is), then a network similar to your second network would be the better approach. However, if you know that the inputs are dependent, or you don’t know what the dependency is, the first network will probably be what you want.

There are other factors that affect the success of your network as well: the number of neurons you are using in the network, the learning rate, and the synaptic time constants all play a part.

The Nengo examples page does have one example that trains a network using 2 inputs, and that is the “learning a product” example. In this example, the network is being trained to compute the product of two inputs, i.e., $x1 \times x2$. It is important to note that in this network, the “input” ensemble is 2-dimensional, and the individual inputs ($x1$ and $x2$) are projected to different dimensions of this ensemble. This is done so that the neurons in the input ensemble are able to represent information from both $x1$ and $x2$ (because the 2D tuning curves project parts of $x1$ and $x2$ into the neuron activity), and in this way the learning rule is able to modulate the decoders in such a way as to learn the two-dimensional product function.

If the input ensemble were one-dimensional, and both $x1$ and $x2$ were projected onto that single dimension, then the network would be trying to learn a function of $(x1 + x2)$, which is very hard to do.