I can help you with this, but let me first say that this is not a trivial project (I know this from experience having done exactly what you propose).
You have some decisions to make right off the bat.
1. Is the time of the output transition infinitely variable? If so, then what you're basically doing is comparing the audio with a triangle waveform and sending the comparator output to the high-current output stage. If not, then you're updating the output at some fixed frequency (say 250 kHz) by sending the output stage a bitstream (at say 250 kbits/sec), with the bitstream calculated by digital logic such as a sigma-delta modulator in an FPGA.
2. Is the output bridged? If so, then you can output either 00, 01, 10 or 11. These cause the speaker to see a voltage of 0, -1, +1 or 0 respectively. To play a slower waveform such as a 50 Hz drumbeat, what you'll be doing is mixing the 0 and +1 in the correct proportion for the positive half of the cycle (10 ms) and mixing the 0 and -1 in the correct proportion for the negative half of the cycle (10 ms). If the output is not bridged, then you can output either 0 or 1. These cause the speaker to see a voltage of -1 or +1 respectively. For silence you'll be playing -1 and +1 in equal proportions, and for music you'll be varying those proportions. The non-bridged arrangement will require a huge and bulky output filter and will waste power, because the output is switching at full amplitude even during silence.
3. Are you going to construct the waveforms yourself or are you going to use a commercial IC? Frankly, I would recommend using a commercial IC unless you are prepared to do a fair bit of development. There are loads of class D ICs available, largely because they are used in laptops. Some of them probably let you add your own output stage. In that case you'll have to worry about issues like dead time and so forth.
4. Will the output(s) be driven symmetrically? That is to say, will you have a transistor that connects the speaker to +V and another transistor that connects the speaker to -V? That's probably the sensible way to do it (alternatively, you could build it like a class A output which basically has a transistor to pull the output down and a resistor or inductor to pull it up). How will you generate gate drive signals and the dead time?
5. Will you use bipolar transistors or FETs? Will you use complementary types... so PNP to pull the output up and NPN to pull the output down in the bipolar case, or a P-channel FET to pull the output up and an N-channel FET to pull the output down in the FET case... or will you use only N-channel FETs with a floating driver? How will you drive the high-side transistor or FET? Will you bootstrap it with a capacitor, or use an opto-isolator or a transformer?
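To make points 1 and 2 a bit more concrete, here is a rough Python sketch of the two ways of generating the switching signal (comparator against a triangle, and a simple first-order sigma-delta loop), plus a three-level variant for a bridged output. This is only a numerical illustration under my own assumptions: the function names, the 0.5 quantiser thresholds and the use of -1..+1 for full scale are all made up for the example, and real FPGA logic would look quite different.

```python
def triangle(t, f_carrier):
    """Triangle carrier in the range -1..+1 at f_carrier Hz (illustrative)."""
    phase = (t * f_carrier) % 1.0
    return 4.0 * phase - 1.0 if phase < 0.5 else 3.0 - 4.0 * phase


def comparator_pwm(audio, t, f_carrier):
    """Point 1, infinitely variable timing: compare audio with the triangle."""
    return 1 if audio > triangle(t, f_carrier) else -1


def two_level_bitstream(samples):
    """Point 1, fixed update rate: first-order sigma-delta, one bit per tick."""
    integrator = 0.0
    feedback = 0.0  # speaker voltage produced by the previous bit
    bits = []
    for x in samples:
        integrator += x - feedback  # accumulate the error so far
        bit = 1 if integrator >= 0.0 else 0
        feedback = 1.0 if bit else -1.0
        bits.append(bit)
    return bits


def speaker_two_level(bit):
    """Not bridged: 0 and 1 give the speaker -1 and +1."""
    return 1 if bit else -1


# Bridged: the two half-bridge states and the voltage the speaker sees.
BRIDGE_VOLTAGE = {"00": 0, "01": -1, "10": 1, "11": 0}


def bridged_states(samples):
    """Point 2, bridged: same loop with a three-level quantiser.

    The 0.5 thresholds are an assumption chosen for this sketch.
    """
    integrator = 0.0
    feedback = 0.0
    states = []
    for x in samples:
        integrator += x - feedback
        if integrator > 0.5:
            state = "10"
        elif integrator < -0.5:
            state = "01"
        else:
            state = "00"
        feedback = float(BRIDGE_VOLTAGE[state])
        states.append(state)
    return states


if __name__ == "__main__":
    level = [0.25] * 4000  # a steady input at a quarter of full scale
    two = [speaker_two_level(b) for b in two_level_bitstream(level)]
    three = [BRIDGE_VOLTAGE[s] for s in bridged_states(level)]
    print("two-level average:  ", sum(two) / len(two))
    print("three-level average:", sum(three) / len(three))
    print("three-level states used:", sorted(set(bridged_states(level))))
```

In both cases the average of the output tracks the input, which is all the output filter needs. The difference is that the two-level version is flipping between -1 and +1 all the time, while the bridged version spends most of its time at 0 for a small signal.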
Let us have some decisions on the above matters and I am sure that we can help you further.
How are you on op-amp topologies? If you know how to build an inverting amplifier with a virtual earth... then you know how to build an integrator as well. (An integrator has only a capacitor in the feedback path and no resistor.) So what you do is create a square wave and then integrate it to get your triangle wave. Put the triangle wave into a Schmitt trigger and feed the Schmitt trigger output back as the square wave to be integrated... and then you get a triangle wave that increases linearly to the Schmitt high threshold, then decreases linearly to the low threshold, and repeats.
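If it helps to see the loop in action before building it, here is a small Python simulation of the integrator plus Schmitt trigger arrangement. It is an idealised sketch, not a circuit design: the function name, the +/-1 thresholds, the RC value and the time step are assumptions picked to give roughly 250 kHz, and the sign inversion of a real op-amp integrator is simply folded into the feedback direction.

```python
def triangle_oscillator(steps, dt, rc, v_high=1.0, v_low=-1.0, v_sat=1.0):
    """Simulate an ideal integrator driven by a Schmitt trigger.

    Returns a list of (triangle, square) pairs, one per time step.
    All component values here are illustrative assumptions.
    """
    square = v_sat   # Schmitt trigger output, starts high
    tri = 0.0        # integrator output, starts at mid-scale
    trace = []
    for _ in range(steps):
        # Integrator: ramps linearly at a rate set by the square wave.
        tri += square * dt / rc
        # Schmitt trigger: only changes state at the two thresholds.
        if tri >= v_high:
            square = -v_sat
        elif tri <= v_low:
            square = v_sat
        trace.append((tri, square))
    return trace


if __name__ == "__main__":
    # One period is 2 * (v_high - v_low) * rc / v_sat, so rc = 1 us
    # gives a 4 us period, i.e. 250 kHz with the default thresholds.
    trace = triangle_oscillator(steps=20000, dt=1e-9, rc=1e-6)
    tri = [point[0] for point in trace]
    print("triangle min/max:", min(tri), max(tri))
```

The thing to notice is that the amplitude is set by the Schmitt thresholds and the frequency by the RC time constant, so you can adjust the two independently.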
cheers, Nick