Why German speech recognition is its own thing
German has a few structural features that make speech recognition more demanding than for English or Spanish. Compound nouns can be formed almost without limit. Separable verbs send their prefix to the far end of the clause. Every noun is capitalised, not just proper nouns. And the regional varieties (Hochdeutsch, Schweizerdeutsch, Austrian German, local dialects) span enough phonetic and grammatical variation that a single "German model" has to cover a wide range.
Compound nouns and where the spaces go
German freely combines nouns into new compound words. "Donaudampfschiffahrtsgesellschaftskapitänsmütze" (the cap of the captain of the Danube steamship company) is written as a single word, with no spaces. The transcript has to get this right, because writing "Donau Dampf Schiffahrts Gesellschafts Kapitäns Mütze" as separate words turns one noun into a string of fragments and breaks the meaning. Whisper learns from its training data which sequences are conventionally written as one word.
For most everyday compounds (Krankenhaus, Lebensversicherung, Bundeskanzlerin), this works smoothly. For rare or technical compounds (industry jargon, legal terminology, scientific terms), Whisper may split where a human would join, or join where a human would split. You can correct those edge cases in the inline editor.
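If the same technical compounds keep coming back split, a small glossary pass over the exported text can rejoin them before you open the editor. The following is a minimal sketch, not part of Whisper or of the editor: the glossary entries and the function name join_compounds are hypothetical, and it only covers the "split where a human would join" case.

```python
import re

# Hypothetical glossary: each entry lists the parts of a compound that
# should appear as one word in the final transcript.
GLOSSARY = [
    ["Rechtsschutz", "versicherung"],
    ["Wärme", "pumpen", "anlage"],
]


def _compile(parts):
    # The parts must be separated by spaces or hyphens, so compounds that are
    # already joined are left alone. The trailing (\w*) keeps endings such as
    # the plural -en attached to the joined word.
    body = r"[\s-]+".join(re.escape(p) for p in parts)
    return re.compile(r"\b" + body + r"(\w*)", re.IGNORECASE)


def join_compounds(text, glossary=GLOSSARY):
    """Rejoin glossary compounds that were transcribed as separate words."""
    for parts in glossary:
        # First part keeps its capital (it is a noun); later parts are lowercased.
        joined = parts[0] + "".join(p.lower() for p in parts[1:])
        text = _compile(parts).sub(lambda m, j=joined: j + m.group(1), text)
    return text


print(join_compounds("Die Rechtsschutz Versicherung zahlt zwei Wärme Pumpen Anlagen."))
```

The pass is deliberately conservative: it only touches sequences you have listed, so it cannot invent compounds. The opposite error, a compound joined where a human would split it, still needs a manual fix.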
Separable verbs and their split positions
Many common German verbs have a prefix that separates from the verb stem in main clauses whenever the verb is finite (present and simple past tense, for example) and moves to the end of the clause. "Anrufen" (to call) splits in "Ich rufe dich morgen an" (literally "I call you tomorrow up"). "Aufstehen" (to get up) splits in "Wir stehen um sieben auf" (literally "we stand at seven up"). The transcript renders the sentence as spoken, with the prefix at the end, and a German reader recognises the split verb without effort. The point is that the transcript should not fold "an" or "auf" back into "anrufen" or "aufstehen", because that would change the syntax. Whisper handles this correctly.
Capitalisation of all nouns
German capitalises every noun, not just proper nouns. "Das Haus", "die Stadt", "ein Buch" all keep their capital mid-sentence. Sloppy ASR transcripts often lowercase everything except sentence starts and proper nouns, which produces text a German reader has to fix mentally. Whisper, trained on German text, keeps the convention, so the transcript is publication-ready (or close to it) without a manual capitalisation pass.
Regional varieties: Hochdeutsch, Swiss, Austrian, dialects
Standard Hochdeutsch is what newsreaders, university lecturers, and most business communication use. Whisper is strongest here. Austrian German (Österreichisches Hochdeutsch) is mostly Hochdeutsch with some vocabulary differences (Erdäpfel for potatoes, Jänner for January, Marille for apricot) and some pronunciation differences; transcription works well. Swiss German is the hard case: spoken Swiss German differs so much from written Hochdeutsch that even native speakers of Hochdeutsch often struggle to follow it. Whisper transcribes Swiss German as Hochdeutsch, giving you a "translated" written form. That is useful, but it loses dialect-specific vocabulary.
The ß question
Hochdeutsch as written in Germany and Austria uses ß (Eszett) in specific positions ("Straße", "Fußball"). Swiss Standard German has not used ß for decades and writes ss in all positions ("Strasse", "Fussball"). The transcript follows the speaker variety: Swiss speakers get ss, German speakers get ß. If you need consistency across sources, normalise in the editor.
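For a batch of exported transcripts, the ß-to-ss direction is mechanical and can be scripted. The sketch below is an illustration only; the function name eszett_to_ss is hypothetical and not an editor feature. The reverse direction is not a plain substitution, because "Wasser" keeps its ss while "Straße" takes ß, so converting Swiss spelling to German spelling needs a dictionary or a manual pass.

```python
import re

# Any word that contains a lowercase ß or the capital ẞ (U+1E9E).
_WORD_WITH_ESZETT = re.compile(r"\w*[ßẞ]\w*")


def _swap(match):
    word = match.group(0)
    rest = word.replace("ß", "").replace("ẞ", "")
    # All-caps words such as STRAßE take SS so the result stays all-caps.
    if rest and rest.isupper():
        return word.replace("ß", "SS").replace("ẞ", "SS")
    # Everywhere else ß becomes ss; a capital ẞ still becomes SS.
    return word.replace("ß", "ss").replace("ẞ", "SS")


def eszett_to_ss(text):
    """Rewrite ß as ss, following the Swiss convention."""
    return _WORD_WITH_ESZETT.sub(_swap, text)


print(eszett_to_ss("Die Straße zum Fußballplatz"))
```

Run this only on text you want in Swiss spelling, and keep the original export, since the change cannot be undone automatically.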