When teams say they want markdown, they usually mean something more specific: content that is compact, readable, and safe to pass into another system without extra cleanup.
Start with the page shape
The first useful question is not "How much text can we collect?" It is "Which parts of the page belong to the main story?"
If headers, sidebars, share widgets, and newsletter blocks survive into the later stages of extraction, the final markdown is technically correct but operationally noisy.
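One way to make that question concrete is to score candidate regions by how much of their text is prose rather than links. The sketch below is only an illustration under stated assumptions: the `Region` shape, the `linkDensity` measure, and the 0.5 cutoff are hypothetical and do not come from any particular extractor.

```typescript
// Hypothetical shape for a candidate page region; field names are assumptions.
interface Region {
  tag: string;            // e.g. "article", "aside", "nav"
  textLength: number;     // characters of visible text in the region
  linkTextLength: number; // characters of that text that sit inside links
}

// Share of a region's text that sits inside links (0 for an empty region).
function linkDensity(region: Region): number {
  return region.textLength === 0 ? 0 : region.linkTextLength / region.textLength;
}

// Pick the region with the most non-link text, skipping link-heavy regions.
// The 0.5 cutoff is an illustrative guess, not a tuned value.
function pickMainRegion(regions: Region[], maxLinkDensity = 0.5): Region | undefined {
  let best: Region | undefined;
  let bestScore = 0;
  for (const region of regions) {
    if (linkDensity(region) > maxLinkDensity) continue;
    const score = region.textLength - region.linkTextLength;
    if (score > bestScore) {
      best = region;
      bestScore = score;
    }
  }
  return best;
}

const candidates: Region[] = [
  { tag: "nav", textLength: 200, linkTextLength: 190 },
  { tag: "article", textLength: 4000, linkTextLength: 300 },
  { tag: "aside", textLength: 600, linkTextLength: 450 },
];
const mainRegion = pickMainRegion(candidates);
```

With these sample numbers the navigation and sidebar regions are skipped as link-heavy and the article region is selected. A real page would likely need more signals than link density alone.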
Remove noise early
Extraction quality improves when boilerplate is filtered before formatting decisions are made.
That means pruning repeated navigation, recommendation blocks, cookie banners, and social chrome before they are allowed to influence heading order or paragraph grouping.
// Drop boilerplate first, then discard blocks that are empty after trimming.
const cleaned = blocks
  .filter((block) => !block.isBoilerplate)
  .filter((block) => block.text.trim().length > 0);
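The filter above assumes each block already carries an `isBoilerplate` flag. As a minimal sketch of where that flag might come from, the example below marks blocks by a `role` label and by text known to repeat across pages. The `Block` shape, the role names, and the `markBoilerplate` and `pruneBlocks` helpers are all assumptions made for illustration.

```typescript
// Hypothetical block shape; `role` and `isBoilerplate` are illustrative names.
interface Block {
  text: string;
  role: string; // e.g. "nav", "cookie-banner", "paragraph"
  isBoilerplate: boolean;
}

// Roles treated as page chrome in this sketch; adjust to fit the site.
const BOILERPLATE_ROLES = new Set<string>([
  "nav",
  "cookie-banner",
  "share",
  "recommendations",
  "newsletter",
]);

// Flag a block when its role is chrome or its trimmed text is known to repeat.
function markBoilerplate(
  blocks: Omit<Block, "isBoilerplate">[],
  repeatedTexts: ReadonlySet<string> = new Set<string>(),
): Block[] {
  return blocks.map((block) => ({
    ...block,
    isBoilerplate:
      BOILERPLATE_ROLES.has(block.role) || repeatedTexts.has(block.text.trim()),
  }));
}

// Same pruning step as in the text, run before any heading or paragraph logic.
function pruneBlocks(blocks: Block[]): Block[] {
  return blocks
    .filter((block) => !block.isBoilerplate)
    .filter((block) => block.text.trim().length > 0);
}

const rawBlocks = [
  { text: "Home | Pricing | Docs", role: "nav" },
  { text: "Extraction quality starts with the page shape.", role: "paragraph" },
  { text: "   ", role: "paragraph" },
  { text: "Subscribe to our newsletter", role: "paragraph" },
];
const prunedBlocks = pruneBlocks(
  markBoilerplate(rawBlocks, new Set<string>(["Subscribe to our newsletter"])),
);
```

Only the one real paragraph survives here: the navigation block is dropped by role, the newsletter line by repetition, and the whitespace-only block by the empty-text check.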
Keep hierarchy intact
A clean markdown export is not just plain text with line breaks.
It needs enough structure for a reader or model to understand where sections begin, when a list is really a list, and which lines of code or quoted text belong together.
That is why heading fidelity matters so much. Once the outline is wrong, anything that relies on section boundaries inherits the error.
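A minimal sketch of what preserving the outline could look like: headings are renumbered by nesting depth so the export never skips a level, and lists and code keep their own block types instead of collapsing into plain lines. The `OutlineBlock` model and the `normalizeHeadings` and `toMarkdown` helpers are assumptions for illustration, not an existing API.

```typescript
// Hypothetical block model; the names are assumptions for illustration.
type OutlineBlock =
  | { kind: "heading"; level: number; text: string }
  | { kind: "paragraph"; text: string }
  | { kind: "list"; items: string[] }
  | { kind: "code"; language: string; text: string };

// Build the code fence marker without writing it literally in this example.
const FENCE = "`".repeat(3);

// Renumber headings by nesting depth so the outline never skips a level.
// Headings that shared a source level under the same parent stay siblings.
function normalizeHeadings(blocks: OutlineBlock[]): OutlineBlock[] {
  const openLevels: number[] = []; // source levels of the currently open sections
  return blocks.map((block) => {
    if (block.kind !== "heading") return block;
    // Close every open section at the same or a deeper source level.
    while (openLevels.length > 0 && openLevels[openLevels.length - 1] >= block.level) {
      openLevels.pop();
    }
    openLevels.push(block.level);
    // Markdown headings stop at six levels, so cap the depth there.
    return { ...block, level: Math.min(openLevels.length, 6) };
  });
}

// Render each block in its own markdown form, separated by blank lines.
function toMarkdown(blocks: OutlineBlock[]): string {
  return normalizeHeadings(blocks)
    .map((block) => {
      switch (block.kind) {
        case "heading":
          return "#".repeat(block.level) + " " + block.text;
        case "paragraph":
          return block.text;
        case "list":
          return block.items.map((item) => "- " + item).join("\n");
        case "code":
          return FENCE + block.language + "\n" + block.text + "\n" + FENCE;
      }
    })
    .join("\n\n");
}

const exampleMarkdown = toMarkdown([
  { kind: "heading", level: 2, text: "Setup" },
  { kind: "heading", level: 4, text: "Install" },
  { kind: "list", items: ["Download", "Run the installer"] },
  { kind: "heading", level: 4, text: "Configure" },
  { kind: "paragraph", text: "Edit the config file." },
]);
```

In this example the source jumps from level 2 to level 4, and the export renders "Setup" at the top level with "Install" and "Configure" as sibling subsections beneath it. Whether to renumber or to preserve the source levels exactly is a design choice; renumbering trades literal fidelity for a consistent outline.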
