So I’m trying to parse my school’s website for some info, and I want to extract some values using XPath. I found an HTML5 parser, but it can’t properly parse the first line. Then I figured out it’s actually XHTML and not HTML. After a quick Google search I found out XHTML can be properly parsed using any XML parser, so I found one and… it can’t parse the first line either. So I asked Llama 3.1 (like a real programmer) why I can’t parse the first line with any parser. It explained things so nicely that I did not destroy my keyboard when I was told that this document is “XHTML 1.0 Transitional”, which it described as a mix of HTML 4 and XHTML that can’t be parsed with either an HTML or an XML parser. I hate the guy that invented that so much…

I can’t find a crate to parse XHTML 1.0 Transitional. Is there one, or a crate to convert XHTML to something else? Any advice?

  • taladar@sh.itjust.works
    2 months ago

    Have you tried some tag soup parser? That should work as a last resort even if the ones building a tree structure don’t.
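
To illustrate what "tag soup" parsing means in the reply above, here is a minimal sketch using only the Rust standard library. It is not a real crate and not a spec-compliant parser: the names `Token`, `tokenize`, and `texts_after` are made up for this example, and it assumes no `>` appears inside attribute values or comments. The point is that it only splits the input into tags and text, skips the XML declaration and DOCTYPE on the first lines, and never builds a tree, so unclosed or mismatched tags cannot make it fail.

```rust
// Minimal tag-soup tokenizer (illustrative sketch, hypothetical names).
// It never builds a tree, so malformed nesting is simply passed through.

#[derive(Debug, PartialEq)]
pub enum Token {
    Open { name: String, attrs: String },
    Close(String),
    Text(String),
}

/// Splits the input into opening tags, closing tags, and text runs.
pub fn tokenize(input: &str) -> Vec<Token> {
    let mut tokens = Vec::new();
    let mut rest = input;
    while !rest.is_empty() {
        match rest.find('<') {
            // Text before the next tag.
            Some(pos) if pos > 0 => {
                push_text(&mut tokens, &rest[..pos]);
                rest = &rest[pos..];
            }
            // `rest` starts with '<': read up to the next '>'.
            Some(_) => match rest.find('>') {
                Some(end) => {
                    push_tag(&mut tokens, &rest[1..end]);
                    rest = &rest[end + 1..];
                }
                // Unterminated tag: keep the remainder as plain text.
                None => {
                    push_text(&mut tokens, rest);
                    rest = "";
                }
            },
            // No more tags: the remainder is text.
            None => {
                push_text(&mut tokens, rest);
                rest = "";
            }
        }
    }
    tokens
}

fn push_text(tokens: &mut Vec<Token>, text: &str) {
    let text = text.trim();
    if !text.is_empty() {
        tokens.push(Token::Text(text.to_string()));
    }
}

fn push_tag(tokens: &mut Vec<Token>, inner: &str) {
    let inner = inner.trim();
    // Skip the XML declaration, DOCTYPE, and comments instead of failing on them.
    if inner.is_empty() || inner.starts_with('!') || inner.starts_with('?') {
        return;
    }
    if let Some(name) = inner.strip_prefix('/') {
        tokens.push(Token::Close(name.trim().to_ascii_lowercase()));
        return;
    }
    // Treat `<br/>` like `<br>`.
    let inner = inner.trim_end_matches('/').trim_end();
    let (name, attrs) = match inner.find(char::is_whitespace) {
        Some(i) => (&inner[..i], inner[i..].trim()),
        None => (inner, ""),
    };
    tokens.push(Token::Open {
        name: name.to_ascii_lowercase(),
        attrs: attrs.to_string(),
    });
}

/// Returns the text that directly follows each opening tag called `name`.
pub fn texts_after<'a>(tokens: &'a [Token], name: &str) -> Vec<&'a str> {
    tokens
        .windows(2)
        .filter_map(|pair| match (&pair[0], &pair[1]) {
            (Token::Open { name: n, .. }, Token::Text(t)) if n.as_str() == name => {
                Some(t.as_str())
            }
            _ => None,
        })
        .collect()
}
```

Calling `tokenize` on the page source and then something like `texts_after(&tokens, "td")` gives a crude stand-in for a simple XPath query. For anything beyond a quick scrape, an existing lenient HTML parsing crate is likely a better choice than hand-rolled code like this.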