Path: blob/main/vendor/golang.org/x/net/html/doc.go
// Copyright 2010 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.

/*
Package html implements an HTML5-compliant tokenizer and parser.

Tokenization is done by creating a Tokenizer for an io.Reader r. It is the
caller's responsibility to ensure that r provides UTF-8 encoded HTML.

	z := html.NewTokenizer(r)

Given a Tokenizer z, the HTML is tokenized by repeatedly calling z.Next(),
which parses the next token and returns its type, or an error:

	for {
		tt := z.Next()
		if tt == html.ErrorToken {
			// ...
			return ...
		}
		// Process the current token.
	}

There are two APIs for retrieving the current token. The high-level API is to
call Token; the low-level API is to call Text or TagName / TagAttr. Both APIs
allow optionally calling Raw after Next but before Token, Text, TagName, or
TagAttr. In EBNF notation, the valid call sequence per token is:

	Next {Raw} [ Token | Text | TagName {TagAttr} ]

Token returns an independent data structure that completely describes a token.
Entities (such as "&lt;") are unescaped, tag names and attribute keys are
lower-cased, and attributes are collected into a []Attribute. For example:

	for {
		if z.Next() == html.ErrorToken {
			// Returning io.EOF indicates success.
			return z.Err()
		}
		emitToken(z.Token())
	}

The low-level API performs fewer allocations and copies, but the contents of
the []byte values returned by Text, TagName and TagAttr may change on the next
call to Next. For example, to extract an HTML page's anchor text:

	depth := 0
	for {
		tt := z.Next()
		switch tt {
		case html.ErrorToken:
			return z.Err()
		case html.TextToken:
			if depth > 0 {
				// emitBytes should copy the []byte it receives,
				// if it doesn't process it immediately.
				emitBytes(z.Text())
			}
		case html.StartTagToken, html.EndTagToken:
			tn, _ := z.TagName()
			if len(tn) == 1 && tn[0] == 'a' {
				if tt == html.StartTagToken {
					depth++
				} else {
					depth--
				}
			}
		}
	}

Parsing is done by calling Parse with an io.Reader, which returns the root of
the parse tree (the document element) as a *Node. It is the caller's
responsibility to ensure that the Reader provides UTF-8 encoded HTML. For
example, to process each anchor node in depth-first order:

	doc, err := html.Parse(r)
	if err != nil {
		// ...
	}
	for n := range doc.Descendants() {
		if n.Type == html.ElementNode && n.Data == "a" {
			// Do something with n...
		}
	}

The relevant specifications include:
https://html.spec.whatwg.org/multipage/syntax.html and
https://html.spec.whatwg.org/multipage/syntax.html#tokenization

# Security Considerations

Care should be taken when parsing and interpreting HTML, whether full documents
or fragments, within the framework of the HTML specification, especially with
regard to untrusted inputs.

This package provides both a tokenizer and a parser, which implement the
tokenization, and tokenization and tree construction stages of the WHATWG HTML
parsing specification respectively. While the tokenizer parses and normalizes
individual HTML tokens, only the parser constructs the DOM tree from the
tokenized HTML, as described in the tree construction stage of the
specification, dynamically modifying or extending the document's DOM tree.

If your use case requires semantically well-formed HTML documents, as defined by
the WHATWG specification, the parser should be used rather than the tokenizer.

In security contexts, if trust decisions are being made using the tokenized or
parsed content, the input must be re-serialized (for instance by using Render or
Token.String) in order for those trust decisions to hold, as the process of
tokenization or parsing may alter the content.
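As an illustrative sketch only (r stands for the untrusted input reader, buf is
a bytes.Buffer from the standard library, and handleSanitizedHTML is a
hypothetical downstream consumer, in the same spirit as emitToken and emitBytes
above), such a parse-and-re-serialize round trip could look like:

	doc, err := html.Parse(r)
	if err != nil {
		// ...
	}
	var buf bytes.Buffer
	if err := html.Render(&buf, doc); err != nil {
		// ...
	}
	// Make trust decisions against, and pass along, the re-serialized
	// output rather than the original input bytes.
	handleSanitizedHTML(buf.String())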
*/
package html // import "golang.org/x/net/html"

// The tokenization algorithm implemented by this package is not a line-by-line
// transliteration of the relatively verbose state-machine in the WHATWG
// specification. A more direct approach is used instead, where the program
// counter implies the state, such as whether it is tokenizing a tag or a text
// node. Specification compliance is verified by checking expected and actual
// outputs over a test suite rather than aiming for algorithmic fidelity.

// TODO(nigeltao): Does a DOM API belong in this package or a separate one?
// TODO(nigeltao): How does parsing interact with a JavaScript engine?