Package com.norconex.importer.util
Class DOMUtil
- java.lang.Object
  - com.norconex.importer.util.DOMUtil
public final class DOMUtil extends Object
Utility methods related to JSoup/DOM manipulation.
- Since:
- 2.6.0
-
Field Summary
Modifier and Type    Field
static String        PARSER_HTML
static String        PARSER_XML
-
Method Summary
Modifier and Type                Method and Description
static String                    getElementValue(org.jsoup.nodes.Element element, String extract)
                                 Gets an element value based on JSoup DOM.
static org.jsoup.parser.Parser   toJSoupParser(String parser)
                                 Gets the JSoup parser associated with the string representation.
-
Field Detail
-
PARSER_HTML
public static final String PARSER_HTML
- Since:
- 2.8.0
- See Also:
- Constant Field Values
-
PARSER_XML
public static final String PARSER_XML
- Since:
- 2.8.0
- See Also:
- Constant Field Values
-
Method Detail
-
toJSoupParser
public static org.jsoup.parser.Parser toJSoupParser(String parser)
Gets the JSoup parser associated with the string representation. The string "xml" (case-insensitive) returns the XML parser; any other value returns the HTML parser.
- Parameters:
- parser - "html" or "xml"
- Returns:
- JSoup parser
- Since:
- 2.8.0
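The selection rule described above can be sketched without the JSoup dependency. This is a simplified stand-in, not the library's implementation: the class name ParserChoiceSketch, the Kind enum, and the choose method are hypothetical, and treating null as "anything else" is an assumption. The real method returns an org.jsoup.parser.Parser rather than an enum value.

```java
// Simplified model of the rule documented for DOMUtil.toJSoupParser:
// "xml" in any letter case selects the XML parser, everything else selects HTML.
final class ParserChoiceSketch {

    // Hypothetical enum standing in for the JSoup parser instances.
    enum Kind { HTML, XML }

    static Kind choose(String parser) {
        // Calling equalsIgnoreCase on the literal keeps this null-safe,
        // so a null argument falls back to HTML (an assumption of this sketch).
        return "xml".equalsIgnoreCase(parser) ? Kind.XML : Kind.HTML;
    }

    public static void main(String[] args) {
        System.out.println(choose("xml"));   // XML
        System.out.println(choose("XML"));   // XML
        System.out.println(choose("html"));  // HTML
        System.out.println(choose("json"));  // HTML (unrecognized values fall back)
        System.out.println(choose(null));    // HTML
    }
}
```

Because unrecognized values fall back to HTML rather than failing, a misspelled parser name would go unnoticed under this rule, so callers may want to validate the string themselves.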
-
getElementValue
public static String getElementValue(org.jsoup.nodes.Element element, String extract)
Gets an element value based on JSoup DOM. The "extract" argument controls exactly what gets extracted. Possible values are:
- text: Default option when extract is blank. The text of the element, including combined children.
- html: Extracts the element's inner HTML (including children).
- outerHtml: Extracts the element's outer HTML (like "html", but includes the element's own tag).
- ownText: Extracts the text owned by this element only; does not get the combined text of all children.
- data: Extracts the combined data of a data-element (e.g. <script>).
- id: Extracts the ID attribute of the element (if any).
- tagName: Extracts the tag name of the element.
- val: Extracts the value of a form element (input, textarea, etc.).
- className: Extracts the literal value of the element's "class" attribute, which may include multiple class names, space separated.
- cssSelector: Extracts a CSS selector that will uniquely select (identify) this element.
- attr(attributeKey): Extracts the value of the element attribute matching your replacement for "attributeKey" (e.g. "attr(title)" will extract the "title" attribute).
Typically, when the extraction type is specified as an attribute, implementors can accept the following:
extract="[text|html|outerHtml|ownText|data|id|tagName|val|className|cssSelector|attr(attributeKey)]"
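How an "extract" value might be interpreted can be sketched without the JSoup dependency. This is a simplified illustration, not the library's implementation: the class ExtractArgSketch and its two methods are hypothetical, and trimming whitespace and matching "attr(" case-sensitively are assumptions. It shows only the two rules stated above: a blank value defaults to "text", and "attr(attributeKey)" carries the attribute name between the parentheses.

```java
// Simplified model of interpreting the "extract" argument of getElementValue.
final class ExtractArgSketch {

    // Returns the extraction mode, defaulting to "text" when the argument is blank.
    static String mode(String extract) {
        if (extract == null || extract.trim().isEmpty()) {
            return "text";
        }
        String ext = extract.trim();
        // "attr(title)" and similar values all map to the "attr" mode.
        return ext.startsWith("attr(") && ext.endsWith(")") ? "attr" : ext;
    }

    // Returns the key inside "attr(...)", or null when the mode is not "attr".
    static String attributeKey(String extract) {
        if (!"attr".equals(mode(extract))) {
            return null;
        }
        String ext = extract.trim();
        return ext.substring("attr(".length(), ext.length() - 1);
    }

    public static void main(String[] args) {
        System.out.println(mode(""));                    // text
        System.out.println(mode("ownText"));             // ownText
        System.out.println(mode("attr(title)"));         // attr
        System.out.println(attributeKey("attr(title)")); // title
        System.out.println(attributeKey("html"));        // null
    }
}
```

With the real API, the resolved mode would then select the corresponding JSoup accessor on the element; that dispatch is omitted here.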
- Parameters:
- element - the element to extract the value from
- extract - the type of extraction to perform
- Returns:
- the element value
- See Also:
- Element