Jun 24, 2016 am 11:51 AM

Using XPATH and HTML Cleaner to parse HTML/XML

The Beautiful Life of Sun Vulcan

This article is published under the Creative Commons "Attribution-NonCommercial-ShareAlike" license.

Please keep this sentence when reprinting: The Beautiful Life of Sun Vulcan. This blog focuses on agile development and on research into mobile and IoT devices: iOS, Android, HTML5, Arduino, pcDuino. Otherwise, articles from this blog may not be reproduced or reprinted. Thank you for your cooperation.

JANUARY 5, 2010

tags: android, examples, HTML, parse, scraping, XML, XPATH

Hey everyone,

Something that I've found to be extremely useful (especially in web-related applications) is the ability to retrieve HTML from websites and parse that HTML for data or whatever else you may be looking for (in my case it is almost always data).

I actually use this technique to do the real-time stock/option imports for my Black-Scholes/Implied Volatility applications, so if you're looking for an example of how to retrieve and parse HTML and run “queries” over it using, say, XPATH, then this post is for you.

Now, before we begin: in order to do this you will have to reference an external JAR in your project's build path. The JAR that I use comes from HtmlCleaner, whose site also gives its own usage example, but in addition to that I'll show you an example of how I use it.

import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

import javax.xml.parsers.ParserConfigurationException;

import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;
import org.htmlcleaner.XPatherException;
import org.xml.sax.SAXException;

public class OptionScraper {

    // EXAMPLE XPATH QUERIES IN THE FORM OF STRINGS - WILL BE USED LATER
    private static final String NAME_XPATH = "//div[@class='yfi_quote']/div[@class='hd']/h2";
    private static final String TIME_XPATH = "//table[@id='time_table']/tbody/tr/td[@class='yfnc_tabledata1']";
    private static final String PRICE_XPATH = "//table[@id='price_table']//tr//span";

    // TAGNODE OBJECT, ITS USE WILL COME IN LATER
    private static TagNode node;

    // A METHOD THAT HELPS ME RETRIEVE THE STOCK OPTION'S DATA BASED OFF THE NAME (I.E. GOUAA IS ONE OF GOOGLE'S STOCK OPTIONS)
    public static Option getOptionFromName(String name) throws XPatherException, ParserConfigurationException, SAXException, IOException {

        // THE OPTION OBJECT TO FILL IN AND RETURN (OPTION IS MY OWN DATA CLASS, NOT SHOWN IN THIS POST; A NO-ARG CONSTRUCTOR IS ASSUMED HERE)
        Option o = new Option();

        // THE URL WHOSE HTML I WANT TO RETRIEVE AND PARSE
        String option_url = "http://finance.yahoo.com/q?s=" + name.toUpperCase();

        // THIS IS WHERE THE HTMLCLEANER COMES IN, I INITIALIZE IT HERE
        HtmlCleaner cleaner = new HtmlCleaner();
        CleanerProperties props = cleaner.getProperties();
        props.setAllowHtmlInsideAttributes(true);
        props.setAllowMultiWordAttributes(true);
        props.setRecognizeUnicodeChars(true);
        props.setOmitComments(true);

        // OPEN A CONNECTION TO THE DESIRED URL
        URL url = new URL(option_url);
        URLConnection conn = url.openConnection();

        // USE THE CLEANER TO "CLEAN" THE HTML AND RETURN IT AS A TAGNODE OBJECT
        node = cleaner.clean(new InputStreamReader(conn.getInputStream()));

        // ONCE THE HTML IS CLEANED, THEN YOU CAN RUN YOUR XPATH EXPRESSIONS ON THE NODE, WHICH WILL THEN RETURN AN ARRAY OF TAGNODE OBJECTS (THESE ARE RETURNED AS OBJECTS BUT GET CASTED BELOW)
        Object[] info_nodes = node.evaluateXPath(NAME_XPATH);
        Object[] time_nodes = node.evaluateXPath(TIME_XPATH);
        Object[] price_nodes = node.evaluateXPath(PRICE_XPATH);

        // HERE I JUST DO A SIMPLE CHECK TO MAKE SURE THAT MY XPATH WAS CORRECT AND THAT AN ACTUAL NODE(S) WAS RETURNED
        if (info_nodes.length > 0) {
            // CASTED TO A TAGNODE
            TagNode info_node = (TagNode) info_nodes[0];
            // HOW TO RETRIEVE THE CONTENTS AS A STRING
            String info = info_node.getChildren().iterator().next().toString().trim();
            // SOME METHOD THAT PROCESSES THE STRING OF INFORMATION (IN MY CASE, THIS WAS THE STOCK QUOTE, ETC)
            processInfoNode(o, info);
        }

        if (time_nodes.length > 0) {
            TagNode time_node = (TagNode) time_nodes[0];
            String date = time_node.getChildren().iterator().next().toString().trim();
            // DATE RETURNED IN 15-JAN-10 FORMAT, SO THIS IS SOME METHOD I WROTE TO JUST PARSE THAT STRING INTO THE FORMAT THAT I USE
            processDateNode(o, date);
        }

        if (price_nodes.length > 0) {
            TagNode price_node = (TagNode) price_nodes[0];
            double price = Double.parseDouble(price_node.getChildren().iterator().next().toString().trim());
            o.setPremium(price);
        }

        return o;
    }
}
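The post does not show processDateNode, so here is a minimal sketch of how a date string in the 15-JAN-10 format could be parsed with the JDK's java.time classes. The class name QuoteDateParser and the method parseQuoteDate are hypothetical, not part of the original code, and the choice of LocalDate as the target type is an assumption.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.util.Locale;

class QuoteDateParser {

    // Accepts day-month-year text such as "15-JAN-10". Month names are matched
    // case-insensitively and two-digit years are read as 2000-2099.
    private static final DateTimeFormatter QUOTE_DATE =
            new DateTimeFormatterBuilder()
                    .parseCaseInsensitive()
                    .appendPattern("d-MMM-uu")
                    .toFormatter(Locale.ENGLISH);

    // Hypothetical stand-in for the date handling inside processDateNode.
    static LocalDate parseQuoteDate(String raw) {
        return LocalDate.parse(raw.trim(), QUOTE_DATE);
    }

    public static void main(String[] args) {
        // Surrounding whitespace is trimmed before parsing.
        System.out.println(parseQuoteDate(" 15-JAN-10 "));
    }
}
```

Fixing the locale to English keeps the month abbreviations independent of the machine's default locale, which matters if the scraper runs on a non-English system.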

So that’s it! Once you include the JAR in your build path, everything else is pretty easy, and it’s a great tool to use. It does require some knowledge of XPATH, but XPATH isn’t too hard to pick up and is useful to know, so if you don’t know it yet, take a look at an XPATH tutorial.

Now, a warning to everyone. It’s documented that the set of XPATH expressions recognized by HtmlCleaner is not complete, in the sense that only “basic” XPATH is recognized. What’s excluded? For instance, you can’t use any of the “axes” operators (e.g. parent, ancestor, following, following-sibling), but in my experience everything else is fair game. Yes, it sucks, and many times it can make your life a little bit harder, but usually it just requires you to be a tad more clever with your XPATH expressions before you can pull the desired information.
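To make "a tad more clever" concrete, here is a small sketch of rewriting an expression that uses an axis into one that does not. It runs on the JDK's built-in javax.xml.xpath evaluator rather than on HtmlCleaner, purely to show that the two expressions select the same cell. The sample markup and the class name AxisFreeXPath are made up, and whether HtmlCleaner's evaluator accepts the nested predicate in the second expression is an assumption you should check against your HtmlCleaner version.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class AxisFreeXPath {

    // Made-up, well-formed markup standing in for a cleaned quote page.
    static final String SAMPLE =
            "<table id='price_table'>"
          + "<tr><td>Last Trade</td><td>4.20</td></tr>"
          + "<tr><td>Volume</td><td>150</td></tr>"
          + "</table>";

    // Uses the following-sibling axis, one of the operators the post says is excluded.
    static final String WITH_AXIS = "//td[.='Last Trade']/following-sibling::td";

    // Reaches the same cell from the row downwards, with no axis operator.
    static final String WITHOUT_AXIS = "//tr[td[1]='Last Trade']/td[2]";

    // Evaluates an expression against SAMPLE and returns the text of the first match.
    static String eval(String expression) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expression, new InputSource(new StringReader(SAMPLE)));
    }

    public static void main(String[] args) throws XPathExpressionException {
        System.out.println(eval(WITH_AXIS));
        System.out.println(eval(WITHOUT_AXIS));
    }
}
```

If even the nested predicate is rejected, a fallback is to select all the rows with a plain path such as //table[@id='price_table']//tr and pick the right cell in Java.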

And of course, this technique works for XML documents as well!
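For input that is already well-formed XML, the same style of query can also be run without any external JAR. The sketch below uses the JDK's javax.xml.xpath instead of HtmlCleaner, so it is a swapped-in technique rather than the one used above. The XML sample, the symbol GOUAB, and the class name XmlQuoteReader are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XmlQuoteReader {

    // Made-up XML document; a real feed would have its own structure.
    static final String SAMPLE_XML =
            "<options>"
          + "<option symbol='GOUAA'><premium>4.20</premium></option>"
          + "<option symbol='GOUAB'><premium>3.10</premium></option>"
          + "</options>";

    // Returns the symbol attribute of every option element, in document order.
    static List<String> symbols(String xml) throws XPathExpressionException {
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath().evaluate(
                "//option/@symbol",
                new InputSource(new StringReader(xml)),
                XPathConstants.NODESET);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add(nodes.item(i).getNodeValue());
        }
        return result;
    }

    public static void main(String[] args) throws XPathExpressionException {
        System.out.println(symbols(SAMPLE_XML));
    }
}
```

The JDK evaluator only accepts well-formed input, so for real-world HTML the cleaning step that HtmlCleaner performs is still needed first.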

Hope this was helpful to everyone. Let me know if you’re confused anywhere.

- jwei
