
Using XPATH and HTML Cleaner to parse HTML/XML

Source: 懂视网  Editor: 小采  Date: 2020-11-27 16:00:52

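Translated by 太阳火神的美丽人生

This article is released under the Creative Commons "Attribution-NonCommercial-ShareAlike" license.

When reposting, please keep this sentence: 太阳火神的美丽人生 - this blog focuses on agile development and on research into mobile and IoT devices: iOS, Android, Html5, Arduino, pcDuino. Otherwise, articles from this blog may not be reposted or re-reposted. Thank you for your cooperation.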




JANUARY 5, 2010

tags: android, examples, HTML, parse, scraping, XML, XPATH

Hey everyone,

Something that I've found to be extremely useful, especially in web-related applications, is the ability to retrieve HTML from websites and parse it for data or whatever else you may be looking for (in my case it is almost always data).


I actually use this technique to do the real time stock/option imports for my Black-Scholes/Implied Volatility applications, so if you’re looking for an example on how to retrieve and parse HTML and run “queries” over it using, say, XPATH, then this post is for you.

Now, before we begin: in order to do this you will have to reference an external JAR in your project's build path. The JAR that I use comes from HtmlCleaner, whose site includes its own usage example, but in addition to that I'll show you an example of how I use it.


import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

import javax.xml.parsers.ParserConfigurationException;

import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;
import org.htmlcleaner.XPatherException;
import org.xml.sax.SAXException;

public class OptionScraper {

    // Example XPath queries in the form of strings; they are used further down
    private static final String NAME_XPATH = "//div[@class='yfi_quote']/div[@class='hd']/h2";
    private static final String TIME_XPATH = "//table[@id='time_table']/tbody/tr/td[@class='yfnc_tabledata1']";
    private static final String PRICE_XPATH = "//table[@id='price_table']//tr//span";

    // Root TagNode of the cleaned document
    private static TagNode node;

    // Retrieves a stock option's data based on its name (e.g. GOUAA is one of Google's stock options)
    public static Option getOptionFromName(String name)
            throws XPatherException, ParserConfigurationException, SAXException, IOException {

        // NOTE: the original listing uses "o" without declaring it. Option is the author's
        // own class (not shown); a no-argument constructor is assumed here.
        Option o = new Option();

        // The URL whose HTML I want to retrieve and parse
        String optionUrl = "http://finance.yahoo.com/q?s=" + name.toUpperCase();

        // This is where HtmlCleaner comes in; I initialize it here
        HtmlCleaner cleaner = new HtmlCleaner();
        CleanerProperties props = cleaner.getProperties();
        props.setAllowHtmlInsideAttributes(true);
        props.setAllowMultiWordAttributes(true);
        props.setRecognizeUnicodeChars(true);
        props.setOmitComments(true);

        // Open a connection to the desired URL
        URL url = new URL(optionUrl);
        URLConnection conn = url.openConnection();

        // Use the cleaner to "clean" the HTML and return it as a TagNode object
        node = cleaner.clean(new InputStreamReader(conn.getInputStream()));

        // Once the HTML is cleaned, you can run your XPath expressions on the node. Each call
        // returns an array of matches (typed as Object, cast to TagNode below)
        Object[] infoNodes = node.evaluateXPath(NAME_XPATH);
        Object[] timeNodes = node.evaluateXPath(TIME_XPATH);
        Object[] priceNodes = node.evaluateXPath(PRICE_XPATH);

        // A simple check that my XPath was correct and that at least one node was returned
        if (infoNodes.length > 0) {
            // Cast to a TagNode
            TagNode infoNode = (TagNode) infoNodes[0];
            // How to retrieve the contents as a string
            String info = infoNode.getChildren().iterator().next().toString().trim();
            // Some method (not shown) that processes the string of information
            // (in my case, this was the stock quote, etc.)
            processInfoNode(o, info);
        }

        if (timeNodes.length > 0) {
            TagNode timeNode = (TagNode) timeNodes[0];
            String date = timeNode.getChildren().iterator().next().toString().trim();
            // The date comes back in 15-Jan-10 format, so this is a method I wrote (not shown)
            // to parse that string into the format that I use
            processDateNode(o, date);
        }

        if (priceNodes.length > 0) {
            TagNode priceNode = (TagNode) priceNodes[0];
            double price = Double.parseDouble(
                    priceNode.getChildren().iterator().next().toString().trim());
            o.setPremium(price);
        }

        return o;
    }
}
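The listing mentions that the date comes back in 15-Jan-10 format and is handed to a helper that is not shown. As a rough idea of what such a step could look like, here is a minimal sketch using only the JDK's java.time classes. The class name QuoteDateSketch and the method parseQuoteDate are made up for illustration and are not the author's processDateNode; the sketch also assumes English month abbreviations and a two-digit year in the 2000s.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.util.Locale;

class QuoteDateSketch {

    // Accepts "15-JAN-10" as well as "15-Jan-10"; month names are matched in English.
    // A two-digit year is read relative to 2000, so "10" becomes 2010 (an assumption).
    private static final DateTimeFormatter QUOTE_DATE = new DateTimeFormatterBuilder()
            .parseCaseInsensitive()
            .appendPattern("d-MMM-yy")
            .toFormatter(Locale.ENGLISH);

    // Hypothetical helper: turns the scraped text into a LocalDate.
    // Throws DateTimeParseException if the text does not match the pattern.
    static LocalDate parseQuoteDate(String raw) {
        return LocalDate.parse(raw.trim(), QUOTE_DATE);
    }

    public static void main(String[] args) {
        System.out.println(parseQuoteDate(" 15-JAN-10 ")); // prints 2010-01-15
    }
}
```

Parsing into a LocalDate first, rather than slicing the string by hand, means a changed page layout fails loudly with an exception instead of silently producing a wrong date.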

So that's it! Once you include the JAR in your build path, everything else is pretty easy. It's a great tool to use. It does require some knowledge of XPath, but XPath isn't too hard to pick up and is useful to know, so if you don't know it yet, it is worth working through an XPath introduction first.

Now, a warning to everyone. It's documented that the XPath support in HtmlCleaner is not complete, in the sense that only "basic" XPath is recognized. What's excluded? For instance, you can't use any of the axis specifiers (e.g. parent, ancestor, following, following-sibling), but in my experience everything else is fair game. Yes, it sucks, and many times it can make your life a little bit harder, but usually it just requires you to be a tad more clever with your XPath expressions before you can pull the desired information.
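To make the axis limitation concrete, here is a small sketch that uses the JDK's own javax.xml.xpath engine, which implements full XPath 1.0 including axes, on a well-formed snippet. It does not use HtmlCleaner at all: the class name AxisXPathSketch and the sample markup are invented for illustration and are not the real Yahoo Finance page. HtmlCleaner ships a DomSerializer class for turning a TagNode into a W3C DOM, which would be one way to get from cleaned HTML to this engine; that conversion is not shown here and should be checked against the HtmlCleaner version you use.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class AxisXPathSketch {

    // Made-up, well-formed markup standing in for a cleaned page.
    static final String SAMPLE =
          "<table id='price_table'>"
        + "<tr><td>Last Trade:</td><td><span>12.50</span></td></tr>"
        + "<tr><td>Trade Time:</td><td>15-Jan-10</td></tr>"
        + "</table>";

    // Parses the XML text and returns the string value of the first match.
    static String evaluate(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, doc).trim();
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // With axes: find the label cell, then step to its sibling.
        System.out.println(evaluate(SAMPLE,
                "//td[.='Trade Time:']/following-sibling::td"));   // prints 15-Jan-10
        // Without axes: the same cell addressed by position instead.
        System.out.println(evaluate(SAMPLE,
                "//table[@id='price_table']//tr[2]/td[2]"));        // prints 15-Jan-10
        // Another axis: walk up from the span to the enclosing table.
        System.out.println(evaluate(SAMPLE,
                "//span/ancestor::table/@id"));                     // prints price_table
    }
}
```

The second expression is the kind of "more clever" rewrite the paragraph above has in mind: it reaches the same cell by position rather than by sibling relationship, at the cost of breaking if the row order changes.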

And of course, this technique works for XML documents as well!

Hope this was helpful to everyone. Let me know if you’re confused anywhere.

- jwei
