xml,json都有大量的库来解析,我们如何解析html呢?

TFHpple是一个小型的封装,可以用来解析html,它是对libxml的封装,语法是xpath。
今天我看到一个直接用libxml来解析html,参看:http://www.cocoanetics.com/2011/09/taming-html-parsing-with-libxml-1/#comment-3090 那张图画得一目了然,很值得收藏。这个文章中的源码不能遍历所有的html,我做了一点修改可以将html遍历打印出来

// NSData data contains the document data
// encoding is the NSStringEncoding of the data
// baseURL the documents base URL, i.e. location

CFStringEncoding cfenc = CFStringConvertNSStringEncodingToEncoding(encoding);
CFStringRef cfencstr = CFStringConvertEncodingToIANACharSetName(cfenc);
const char *enc = CFStringGetCStringPtr(cfencstr, 0);

htmlDocPtr _htmlDocument = htmlReadDoc([data bytes],
[[baseURL absoluteString] UTF8String],
enc,
XML_PARSE_NOERROR | XML_PARSE_NOWARNING);
if (_htmlDocument)
{
xmlFreeDoc(_htmlDocument);
}

xmlNodePtr currentNode = (xmlNodePtr)_htmlDocument;

while (currentNode)
{
// output node if it is an element

if (currentNode->type == XML_ELEMENT_NODE)
{
NSMutableArray *attrArray = [NSMutableArray array];

for (xmlAttrPtr attrNode = currentNode->properties; attrNode; attrNode = attrNode->next)
{
xmlNodePtr contents = attrNode->children;

[attrArray addObject:[NSString stringWithFormat:@"%s='%s'", attrNode->name, contents->content]];
}

NSString *attrString = [attrArray componentsJoinedByString:@" "];

if ([attrString length])
{
attrString = [@" " stringByAppendingString:attrString];
}

NSLog(@"<%s%@>", currentNode->name, attrString);
}
else if (currentNode->type == XML_TEXT_NODE)
{
//NSLog(@"%s", currentNode->content);
NSLog(@"%@", [NSString stringWithCString:(const char *)currentNode->content encoding:NSUTF8StringEncoding]);
}
else if (currentNode->type == XML_COMMENT_NODE)
{
NSLog(@"/* %s */", currentNode->name);
}

if (currentNode && currentNode->children)
{
currentNode = currentNode->children;
}
else if (currentNode && currentNode->next)
{
currentNode = currentNode->next;
}
else
{
currentNode = currentNode->parent;

// close node
if (currentNode && currentNode->type == XML_ELEMENT_NODE)
{
NSLog(@"</%s>", currentNode->name);
}

if (currentNode->next)
{
currentNode = currentNode->next;
}
else
{
while(currentNode)
{
currentNode = currentNode->parent;
if (currentNode && currentNode->type == XML_ELEMENT_NODE)
{
NSLog(@"</%s>", currentNode->name);
if (strcmp((const char *)currentNode->name, "table") == 0)
{
NSLog(@"over");
}
}

if (currentNode == nodes->nodeTab[0])
{
break;
}

if (currentNode && currentNode->next)
{
currentNode = currentNode->next;
break;
}
}
}
}

if (currentNode == nodes->nodeTab[0])
{
break;
}
}

不过我还是喜欢用TFHpple,因为它很简单,也好用,但是它的功能不是很完完善。比如,不能获取children node,我就写了两个方法,一个是获取children node,一个是获取所有的contents. 还有node的属性content的key与node's content的key一样,都是@"nodeContent", 正确情况下属性的应是@"attributeContent",
所以我写了这个方法,同时修改node属性的content key.
NSDictionary *DictionaryForNode2(xmlNodePtr currentNode, NSMutableDictionary *parentResult)
{
NSMutableDictionary *resultForNode = [NSMutableDictionary dictionary];

if (currentNode->name)
{
NSString *currentNodeContent =
[NSString stringWithCString:(const char *)currentNode->name encoding:NSUTF8StringEncoding];
[resultForNode setObject:currentNodeContent forKey:@"nodeName"];
}

if (currentNode->content)
{
NSString *currentNodeContent = [NSString stringWithCString:(const char *)currentNode->content encoding:NSUTF8StringEncoding];

if (currentNode->type == XML_TEXT_NODE)
{
if (currentNode->parent->type == XML_ELEMENT_NODE)
{
[parentResult setObject:currentNodeContent forKey:@"nodeContent"];
return nil;
}

if (currentNode->parent->type == XML_ATTRIBUTE_NODE)
{
[parentResult
setObject:
[currentNodeContent
stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]]
forKey:@"attributeContent"];
return nil;

}
}
}

xmlAttr *attribute = currentNode->properties;
if (attribute)
{
NSMutableArray *attributeArray = [NSMutableArray array];
while (attribute)
{
NSMutableDictionary *attributeDictionary = [NSMutableDictionary dictionary];
NSString *attributeName =
[NSString stringWithCString:(const char *)attribute->name encoding:NSUTF8StringEncoding];
if (attributeName)
{
[attributeDictionary setObject:attributeName forKey:@"attributeName"];
}

if (attribute->children)
{
NSDictionary *childDictionary = DictionaryForNode2(attribute->children, attributeDictionary);
if (childDictionary)
{
[attributeDictionary setObject:childDictionary forKey:@"attributeContent"];
}
}

if ([attributeDictionary count] > 0)
{
[attributeArray addObject:attributeDictionary];
}
attribute = attribute->next;
}

if ([attributeArray count] > 0)
{
[resultForNode setObject:attributeArray forKey:@"nodeAttributeArray"];
}
}

xmlNodePtr childNode = currentNode->children;
if (childNode)
{
NSMutableArray *childContentArray = [NSMutableArray array];
while (childNode)
{
NSDictionary *childDictionary = DictionaryForNode2(childNode, resultForNode);
if (childDictionary)
{
[childContentArray addObject:childDictionary];
}
childNode = childNode->next;
}
if ([childContentArray count] > 0)
{
[resultForNode setObject:childContentArray forKey:@"nodeChildArray"];
}
}

return resultForNode;
}

TFHppleElement.m里加了两个key 常量
NSString * const TFHppleNodeAttributeContentKey = @"attributeContent";
NSString * const TFHppleNodeChildArrayKey = @"nodeChildArray";

并修改获取属性方法为:
- (NSDictionary *) attributes
{
NSMutableDictionary * translatedAttributes = [NSMutableDictionary dictionary];
for (NSDictionary * attributeDict in [node objectForKey:TFHppleNodeAttributeArrayKey]) {
[translatedAttributes setObject:[attributeDict objectForKey:TFHppleNodeAttributeContentKey]
forKey:[attributeDict objectForKey:TFHppleNodeAttributeNameKey]];
}
return translatedAttributes;
}

并添加获取children node 方法:
- (BOOL) hasChildren
{
NSArray *childs = [node objectForKey: TFHppleNodeChildArrayKey];

if (childs)
{
return YES;
}

return NO;
}

- (NSArray *) children
{
if ([self hasChildren])
return [node objectForKey: TFHppleNodeChildArrayKey];
return nil;
}

最新文章

  1. ZBrush中的头部模型该如何进行雕刻
  2. C#不对称加密
  3. 带节日和农历的js日历
  4. android测试点汇总
  5. Redis学习笔记(1)-Key
  6. 程序员书单_HeadFirst系列
  7. 禁止触屏滑动touchmove方法介绍
  8. POJ 2226 Muddy Fields (最小点覆盖集,对比POJ 3041)
  9. PSI在windows server2008服务器上的安装方法
  10. (转)RabbitMQ消息队列(九):Publisher的消息确认机制
  11. table总结insertRow、deleteRow
  12. 自定义Sharepoint的登陆页面
  13. Struts2 技术全总结 (正在更新)
  14. Echarts数据可视化series-radar雷达图,开发全解+完美注释
  15. centos6.6 7 vim编辑器中文乱码
  16. java静态代码块、普通代码
  17. AOP 如果被代理对象的方法设置了参数 而代理对象的前置方法没有设置参数 则无法拦截到
  18. CentOS 7下KVM支持虚拟化/嵌套虚拟化配置
  19. 带CookieContainer进行post
  20. CFGym 101161I 题解

热门文章

  1. chrome浏览器font-size&lt;12px无效解决办法
  2. iOS开发者必备的10款工具
  3. android多种布局的列表实现
  4. Spring-2-J Goblin Wars(SPOJ AMR11J)解题报告及测试数据
  5. JavaScript Patterns 4.4 Self-Defining Functions
  6. 【mysql】一个关于order by排序的问题
  7. redis动态修改参数配置
  8. spring ioc DI 理解
  9. C. Coloring Trees DP
  10. HDU 4941 Magical Forest --STL Map应用