项目须要用脚本生成sitemap,中间学习了一下sitemap的格式和lxml库的使用方法。把结果记录一下,方便以后须要直接拿来用。

来自Python脚本生成sitemap


安装lxml

首先须要pip install lxml安装lxml库。

假设你在ubuntu上遇到了下面错误:

#include "libxml/xmlversion.h"

compilation terminated.

error: command 'x86_64-linux-gnu-gcc' failed with exit status 1

----------------------------------------
Cleaning up...
Removing temporary dir /tmp/pip_build_root...
Command /usr/bin/python -c "import setuptools, tokenize;__file__='/tmp/pip_build_root/lxml/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /tmp/pip-O4cIn6-record/install-record.txt --single-version-externally-managed --compile failed with error code 1 in /tmp/pip_build_root/lxml
Exception information:
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/pip/basecommand.py", line 122, in main
status = self.run(options, args)
File "/usr/lib/python2.7/dist-packages/pip/commands/install.py", line 283, in run
requirement_set.install(install_options, global_options, root=options.root_path)
File "/usr/lib/python2.7/dist-packages/pip/req.py", line 1435, in install
requirement.install(install_options, global_options, *args, **kwargs)
File "/usr/lib/python2.7/dist-packages/pip/req.py", line 706, in install
cwd=self.source_dir, filter_stdout=self._filter_install, show_stdout=False)
File "/usr/lib/python2.7/dist-packages/pip/util.py", line 697, in call_subprocess
% (command_desc, proc.returncode, cwd))
InstallationError: Command /usr/bin/python -c "import setuptools, tokenize;__file__='/tmp/pip_build_root/lxml/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /tmp/pip-O4cIn6-record/install-record.txt --single-version-externally-managed --compile failed with error code 1 in /tmp/pip_build_root/lxml

请安装下面依赖:

sudo apt-get install libxml2-dev libxslt1-dev


Python代码

下面是生成sitemap和sitemapindex索引的代码。能够依照需求传入须要的參数。或者添加字段:

#!/usr/bin/env python
# -*- coding:utf-8 -*- import io
import re
from lxml import etree def generate_xml(filename, url_list):
"""Generate a new xml file use url_list"""
root = etree.Element('urlset',
xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
for each in url_list:
url = etree.Element('url')
loc = etree.Element('loc')
loc.text = each
url.append(loc)
root.append(url) header = u'<? xml version="1.0" encoding="UTF-8"? >\n'
s = etree.tostring(root, encoding='utf-8', pretty_print=True)
with io.open(filename, 'w', encoding='utf-8') as f:
f.write(unicode(header+s)) def update_xml(filename, url_list):
"""Add new url_list to origin xml file."""
f = open(filename, 'r')
lines = [i.strip() for i in f.readlines()]
f.close() old_url_list = []
for each_line in lines:
d = re.findall('<loc>(http:\/\/.+)<\/loc>', each_line)
old_url_list += d
url_list += old_url_list generate_xml(filename, url_list) def generatr_xml_index(filename, sitemap_list, lastmod_list):
"""Generate sitemap index xml file."""
root = etree.Element('sitemapindex',
xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
for each_sitemap, each_lastmod in zip(sitemap_list, lastmod_list):
sitemap = etree.Element('sitemap')
loc = etree.Element('loc')
loc.text = each_sitemap
lastmod = etree.Element('lastmod')
lastmod.text = each_lastmod
sitemap.append(loc)
sitemap.append(lastmod)
root.append(sitemap) header = u'<? xml version="1.0" encoding="UTF-8"? >\n'
s = etree.tostring(root, encoding='utf-8', pretty_print=True)
with io.open(filename, 'w', encoding='utf-8') as f:
f.write(unicode(header+s)) if __name__ == '__main__':
urls = ['http://www.baidu.com'] * 10
mods = ['2004-10-01T18:23:17+00:00'] * 10
generatr_xml_index('index.xml', urls, mods)

效果

生成的效果应该是这样的格式:

sitemap格式:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>http://www.example.com/foo.html</loc>
</url>
</urlset>

sitemapindex格式:

<?

xml version="1.0" encoding="UTF-8"?

>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc>http://www.example.com/sitemap1.xml.gz</loc>
<lastmod>2004-10-01T18:23:17+00:00</lastmod>
</sitemap>
<sitemap>
<loc>http://www.example.com/sitemap2.xml.gz</loc>
<lastmod>2005-01-01</lastmod>
</sitemap>
</sitemapindex>

lastmod时间格式的问题

格式是用ISO 8601的标准,假设是linux/unix系统,能够使用下面函数获取

def get_lastmod_time(filename):
time_stamp = os.path.getmtime(filename)
t = time.localtime(time_stamp)
return time.strftime('%Y-%m-%dT%H:%M:%S+08:00', t)

Ref:

创建站点地图

sitemap

lxml

谷歌(Google)站点地图的XML文档格式说明

Google Sitemap文件格式

最新文章

  1. realmswift的使用
  2. ARP协议
  3. myeclipse-建立webservice服务端和客户端
  4. Oracle连接字符串C#
  5. MSP430F5438内部延时函数的用法
  6. Power Designer 使用技巧总结
  7. DAS 原文出自【比特网】
  8. Leveraging the Power of Asynchrony in ASP.NET [转]
  9. css直接写出小三角
  10. poj 1141 Brackets Sequence(区间DP)
  11. PHP导出生成CSV文件
  12. Spring_xml和注解混合方式开发
  13. atitit r9 doc on home ntpc .docx
  14. Lucene的基本使用
  15. iOS-原生纯代码约束总结(一)之 AutoresizingMask
  16. 用宏实现C/C++从非零整数开始的数组
  17. 从头认识java-18.2 主要的线程机制(4)-优先级
  18. SIP 认证方式
  19. 【SSH网上商城项目实战19】订单信息的级联入库以及页面的缓存问题
  20. Uoj 441 保卫王国

热门文章

  1. 基于设备树的TQ2440 DMA学习(1)—— 芯片手册
  2. 总结ASP.NET MVC视图页使用jQuery传递异步数据的几种方式
  3. 在ASP.NET MVC中实现登录后回到原先的界面
  4. 委托、Lambda表达式、事件系列04,委托链是怎样形成的, 多播委托, 调用委托链方法,委托链异常处理
  5. java List转换为字符串并加入分隔符的一些方法总结
  6. Tomcat集群下获取memcached缓存对象数量,统计在线用户数据量
  7. java如何直接返回excel到客户端
  8. Excel 2016 Power View选项卡不显示的问题
  9. Chrome浏览器导出pdf时,隐藏链接HREF
  10. Newtonsoft.Json高级用法DataContractJsonSerializer,JavaScriptSerializer 和 Json.NET即Newtonsoft.Json datatable,dataset,modle,序列化