Problem: cleaning up text such as pýtĥöñis that someone has entered into a form on a web page.

Solution: the str.translate() method

s = 'p\xfdt\u0125\xf6\xf1\x0cis\tawesome\r\n'
print(s)

# (a) Remapping whitespace: build a small translation table, then call translate()
remap = {
    ord('\t'): ' ',
    ord('\f'): ' ',
    ord('\r'): None,  # Deleted
}
a = s.translate(remap)
print('whitespace remapped:', a)
print('------------------------------')

As you can see, whitespace characters such as \t and \f have each been remapped to a single space. The carriage return \r has been deleted entirely.
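As a minimal sketch of the same step packaged for reuse, the table can also be built with str.maketrans(), which accepts single characters as keys so the ord() calls are not needed. The names WHITESPACE_TABLE and remap_whitespace are hypothetical and chosen only for this illustration.

```python
# Hypothetical names for illustration; the mapping mirrors the remap table above.
WHITESPACE_TABLE = str.maketrans({
    '\t': ' ',   # tab -> single space
    '\f': ' ',   # form feed -> single space
    '\r': None,  # carriage return -> deleted
})


def remap_whitespace(text):
    """Turn tabs and form feeds into spaces and drop carriage returns."""
    return text.translate(WHITESPACE_TABLE)


s = 'p\xfdt\u0125\xf6\xf1\x0cis\tawesome\r\n'
print(repr(remap_whitespace(s)))
```

Printing the repr() makes the remaining control characters visible, which is easier to check than the raw output.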

The remapping idea can be taken further to build much larger translation tables. For example, we can strip out all Unicode combining characters:

import sys
import unicodedata

s = 'p\xfdt\u0125\xf6\xf1\x0cis\tawesome\r\n'
print(s)

# (a) Remapping whitespace
# ord(c) returns the Unicode code point of the single character c, e.g. ord('a') -> 97
remap = {
    ord('\t'): ' ',
    ord('\f'): ' ',
    ord('\r'): None,  # Deleted
}
a = s.translate(remap)

# (b) Remove all combining characters/marks
# dict.fromkeys() builds a dictionary that maps every Unicode combining character to None.
cmb_chrs = dict.fromkeys(c for c in range(sys.maxunicode)
                         if unicodedata.combining(chr(c)))
b = unicodedata.normalize('NFD', a)  # Convert the input to its decomposed form
c = b.translate(cmb_chrs)            # Delete all of the accents
print('accents removed:', c)

Output:

pýtĥöñis awesome
accents removed: python is awesome
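The same table-building approach works for other character classes. As a sketch, the snippet below builds a table that maps every Unicode decimal digit character to its ASCII equivalent. The name digitmap is only an illustrative choice; the code relies on unicodedata.category() and unicodedata.digit() from the standard library.

```python
import sys
import unicodedata

# Map every Unicode decimal digit (general category 'Nd') to the matching ASCII digit.
digitmap = {
    c: ord('0') + unicodedata.digit(chr(c))
    for c in range(sys.maxunicode + 1)
    if unicodedata.category(chr(c)) == 'Nd'
}

x = '\u0661\u0662\u0663'      # Arabic-Indic digits one, two, three
print(x.translate(digitmap))  # prints 123
```

Building the table scans the whole code point range, so it is worth constructing once and reusing rather than rebuilding on every call.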
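Another technique for cleaning up text involves I/O decoding and encoding functions.
The general idea: first do some preliminary cleanup of the text, then run it through a combination of encode() and decode() operations to strip or alter it.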
import unicodedata

s = 'p\xfdt\u0125\xf6\xf1\x0cis\tawesome\r\n'
print(s)

# (a) Remapping whitespace
remap = {
    ord('\t'): ' ',
    ord('\f'): ' ',
    ord('\r'): None,  # Deleted
}
a = s.translate(remap)

# (b) Decompose accented characters into base characters plus combining marks
b = unicodedata.normalize('NFD', a)

# (c) Accent removal using I/O decoding
d = b.encode('ascii', 'ignore').decode('ascii')
print('accents removed via I/O:', d)

Output:

pýtĥöñis awesome
accents removed via I/O: python is awesome

Obviously, this approach is only useful when the final goal is plain ASCII text.
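To make that limitation concrete, here is a minimal sketch that wraps the normalize/encode/decode steps in a function (to_ascii is a hypothetical name used only here). Characters that have no canonical decomposition, such as 'ø', are not reduced to a base letter; they are simply dropped.

```python
import unicodedata


def to_ascii(text):
    """Decompose accented characters, then discard everything that is not ASCII."""
    decomposed = unicodedata.normalize('NFD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')


print(to_ascii('p\xfdt\u0125\xf6\xf1'))  # prints python
print(to_ascii('sm\xf8rrebr\xf8d'))      # prints smrrebrd ('ø' is dropped)
```

Because the 'ignore' error handler silently discards anything it cannot encode, this is a lossy operation and should only be applied when losing non-ASCII content is acceptable.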

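Additional notes:

A major concern when sanitizing and filtering text is runtime performance. For simple operations, str.replace() is usually the fastest approach, even if it has to be called several times. For example, to clean up whitespace, you could write: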

def clean_spaces(s):
    s = s.replace('\r', '')
    s = s.replace('\t', ' ')
    s = s.replace('\f', ' ')
    return s

If you need more advanced operations, such as character-to-character remapping or deletion, the translate() method is still quite fast.
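Performance claims like these are best checked on your own data. The following is a rough sketch using the standard timeit module to compare the two approaches; the helper clean_spaces_translate, the REMAP table name, and the sample input are assumptions made for this illustration, and the timings will vary by machine and Python version.

```python
import timeit


def clean_spaces(s):
    """Whitespace cleanup using repeated str.replace() calls."""
    s = s.replace('\r', '')
    s = s.replace('\t', ' ')
    s = s.replace('\f', ' ')
    return s


# Hypothetical translate()-based equivalent, for comparison only.
REMAP = {ord('\t'): ' ', ord('\f'): ' ', ord('\r'): None}


def clean_spaces_translate(s):
    """Whitespace cleanup using a single str.translate() call."""
    return s.translate(REMAP)


sample = 'is\tawesome\r\n\x0c' * 1000  # arbitrary test input

# Both versions must agree before their timings are worth comparing.
assert clean_spaces(sample) == clean_spaces_translate(sample)

for func in (clean_spaces, clean_spaces_translate):
    elapsed = timeit.timeit(lambda: func(sample), number=200)
    print(f'{func.__name__}: {elapsed:.4f} s')
```

No expected timings are shown here on purpose; run it against text that resembles your real input before choosing one approach.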
