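On larbin's Hash Table

For work reasons, I have decided to walk through the larbin source code.

I am using the 2.6.3 release of larbin. This is a spare-time effort, so the analysis will be posted in installments and doubles as my working notes.

Today we look at larbin's hash table.

The structure is fairly simple. Its main job is deduplication, so it only provides the few functions that deduplication needs.

Here is how the header file defines it: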

// Larbin
// Sebastien Ailleret
// 23-11-99 -> 14-01-00

/* class hashTable
 * This class is in charge of making sure we don't crawl twice the same url
 */

#ifndef HASHTABLE_H
#define HASHTABLE_H

#include "types.h"
#include "utils/url.h"

class hashTable {
 private:
  ssize_t size;
  char *table;

 public:
  /* constructor */
  hashTable (bool create);

  /* destructor */
  ~hashTable ();

  /* save the hashTable in a file */
  void save();

  /* test if this url is allready in the hashtable
   * return true if it has been added
   * return false if it has allready been seen
   */
  bool test (url *U);

  /* set a url as present in the hashtable
   */
  void set (url *U);

  /* add a new url in the hashtable
   * return true if it has been added
   * return false if it has allready been seen
   */
  bool testSet (url *U);
};

#endif // HASHTABLE_H

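From the header we can see that, apart from the constructor and destructor, the hash table has only four member functions.

save writes the contents of the table to a file, so that the data is not lost if the program exits abnormally.

test checks whether the URL passed in is already present in the table; this is the core of deduplication. Judging from the implementation, it returns true when the URL is already present, so the return-value comment above test in the header looks as if it was copied from testSet.

set marks the position that corresponds to the URL, so that from then on the URL counts as present in the table.

testSet is a convenience function: it tests first, then sets the bit, and returns true if the URL was not present before the call (false if it had already been seen).

Now let's look at each implementation in detail. They are simple, so I have put short comments in the code instead of explaining them at length.

Constructor: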

hashTable::hashTable (bool create) {  // constructor: allocate the table and initialize it
  ssize_t total = hashSize/8;   // the table is a bit set, 8 bits per byte, so the bit count is divided by 8
  table = new char[total];      // allocate the table; a plain array used as a bitmap plays the role of the hash table
  if (create) {                 // flag: start from an all-zero table, or reload the table from a file
    for (ssize_t i=0; i<hashSize/8; i++) {
      table[i] = 0;
    }
  } else {
    int fds = open("hashtable.bak", O_RDONLY);   // read the data from the .bak backup file
    if (fds < 0) {
      cerr << "Cannot find hashtable.bak, restart from scratch\n";
      for (ssize_t i=0; i<hashSize/8; i++) {     // the backup could not be opened: zero the table as on a first run
        table[i] = 0;
      }
    } else {
      ssize_t sr = 0;
      while (sr < total) {
        ssize_t tmp = read(fds, table+sr, total-sr); // read may return fewer bytes than asked for, so loop until all data is in
        if (tmp <= 0) {
          cerr << "Cannot read hashtable.bak : "
               << strerror(errno) << endl;
          exit(1);
        } else {
          sr += tmp;        // advance by the number of bytes just read
        }
      }
      close(fds);          // close the file descriptor
    }
  }
}
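
The read loop in the constructor exists because a single read call may deliver fewer bytes than requested. As a rough sketch, the same pattern can be isolated in a small helper. The name readFully is hypothetical and not part of larbin, and unlike the constructor it reports failure to the caller instead of calling exit:

```cpp
#include <cassert>
#include <cstddef>
#include <sys/types.h>  // ssize_t
#include <unistd.h>     // read

// Hypothetical helper (not part of larbin): keep calling read() until
// `total` bytes have arrived. Returns false on error or premature EOF.
bool readFully(int fd, char *buf, size_t total) {
  assert(buf != nullptr);
  size_t done = 0;
  while (done < total) {
    ssize_t n = read(fd, buf + done, total - done);
    if (n <= 0) return false;          // -1 is an error, 0 is end of file
    done += static_cast<size_t>(n);    // short read: ask again for the rest
  }
  return true;
}
```

With such a helper, reloading the table would come down to something like readFully(fds, table, total) followed by an error check.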

Destructor:

hashTable::~hashTable () {    // destructor: release the memory allocated for the table
  delete [] table;
}

The test function:

bool hashTable::test (url *U) {    // true if the bit for this url is already set, false otherwise
  unsigned int code = U->hashCode();  // compute the hash value with hashCode
  unsigned int pos = code / 8;        // index of the byte that holds the bit
  unsigned int bits = 1 << (code % 8);  // mask selecting the bit inside that byte
  return table[pos] & bits;
}

The set function:

void hashTable::set (url *U) {   // set the bit that corresponds to this url
  unsigned int code = U->hashCode();
  unsigned int pos = code / 8;
  unsigned int bits = 1 << (code % 8);
  table[pos] |= bits;
}

The testSet function:

bool hashTable::testSet (url *U) { // set the bit for this url and return true if it was not set before
  unsigned int code = U->hashCode();
  unsigned int pos = code / 8;
  unsigned int bits = 1 << (code % 8);
  int res = table[pos] & bits;
  table[pos] |= bits;
  return !res;
}
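
All three functions share the same arithmetic: the byte index is code / 8 and the mask is 1 << (code % 8). The following is a minimal sketch of that bitmap on its own, taking a hash code directly instead of a url. The name BitTable and the use of std::vector are assumptions of this sketch, not larbin's code:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative stand-in for larbin's table: one bit per hash code.
struct BitTable {
  std::vector<unsigned char> bytes;

  // Round up so that every bit index below nbits has a byte to live in.
  explicit BitTable(size_t nbits) : bytes((nbits + 7) / 8, 0) {}

  // True if the bit for this code is already set.
  bool test(unsigned int code) const {
    assert(code / 8 < bytes.size());
    return (bytes[code / 8] & (1u << (code % 8))) != 0;
  }

  // Set the bit and report whether the code was new (true only the first time).
  bool testSet(unsigned int code) {
    bool seen = test(code);
    bytes[code / 8] |= static_cast<unsigned char>(1u << (code % 8));
    return !seen;
  }
};
```

For example, code 13 lands in byte 1 under mask 0x20. Because only one bit is kept per hash code, two different URLs with the same hash code cannot be told apart: the second one is reported as already seen.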

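The save function:

void hashTable::save() {     // write the table to a file step by step, so that a failure part-way through does not leave the backup badly damaged
  rename("hashtable.bak", "hashtable.old");   // keep the previous backup aside; it is deleted at the end, once this backup has been written
  int fds = creat("hashtable.bak", 00600);    // new backup file, readable and writable by the owner only
  if (fds >= 0) {
    ecrireBuff(fds, table, hashSize/8);       // helper that writes the table data to the backup file
    close(fds);
  }
  unlink("hashtable.old");                    // note: this runs even when creat failed
}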
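
For comparison, here is a sketch of a different way to get a safe backup: write to a temporary file and rename it over the old backup only when every byte has been written. This is not what larbin does. The names writeAll and saveAtomically, and the ".tmp" suffix, are made up for the illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>       // std::rename, std::remove
#include <string>
#include <fcntl.h>      // open
#include <sys/types.h>  // ssize_t
#include <unistd.h>     // write, close

// Hypothetical helper: write the whole buffer, retrying after short writes.
static bool writeAll(int fd, const char *buf, size_t len) {
  size_t done = 0;
  while (done < len) {
    ssize_t n = write(fd, buf + done, len - done);
    if (n <= 0) return false;
    done += static_cast<size_t>(n);
  }
  return true;
}

// Write to "<path>.tmp" first; replace <path> only if the write succeeded,
// so the previous backup survives any failure.
bool saveAtomically(const std::string &path, const char *buf, size_t len) {
  assert(buf != nullptr);
  std::string tmp = path + ".tmp";
  int fd = open(tmp.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0600);
  if (fd < 0) return false;
  bool ok = writeAll(fd, buf, len);
  ok = (close(fd) == 0) && ok;
  if (!ok) {
    std::remove(tmp.c_str());   // drop the incomplete file
    return false;
  }
  return std::rename(tmp.c_str(), path.c_str()) == 0;
}
```

The trade-off is the same as in save: both versions need room for two copies of the table on disk while the backup is being written.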

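That is all there is to the hash table itself. Two related topics deserve a closer look.

1. The hash function hashCode:

uint url::hashCode () {
  unsigned int h=port;          // start from the port number
  unsigned int i=0;
  while (host[i] != 0) {        // fold in every character of the host
    h = 31*h + host[i];
    i++;
  }
  i=0;
  while (file[i] != 0) {        // then every character of the file path
    h = 31*h + file[i];
    i++;
  }
  return h % hashSize;          // reduce to a bit index in the table
}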

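Designing a hash function is something of an art, and no hash function suits every use.

You normally tailor the function to your own table. At the very least you choose your own constants while reusing a well-known scheme.

This one is fairly simple: starting from the port, it goes through host and then file (two fields of the url class holding the host name and the file path), multiplying the running value by 31 before adding each character,

and finally takes the remainder modulo the table size, which is defined as:

#define hashSize 64000000

The order in which host and file are hashed also appears to be deliberate: according to this reading, hashing host first and file second is meant to spread out the files that belong to one host and so reduce collisions between similar URLs.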
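
To make the arithmetic concrete, here is a standalone sketch of the same polynomial hash. The function name urlHash, the constant kHashSize and the std::string parameters are assumptions of this sketch (larbin works on char* members), and characters are read as unsigned char, which gives the same result as larbin's code for plain ASCII:

```cpp
#include <cassert>
#include <string>

// Mirrors the hashSize macro quoted above.
const unsigned int kHashSize = 64000000;

// h = port, then h = 31*h + c for each character of host and file,
// finally reduced modulo the table size.
unsigned int urlHash(unsigned int port, const std::string &host,
                     const std::string &file) {
  unsigned int h = port;
  for (unsigned char c : host) h = 31 * h + c;  // unsigned overflow wraps modulo 2^32
  for (unsigned char c : file) h = 31 * h + c;
  unsigned int code = h % kHashSize;
  assert(code < kHashSize);
  return code;
}
```

For port 80, host "a" and file "/", the steps are 80, then 31*80 + 97 = 2577, then 31*2577 + 47 = 79934. With hashSize at 64000000 bits, the table itself occupies 64000000 / 8 = 8000000 bytes.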

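2. Now let's turn to the url class. The hash function above is one of its member functions; I treated it separately because hash functions are an important topic in their own right.

Here is the definition of the url class: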

class url {
 private:
  char *host;
  char *file;
  uint16_t port; // the order of variables is important for physical size
  int8_t depth;
  /* parse the url */
  void parse (char *s);
  /** parse a file with base */
  void parseWithBase (char *u, url *base);
  /* normalize file name */
  bool normalize (char *file);
  /* Does this url starts with a protocol name */
  bool isProtocol (char *s);
  /* constructor used by giveBase */
  url (char *host, uint port, char *file);

 public:
  /* Constructor : Parses an url (u is deleted) */
  url (char *u, int8_t depth, url *base);

  /* constructor used by input */
  url (char *line, int8_t depth);

  /* Constructor : read the url from a file (cf serialize) */
  url (char *line);

  /* Destructor */
  ~url ();

  /* inet addr (once calculated) */
  struct in_addr addr;

  /* Is it a valid url ? */
  bool isValid ();

  /* print an URL */
  void print ();

  /* return the host */
  inline char *getHost () { return host; }

  /* return the port */
  inline uint getPort () { return port; }

  /* return the file */
  inline char *getFile () { return file; }

  /** Depth in the Site */
  inline int8_t getDepth () { return depth; }

  /* Set depth to max if we are at an entry point in the site
   * try to find the ip addr
   * answer false if forbidden by robots.txt, true otherwise */
  bool initOK (url *from);

  /** return the base of the url
   * give means that you have to delete the string yourself
   */
  url *giveBase ();

  /** return a char * representation of the url
   * give means that you have to delete the string yourself
   */
  char *giveUrl ();

  /** write the url in a buffer
   * buf must be at least of size maxUrlSize
   * returns the size of what has been written (not including '\0')
   */
  int writeUrl (char *buf);

  /* serialize the url for the Persistent Fifo */
  char *serialize ();

  /* very thread unsafe serialisation in a static buffer */
  char *getUrl();

  /* return a hashcode for the host of this url */
  uint hostHashCode ();

  /* return a hashcode for this url */
  uint hashCode ();

#ifdef URL_TAGS
  /* tag associated to this url */
  uint tag;
#endif // URL_TAGS

#ifdef COOKIES
  /* cookies associated with this page */
  char *cookie;
  void addCookie(char *header);
#else // COOKIES
  inline void addCookie(char *header) {}
#endif // COOKIES
};

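The declaration of the url class is fairly large, because everything that deals with URLs is gathered in this one place.

Let's go through its implementation piece by piece.

First, one of the constructors: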

url::url (char *u, int8_t depth, url *base) {
  newUrl();
  this->depth = depth;
  host = NULL;
  port = 80;                    // default HTTP port
  file = NULL;
  initCookie();
#ifdef URL_TAGS
  tag = 0;
#endif // URL_TAGS
  if (startWith("http://", u)) {
    // absolute url
    parse (u + 7);              // skip the 7 characters of "http://"
    // normalize file name
    if (file != NULL && !normalize(file)) {
      delete [] file;
      file = NULL;
      delete [] host;
      host = NULL;
    }
  } else if (base != NULL) {
    if (startWith("http:", u)) {
      parseWithBase(u+5, base); // skip the 5 characters of "http:"
    } else if (isProtocol(u)) {
      // Unknown protocol (mailto, ftp, news, file, gopher...)
    } else {
      parseWithBase(u, base);
    }
  }
}

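This is one of the constructors of the url class. It takes three arguments:

The first, u, is the URL string.

The second, depth, is the crawl depth.

The third, base, is the base URL, that is, the URL of the page on which the link was found and against which u is resolved. Its role is similar to that of a referer.

Although this is only a constructor, it was designed to make the class convenient to use, so it performs a whole series of steps: constructing the object is enough to get the URL parsed and normalized.

newUrl() is a macro used for debugging:

#define newUrl() debUrl++
#define delUrl() debUrl--

After that, some member variables are initialized.

Next comes a call to initCookie, a macro whose definition depends on a compile-time switch:

#ifdef COOKIES
#define initCookie() cookie=NULL
#else // COOKIES
#define initCookie() ((void) 0)
#endif // COOKIES

In other words, it either sets the cookie member to NULL or expands to an empty statement.

Then, if tags are enabled through the URL_TAGS switch, tag is set to 0.

The constructor then checks whether the string starts with http://, which marks a standard absolute URL.

If it does, the string is parsed starting right after that prefix.

parse is the URL parsing function; it is analyzed further down.

The call to parse extracts host and file (and port, if one is given). Then, if file is not NULL, file is normalized,

meaning that nonstandard forms in the path are rewritten into a canonical form. This is done by normalize; if normalization fails, both file and host are freed and reset to NULL.

If the URL is not absolute and base is NULL, the URL cannot be resolved, so it is left unprocessed and the constructor returns.

Otherwise the URL can be built from the base:

if it starts with http:, parseWithBase is called on the part after that prefix;

otherwise, if it uses a protocol that is not recognized, it cannot be parsed;

in all remaining cases parseWithBase is called on the whole string.

That completes the constructor. The overall flow is clear, but several of the functions it calls deserve a closer look.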
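
The branching in the constructor can be summarized as a small classification function. This is only a sketch: UrlKind, classify and looksLikeProtocol are hypothetical names, and the body of looksLikeProtocol (alphanumeric characters followed by ':') is an assumption about what url::isProtocol checks, since its source is not shown here:

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <cstring>

enum class UrlKind { Absolute, HttpRelative, OtherProtocol, Relative, Dropped };

// Same idea as startWith: does s begin with prefix?
static bool startsWith(const char *prefix, const char *s) {
  return std::strncmp(prefix, s, std::strlen(prefix)) == 0;
}

// Assumed stand-in for url::isProtocol: alphanumerics followed by ':'.
static bool looksLikeProtocol(const char *s) {
  size_t i = 0;
  while (std::isalnum(static_cast<unsigned char>(s[i]))) i++;
  return s[i] == ':';
}

// Which branch of the constructor would handle u?
UrlKind classify(const char *u, bool hasBase) {
  assert(u != nullptr);
  if (startsWith("http://", u)) return UrlKind::Absolute;      // parse(u + 7)
  if (!hasBase) return UrlKind::Dropped;                       // nothing to resolve against
  if (startsWith("http:", u)) return UrlKind::HttpRelative;    // parseWithBase(u + 5, base)
  if (looksLikeProtocol(u)) return UrlKind::OtherProtocol;     // mailto, ftp, ...
  return UrlKind::Relative;                                    // parseWithBase(u, base)
}
```

Under these rules a link such as mailto:someone@example.com ends up in the unknown-protocol branch and produces a url whose host and file stay NULL.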

1. The file normalization function normalize (the code is listed here under the name fileNormalize):

bool fileNormalize (char *file) {
  int i=0;
  while (file[i] != 0 && file[i] != '#') {
    if (file[i] == '/') {               // current character is '/': branch on what follows it
      if (file[i+1] == '.' && file[i+2] == '/') {   // "/./" found: canonicalize by dropping two characters,
        // suppress /./                             // shifting the rest of the string to the left
        int j=i+3;
        while (file[j] != 0) {
          file[j-2] = file[j];
          j++;
        }
        file[j-2] = 0;
      } else if (file[i+1] == '/') {              // "//" found: replace it with a single '/'
        // replace // by /
        int j=i+2;
        while (file[j] != 0) {
          file[j-1] = file[j];
          j++;
        }
        file[j-1] = 0;
      } else if (file[i+1] == '.' && file[i+2] == '.' && file[i+3] == '/') { // "/../" means the parent directory:
        // suppress /../                             // back up to the previous '/' and overwrite the removed segment
        if (i == 0) {
          // the file name starts with /../ : error  // at the very beginning this is treated as an error
          return false;
        } else {
          int j = i+4, dec;
          i--;
          while (file[i] != '/') { i--; }
          dec = i+1-j; // dec < 0
          while (file[j] != 0) {
            file[j+dec] = file[j];
            j++;
          }
          file[j+dec] = 0;
        }
      } else if (file[i+1] == '.' && file[i+2] == 0) {  // "/." at the end of the string: current directory, truncate and finish
        // suppress /.
        file[i+1] = 0;
        return true;
      } else if (file[i+1] == '.' && file[i+2] == '.' && file[i+3] == 0) {
        // suppress /..                        // "/.." at the end of the string
        if (i == 0) {
          // the file name starts with /.. : error    // at the very beginning this is treated as an error
          return false;
        } else {          // inside the path it means the parent directory: back up and truncate there
          i--;
          while (file[i] != '/') {
            i--;
          }
          file[i+1] = 0;
          return true;
        }
      } else { // nothing special, go forward   // any other, normal case: keep going
        i++;
      }
    } else if (file[i] == '%') {   // characters not allowed in a URL are escaped as '%' followed by two hex digits
      int v1 = int_of_hexa(file[i+1]);
      int v2 = int_of_hexa(file[i+2]);
      if (v1 < 0 || v2 < 0) return false;
      char c = 16 * v1 + v2;   // decode the escaped character
      if (isgraph(c)) {   // printable character: store it decoded and shift the rest left by two
        file[i] = c;
        int j = i+3;
        while (file[j] != 0) {
          file[j-2] = file[j];
          j++;
        }
        file[j-2] = 0;
        i++;
      } else if (c == ' ' || c == '/') { // keep it with the % notation // left escaped and skipped (in practice only the space gets here, since isgraph('/') is already true above)
        i += 3;
      } else { // bad url   // anything else is treated as an error
        return false;
      }
    } else { // nothing special, go forward  // normal flow: keep going
      i++;
    }
  }
  file[i] = 0;
  return true;
}
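
The dot-segment rules are easier to see without the in-place shifting. The sketch below applies the same rules for "/./", "//", "/../" and a trailing "/." or "/.." by splitting the path into segments and keeping a stack of them. It is a different technique from larbin's in-place rewrite, the name normalizePath is made up, and it deliberately ignores '#' fragments and '%' escapes:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Segment-stack variant of the dot-segment handling (not larbin's algorithm).
// Returns false when ".." would climb above the root.
bool normalizePath(const std::string &in, std::string &out) {
  assert(!in.empty() && in[0] == '/');
  std::vector<std::string> kept;
  bool endsWithDir = false;
  size_t start = 1;
  while (start <= in.size()) {
    size_t end = in.find('/', start);
    if (end == std::string::npos) end = in.size();
    std::string seg = in.substr(start, end - start);
    // A final "", "." or ".." means the result names a directory.
    endsWithDir = (end == in.size()) && (seg.empty() || seg == "." || seg == "..");
    if (seg == "..") {
      if (kept.empty()) return false;   // "/../" at the start: error
      kept.pop_back();                  // go back to the parent directory
    } else if (!seg.empty() && seg != ".") {
      kept.push_back(seg);              // ordinary segment
    }                                   // "" (from "//") and "." are dropped
    start = end + 1;
  }
  out = "/";
  for (size_t k = 0; k < kept.size(); k++) {
    out += kept[k];
    if (k + 1 < kept.size() || endsWithDir) out += '/';
  }
  return true;
}
```

For instance, "/a/b/../c" becomes "/a/c", "/a/./b//c" becomes "/a/b/c", and "/../x" is rejected.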

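This is a helper function whose main job is to rewrite path forms that would otherwise be misread; the comments in the code explain each case.

It relies on one helper function:

static int int_of_hexa (char c) {  // convert one hexadecimal digit to its numeric value
  if (c >= '0' && c <= '9')
    return (c - '0');
  else if (c >= 'a' && c <= 'f')
    return (c - 'a' + 10);
  else if (c >= 'A' && c <= 'F')
    return (c - 'A' + 10);
  else
    return -1;
}

This simply converts a single hexadecimal digit character to its numeric value and returns -1 for anything that is not a hex digit.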
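
Putting the hex conversion together with the three outcomes of the '%' branch (decode, keep escaped, reject) gives roughly the following. The names hexValue, Escape and classifyEscape are hypothetical; the sketch only classifies one escape and leaves the in-place string shifting out:

```cpp
#include <cassert>
#include <cctype>

// Same conversion as int_of_hexa: one hex digit to its value, -1 otherwise.
static int hexValue(char c) {
  if (c >= '0' && c <= '9') return c - '0';
  if (c >= 'a' && c <= 'f') return c - 'a' + 10;
  if (c >= 'A' && c <= 'F') return c - 'A' + 10;
  return -1;
}

enum class Escape { Decode, Keep, Reject };

// Classify the escape starting at p (p[0] must be '%').
// `decoded` receives the character when the result is Decode.
Escape classifyEscape(const char *p, char &decoded) {
  assert(p != nullptr && p[0] == '%');
  int hi = hexValue(p[1]);
  if (hi < 0) return Escape::Reject;   // also catches the end of the string
  int lo = hexValue(p[2]);
  if (lo < 0) return Escape::Reject;
  int v = 16 * hi + lo;                // value of the two hex digits
  if (std::isgraph(v)) {               // printable: replace "%XX" by the character
    decoded = static_cast<char>(v);
    return Escape::Decode;
  }
  if (v == ' ') return Escape::Keep;   // space stays written as %20
  return Escape::Reject;               // control characters and the like
}
```

So "%41" decodes to 'A', "%20" is kept as it is, and "%0A" makes the whole URL invalid.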

2. A small helper that needs little explanation: it returns true if string b starts with string a.

bool startWith (char *a, char *b) {
  int i=0;
  while (a[i] != 0) {
    if (a[i] != b[i]) return false;
    i++;
  }
  return true;
}

3. The url constructor also relies on two heavyweight functions, parse and parseWithBase.

Let's look at parse first:

void url::parse (char *arg) {
  int deb = 0, fin = deb;
  // Find the end of host name (put it into lowerCase)
  while (arg[fin] != '/' && arg[fin] != ':' && arg[fin] != 0) {
    fin++;
  }
  if (fin == 0) return;
  // get host name
  host = new char[fin+1];
  for (int  i=0; i<fin; i++) {
    host[i] = lowerCase(arg[i]);
  }
  host[fin] = 0;
  // get port number
  if (arg[fin] == ':') {
    port = 0;
    fin++;
    while (arg[fin] >= '0' && arg[fin] <= '9') {
      port = port*10 + arg[fin]-'0';
      fin++;
    }
  }
  // get file name
  if (arg[fin] != '/') {
    // www.inria.fr => add the final /
    file = newString("/");
  } else {
    file = newString(arg + fin);
  }
}

This code is a pleasure to read and can be understood at a glance: it is plain string parsing

that extracts host, port and file.
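
As an illustration, the same split can be written with std::string. The struct ParsedUrl and the function splitHostPortFile are hypothetical names; the input is assumed to be the URL with the leading "http://" already removed, as in the call parse(u + 7):

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

struct ParsedUrl {
  std::string host;
  unsigned port = 80;   // default when no ":port" is present
  std::string file;
};

// Split "host[:port][/file]" into its parts. Returns false for an empty host.
bool splitHostPortFile(const std::string &rest, ParsedUrl &out) {
  size_t fin = rest.find_first_of("/:");           // end of the host name
  if (fin == std::string::npos) fin = rest.size();
  if (fin == 0) return false;
  out = ParsedUrl();
  for (size_t i = 0; i < fin; i++)                 // host is stored in lower case
    out.host += static_cast<char>(std::tolower(static_cast<unsigned char>(rest[i])));
  if (fin < rest.size() && rest[fin] == ':') {     // explicit port number
    out.port = 0;
    fin++;
    while (fin < rest.size() && std::isdigit(static_cast<unsigned char>(rest[fin]))) {
      out.port = out.port * 10 + static_cast<unsigned>(rest[fin] - '0');
      fin++;
    }
  }
  // No path given (e.g. "www.inria.fr"): use "/".
  out.file = (fin < rest.size() && rest[fin] == '/') ? rest.substr(fin) : std::string("/");
  assert(!out.host.empty());
  return true;
}
```

For "WWW.Inria.FR:8080/a/b.html" this yields host "www.inria.fr", port 8080 and file "/a/b.html"; for "www.inria.fr" it yields port 80 and file "/".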

4. The other parsing function, parseWithBase:

void url::parseWithBase (char *u, url *base) {
  // cat filebase and file
  if (u[0] == '/') {
    file = newString(u);
  } else {
    uint lenb = strlen(base->file);
    char *tmp = new char[lenb + strlen(u) + 1];
    memcpy(tmp, base->file, lenb);
    strcpy(tmp + lenb, u);
    file = tmp;
  }
  if (!normalize(file)) {
    delete [] file;
    file = NULL;
    return;
  }
  host = newString(base->host);
  port = base->port;
}

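This function builds a new URL from an existing url object; the host and port are taken over unchanged.

The base serves as the reference path for the relative path: the base's file and the new path are concatenated, and the result is normalized into a canonical path. Note that the plain concatenation only gives a sensible result if base->file ends with '/', so base is presumably expected to be a directory-style URL such as the one returned by giveBase.

In short, it turns a relative path into an absolute one, which is comparatively simple.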
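
The joining step can be sketched on its own as follows. The name joinWithBase is hypothetical, and the requirement that the base path ends with '/' is an assumption of this sketch:

```cpp
#include <cassert>
#include <string>

// Join a link with the directory part of the page it was found on.
// Assumption: baseDir is a directory path and already ends with '/'.
std::string joinWithBase(const std::string &baseDir, const std::string &u) {
  assert(!baseDir.empty() && baseDir.back() == '/');
  if (!u.empty() && u[0] == '/') return u;   // rooted path: the base path is ignored
  return baseDir + u;                        // relative path: still to be normalized
}
```

Joining "/docs/api/" with "../a.html" gives "/docs/api/../a.html", which the normalization step then reduces to "/docs/a.html".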

That completes the construction of a url.
