2009
08.20

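PageRank Changes Its Calculation Method

The formula used to compute the PageRank checksum changed recently. The old hash value can no longer retrieve the correct PR; the request returns an error message instead.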



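Generally speaking, the programs found online for retrieving PR all compute a URL, send it to Google, and then parse the returned value. The URL looks like this: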

http://www.google.com/search?client=navclient-auto&ch=6-9223293982612316911&features=Rank&q=info:roga.tw

It can be broken down into several parts:

$site = 'http://www.example.com'; /* your URL */
$info = 'info:' . urldecode($site);
$checksum = $this->checksum($this->strord($info));
$url = "http://www.google.com/search?client=navclient-auto&ch=6{$checksum}&features=Rank&q={$info}";

Then fetch $url with curl, fopen, or a similar method.
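
As a rough sketch of this fetch step, the snippet below retrieves the URL with file_get_contents() and pulls the rank out of the response. The function names fetch_rank_response() and parse_rank_response() are made up for illustration, and the response format (a line such as Rank_1:1:5) is an assumption about how the endpoint answered, not something confirmed by this post.

```php
<?php
// Hypothetical helper: fetch the raw response body, or null on failure.
function fetch_rank_response(string $url, int $timeoutSeconds = 5): ?string
{
    $context = stream_context_create([
        'http' => ['timeout' => $timeoutSeconds],
    ]);
    // @ suppresses the warning so that failure is reported via the return value.
    $body = @file_get_contents($url, false, $context);
    return $body === false ? null : $body;
}

// Hypothetical helper: extract the rank from a body such as "Rank_1:1:5".
// Assumes the rank is the number after the last colon; returns null otherwise.
function parse_rank_response(string $body): ?int
{
    if (preg_match('/^Rank_\d+:\d+:(\d+)\s*$/m', $body, $m) === 1) {
        return (int) $m[1];
    }
    return null;
}

var_dump(parse_rank_response("Rank_1:1:5\n"));        // int(5)
var_dump(parse_rank_response("<html>error</html>"));  // NULL
```

Keeping the parsing separate from the network call makes it easy to check the parser against saved responses, and a null result covers both a failed request and an error page.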

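The most important part is the method that computes the checksum hash value. The old method is shown below. Magic Numbers appear in several places: the &ch=6 in the $url string above (followed by $checksum), and the default $init = 0xE6359A60 in the checksum code below.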

    /**
     * Pagerank checksum hash emulator
     */
	function checksum ($url, $length = null, $init = 0xE6359A60)
	{
		if (is_null($length))
		{
			$length = sizeof($url);
		}
		$a = $b = 0x9E3779B9;
		$c = $init;
		$k = 0;
		$len = $length;

		while($len >= 12)
		{
			$a += ( $url[$k+0] + ( $url[$k+1] << 8 ) + ( $url[$k+2] << 16 ) + ( $url[$k+3] << 24 ));
			$b += ( $url[$k+4] + ( $url[$k+5] << 8 ) + ( $url[$k+6] << 16 ) + ( $url[$k+7] << 24 ));
			$c += ( $url[$k+8] + ( $url[$k+9] << 8 ) + ( $url[$k+10] << 16 ) + ( $url[$k+11] << 24 ));
			$mix = $this->mix($a,$b,$c);
			$a = $mix[0]; $b = $mix[1]; $c = $mix[2];
			$k += 12;
			$len -= 12;
		}

		$c += $length;

		switch($len)
		{
			case 11: $c += ($url[$k + 10] << 24);
			case 10: $c += ($url[$k + 9] << 16);
			case 9: $c += ($url[$k + 8] << 8);
			case 8: $b += ($url[$k + 7] << 24);
			case 7: $b += ($url[$k + 6] << 16);
			case 6: $b += ($url[$k + 5] << 8);
			case 5: $b += ($url[$k + 4]);
			case 4: $a += ($url[$k + 3] << 24);
			case 3: $a += ($url[$k + 2] << 16);
			case 2: $a += ($url[$k + 1] << 8);
			case 1: $a += ($url[$k + 0]);
		}

		$mix = $this->mix($a, $b, $c);

		return $mix[2];

    }
    /**
     * Converts number to int 32
     * (Required for pagerank hash)
     */
	function to_int_32 (&$x)
	{
		$z = hexdec('80000000'); // 2^31
		$y = (int) $x;
		if ($y == -$z && $x < -$z)
		{
			$y = (int) ((-1) * $x);
			$y = (-1) * $y;
		}
		$x = $y;
	}

    /**
     * Fills in zeros on a number
     * (Required for pagerank hash)
     */
	function zero_fill ($a, $b)
	{
		$z = hexdec('80000000'); // 2^31, the sign bit of a 32-bit integer
		if ($z & $a)
		{
			$a = ($a >> 1);
			$a &= (~$z);
			$a |= 0x40000000;
			$a = ($a >> ($b - 1));
		}
		else
		{
			$a = ($a >> $b);
		}
		return $a;
	}

    /**
     * Pagerank hash prerequisites
     */
	function mix($a, $b, $c)
	{
		$a -= $b; $a -= $c; $this->to_int_32($a); $a = (int)($a ^ ($this->zero_fill($c,13)));
		$b -= $c; $b -= $a; $this->to_int_32($b); $b = (int)($b ^ ($a<<8));
		$c -= $a; $c -= $b; $this->to_int_32($c); $c = (int)($c ^ ($this->zero_fill($b,13)));
		$a -= $b; $a -= $c; $this->to_int_32($a); $a = (int)($a ^ ($this->zero_fill($c,12)));
		$b -= $c; $b -= $a; $this->to_int_32($b); $b = (int)($b ^ ($a<<16));
  		$c -= $a; $c -= $b; $this->to_int_32($c); $c = (int)($c ^ ($this->zero_fill($b,5)));
		$a -= $b; $a -= $c; $this->to_int_32($a); $a = (int)($a ^ ($this->zero_fill($c,3)));
		$b -= $c; $b -= $a; $this->to_int_32($b); $b = (int)($b ^ ($a<<10));
		$c -= $a; $c -= $b; $this->to_int_32($c); $c = (int)($c ^ ($this->zero_fill($b,15)));
		return array($a,$b,$c);
	}
    /**
     * ASCII conversion of a string
     */
    function strord($string)
    {
    	$result = array();
    	for($i = 0; $i < strlen($string); $i++)
    	{
    		$result[$i] = ord($string[$i]); // the $string{$i} syntax was removed in PHP 8
    	}
    	return $result;
    }

    /**
     * Number formatting for use with pagerank hash
     */
    function format_number ($number='', $divchar = ',', $divat = 3)
    {
    	$decimals = '';
    	$formatted = '';

    	if (strstr($number, '.'))
    	{
    		$pieces = explode('.', $number);
    		$number = $pieces[0];
    		$decimals = '.' . $pieces[1];
    	}
    	else
    	{
    		$number = (string) $number;
    	}

    	if (strlen($number) <= $divat)
    		return $number;

    	$j = 0;

    	for ($i = strlen($number) - 1; $i >= 0; $i--)
    	{
    		if ($j == $divat)
    		{
    			$formatted = $divchar . $formatted;
    			$j = 0;
    		}
    		$formatted = $number[$i] . $formatted;
    		$j++;
    	}
    	return $formatted . $decimals;
    }
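
The to_int_32() and zero_fill() helpers above are there because PHP integers are signed and platform-sized, while this hash relies on 32-bit wraparound and a zero-filling right shift. As a minimal sketch of those two ideas, assuming a 64-bit PHP build (where 0xFFFFFFFF is still an integer), masks are enough. The names wrap_int32() and urshift32() are hypothetical, and this is an illustration, not a drop-in replacement for the code above.

```php
<?php
// Assumes 64-bit PHP, so every intermediate value fits in a native integer.

// Reduce any integer to the signed 32-bit range, wrapping on overflow.
function wrap_int32(int $x): int
{
    $x &= 0xFFFFFFFF;                                 // keep the low 32 bits
    return $x >= 0x80000000 ? $x - 0x100000000 : $x;  // reinterpret as signed
}

// Logical (zero-filling) right shift of a 32-bit value.
function urshift32(int $x, int $bits): int
{
    return ($x & 0xFFFFFFFF) >> $bits;
}

var_dump(wrap_int32(0x7FFFFFFF + 1)); // int(-2147483648)
var_dump(urshift32(-1, 28));          // int(15)
```

PHP's own >> is an arithmetic shift that copies the sign bit, which is why a negative value has to be masked first to get zeros shifted in from the left.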

Later I found a new Google PageRank checksum algorithm at http://wpcn.googlecode.com. It was written for WP (WordPress), but it works after a little rewriting.

$info = urlencode("info:".$site);
$checksum = $this->CheckHash($this->HashURL($site));
$url = "http://www.google.com/search?client=navclient-auto&ch=$checksum&features=Rank&q=$info";

The all-important $checksum is computed as follows:


	//convert a string to a 32-bit integer
	function StrToNum($Str, $Check, $Magic) {
		$Int32Unit = 4294967296;  // 2^32
		$length = strlen($Str);
		for ($i = 0; $i < $length; $i++) {
			$Check *= $Magic;
			//If the float is beyond the boundaries of integer (usually +/- 2.15e+9 = 2^31),
			//  the result of converting to integer is undefined
			//  refer to http://www.php.net/manual/en/language.types.integer.php
			if ($Check >= $Int32Unit) {
				$Check = ($Check - $Int32Unit * (int) ($Check / $Int32Unit));
				//if the check less than -2^31
				$Check = ($Check < -2147483648) ? ($Check + $Int32Unit) : $Check;
			}
			$Check += ord($Str[$i]);
		}
		return $Check;
	}

	//generate a hash for a url
	function HashURL($String) {
		$Check1 = $this->StrToNum($String, 0x1505, 0x21);
		$Check2 = $this->StrToNum($String, 0, 0x1003F);
		$Check1 >>= 2;
		$Check1 = (($Check1 >> 4) & 0x3FFFFC0 ) | ($Check1 & 0x3F);
		$Check1 = (($Check1 >> 4) & 0x3FFC00 ) | ($Check1 & 0x3FF);
		$Check1 = (($Check1 >> 4) & 0x3C000 ) | ($Check1 & 0x3FFF);	

		$T1 = (((($Check1 & 0x3C0) << 4) | ($Check1 & 0x3C)) <<2 ) | ($Check2 & 0xF0F );
		$T2 = (((($Check1 & 0xFFFFC000) << 4) | ($Check1 & 0x3C00)) << 0xA) | ($Check2 & 0xF0F0000 );

		return ($T1 | $T2);
	}

	//generate a checksum for the hash string
	function CheckHash($Hashnum) {
		$CheckByte = 0;
		$Flag = 0;
		$HashStr = sprintf('%u', $Hashnum) ;
		$length = strlen($HashStr);

		for ($i = $length - 1;  $i >= 0;  $i --) {
			$Re = (int) $HashStr[$i];
			if (1 === ($Flag % 2)) {
				$Re += $Re;
				$Re = (int)($Re / 10) + ($Re % 10);
			}
			$CheckByte += $Re;
			$Flag ++;
		}

		$CheckByte %= 10;
		if (0 !== $CheckByte) {
			$CheckByte = 10 - $CheckByte;
			if (1 === ($Flag % 2) ) {
				if (1 === ($CheckByte % 2)) {
					$CheckByte += 9;
				}
				$CheckByte >>= 1;
			}
		}
		return '7'.$CheckByte.$HashStr;
	}
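
Read closely, StrToNum() is a multiply-and-add string hash kept within 32 bits: each step computes Check = Check * Magic + byte. Of the constants HashURL() passes in, 0x1505 and 0x21 are 5381 and 33, the seed and multiplier of the well-known djb2 hash, and 0x1003F is 65599, the multiplier of the sdbm hash. The sketch below shows that structure, assuming 64-bit PHP. The name mult_hash32() is hypothetical, and it wraps with a mask after every step, whereas StrToNum() above wraps only after the multiplication, so treat it as an illustration of the idea and not a replacement.

```php
<?php
// Assumes 64-bit PHP and small multipliers such as the two used by HashURL(),
// so the product below stays far under PHP_INT_MAX.
function mult_hash32(string $str, int $seed, int $multiplier): int
{
    $hash = $seed;
    for ($i = 0, $n = strlen($str); $i < $n; $i++) {
        // Multiply, add the next byte, and keep only the low 32 bits.
        $hash = ($hash * $multiplier + ord($str[$i])) & 0xFFFFFFFF;
    }
    return $hash;
}

var_dump(mult_hash32('ab', 0x1505, 0x21)); // int(5863208)
var_dump(mult_hash32('ab', 0, 0x1003F));   // int(6363201)
```

HashURL() then interleaves bits from the two hash values with the shift-and-mask steps shown above, so both hashes contribute to the final number.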

This code has Magic Numbers too, such as the return value of CheckHash(). As before, fetch $url with curl, fopen, or a similar method.
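
To make that return value a little less opaque: CheckHash() computes a check digit over the decimal digits of the hash in a way that resembles the Luhn checksum (every second digit from the right is doubled and its digits are summed), then prefixes the literal '7' and the check digit to the hash. The sketch below restates that logic as a standalone function for experimenting. The name check_hash_sketch() is hypothetical, and the sample inputs are arbitrary numbers, not real URL hashes.

```php
<?php
// Standalone restatement of the CheckHash() logic; takes the hash as a digit string.
function check_hash_sketch(string $hashDigits): string
{
    $sum = 0;
    $count = 0;
    for ($i = strlen($hashDigits) - 1; $i >= 0; $i--) {
        $digit = (int) $hashDigits[$i];
        if ($count % 2 === 1) {
            // Every second digit from the right: double it, then sum its digits.
            $digit *= 2;
            $digit = intdiv($digit, 10) + $digit % 10;
        }
        $sum += $digit;
        $count++;
    }
    $check = $sum % 10;
    if ($check !== 0) {
        $check = 10 - $check;
        if ($count % 2 === 1) {
            // Extra adjustment applied only when the digit count is odd.
            if ($check % 2 === 1) {
                $check += 9;
            }
            $check >>= 1;
        }
    }
    return '7' . $check . $hashDigits;
}

echo check_hash_sketch('12345'), "\n"; // 7912345
echo check_hash_sketch('1234'), "\n";  // 761234
```

For '12345' the weighted digit sum is 5 + 8 + 3 + 4 + 1 = 21, so the check digit starts as 10 - 1 = 9; because the digit count is odd, it becomes (9 + 9) >> 1 = 9.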

Sigh. I still don't understand why Google doesn't provide an API for retrieving PR.

6 comments so far

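  1. Right? Wouldn't it be nice if they simply offered a little tool where you pass in a URL and it spits back a number? Q_Q

  2. Hi roga, sorry to bother you here. I'm a sitestates user, and since posting on the message board there keeps failing, I'm asking you for help here.

    Since the day before yesterday I haven't been able to log in. After I enter my username and password and press submit, the page looks as if nothing happened: the username and password fields are blank again, waiting for input.

    I tried requesting a password reset and logging in with the new account the system gave me, but the result was the same ><

    PS: is there any way to delete messages I previously left on the message board? It's a small privacy matter @@

    • Hello, please try clearing your browser's cookies and trying again. Thank you :smile:

  3. Hi roga,
    I found out it was a browser problem, because everything worked once I downloaded GreenBrowser.
    If anyone else on Vista with IE can't get in either,
    and clearing cookies doesn't help,
    you can suggest they try downloading this browser.

    Thank you ^^

    • I'm very sorry about that! Thank you ^^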
