html tool

Tuesday, November 18, 2014

Python urllib2 timeout issue

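While monitoring the response of a URL and the resources under it with a crawler, I ran into a problem.
My own exception handling used except urllib2.URLError as e,
written according to the official documentation (https://docs.python.org/2/library/urllib2.html):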
urllib2.urlopen(url[, data][, timeout])
Open the URL url, which can be either a string or a Request object.
    ...... 
    Raises URLError on errors.

But when a timeout occurs, I get the following traceback:
  File "F:\pc\working\test_selenium\sp-python\aya-web\checktime.py", line 15, in check
    self._getvale=self._fun(check_arg)
  File "D:\Python27\lib\urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
  File "D:\Python27\lib\urllib2.py", line 404, in open
    response = self._open(req, data)
  File "D:\Python27\lib\urllib2.py", line 422, in _open
    '_open', req)
  File "D:\Python27\lib\urllib2.py", line 382, in _call_chain
    result = func(*args)
  File "D:\Python27\lib\urllib2.py", line 1214, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "D:\Python27\lib\urllib2.py", line 1187, in do_open
    r = h.getresponse(buffering=True)
  File "D:\Python27\lib\httplib.py", line 1067, in getresponse
    response.begin()
  File "D:\Python27\lib\httplib.py", line 409, in begin
    version, status, reason = self._read_status()
  File "D:\Python27\lib\httplib.py", line 365, in _read_status
    line = self.fp.readline(_MAXLINE + 1)
  File "D:\Python27\lib\socket.py", line 476, in readline
    data = self._sock.recv(self._rbufsize)
socket.error: [Errno 10053]
After a quick search, I found that this is a urllib2 bug, described below:
Source: http://heyman.info/2010/apr/22/python-urllib2-timeout-issue/
------------------Original article----------------------------
I use urllib2 from Python's standard library in quite a few projects. It's quite nice, but the documentation isn't very comprehensive, and it always makes me feel like I'm programming Java once I want to do something more complicated than just open a URL and read the response (e.g. handling redirect responses, reading response headers, etc.).
Anyway, the other day I found, if not a bug, then at least an undocumented issue. Since Python 2.6, urllib2 provides a way to set a timeout, as in the following code, where the timeout is set to 2.5 seconds:
import urllib2

try:
    response = urllib2.urlopen("http://google.com", None, 2.5)
except urllib2.URLError as e:  # qualified name: only the urllib2 module is imported
    print "Oops, timed out?"
If no timeout is specified, the global socket timeout value will be used, which by default is infinite.
The above code will catch almost every timeout, but the problem is that you might still get a timeout raised as a totally different exception:
File "/usr/lib/python2.4/socket.py", line 285, in read
  data = self._sock.recv(recv_size)
File "/usr/lib/python2.4/httplib.py", line 460, in read
  return self._read_chunked(amt)
File "/usr/lib/python2.4/httplib.py", line 495, in _read_chunked
  line = self.fp.readline()
File "/usr/lib/python2.4/socket.py", line 325, in readline
  data = recv(1)
socket.timeout: timed out
The solution is to catch this other exception, raised by Python's socket library, as well:
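import urllib2
import socket

try:
    response = urllib2.urlopen("http://google.com", None, 2.5)
except urllib2.URLError as e:  # qualified name: only the urllib2 module is imported
    print "Oops, timed out?"
except socket.timeout:  # raised directly by the socket layer, not wrapped in URLError
    print "Timed out!"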
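A slightly broader sketch of the same idea, under a few assumptions of mine. The second traceback above was raised from read(), so the read probably belongs inside the try as well. The first traceback in this post ended in socket.error [Errno 10053] rather than socket.timeout, so catching socket.error too seems worthwhile. The names fetch and opener are hypothetical (opener exists only so the function can be tried without a network), and the Python 3 fallback import is my addition so the sketch runs on either version.

```python
import socket

try:  # Python 2, as used in this post
    from urllib2 import urlopen, URLError
except ImportError:  # Python 3 equivalent, so the sketch runs on either version
    from urllib.request import urlopen
    from urllib.error import URLError


def fetch(url, timeout=2.5, opener=urlopen):
    """Return (body, error); exactly one of the two is None.

    `fetch` and `opener` are hypothetical names for this sketch.
    """
    try:
        response = opener(url, None, timeout)
        # read() stays inside the try: a timeout can also surface while reading
        return response.read(), None
    except socket.timeout as e:
        # Most specific case first, so the broader handlers below do not shadow it
        return None, "timed out: %s" % e
    except URLError as e:
        return None, "URL error: %s" % e.reason
    except socket.error as e:
        # Other socket-level failures, e.g. an aborted connection
        return None, "socket error: %s" % e
```

Calling fetch("http://google.com") then returns either the body or a short error string instead of letting a socket-level exception escape. The order of the except clauses matters, because the more specific socket.timeout has to be tested before the broader classes.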
Hopefully this will save someone else some headache :).
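One more detail from the article: when no timeout is passed, the global socket default applies, and that default means "wait forever". A minimal sketch of setting that default for a limited scope, using socket.getdefaulttimeout() and socket.setdefaulttimeout() from the standard library. The helper name default_socket_timeout is hypothetical.

```python
import socket
from contextlib import contextmanager


@contextmanager
def default_socket_timeout(seconds):
    """Temporarily set the process-wide default socket timeout.

    `default_socket_timeout` is a hypothetical helper name for this sketch.
    """
    previous = socket.getdefaulttimeout()  # None means "block forever"
    socket.setdefaulttimeout(seconds)
    try:
        yield
    finally:
        # Restore whatever was configured before, even if the body raised
        socket.setdefaulttimeout(previous)
```

Inside a "with default_socket_timeout(2.5):" block, a urlopen call made without an explicit timeout should pick up the 2.5 second default. The setting is process-wide and only affects sockets created after it is set, so passing the timeout argument explicitly is probably still the safer choice in threaded code.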
