Overcoming Proxy Restrictions for Web Crawlers in Python

In this article, we explore how to set up a ProxyHandler proxy for urllib2 and modify a crawler script to get past the access restrictions of an office network, so that images from a forum can be downloaded and saved locally.

Anyone who has written a web crawler knows that Python's urllib2 is very convenient to use. A few lines of code are enough to fetch the source of a web page:

```python
#coding=utf-8
import urllib
import urllib2
import re

url = "http://wetest.qq.com"
# Build the request object and pass it to urlopen
request = urllib2.Request(url)
page = urllib2.urlopen(request)
html = page.read()
print html
```

You can then extract the information you need by matching regular expressions against the returned content.
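
As a rough illustration of that matching step, the sketch below applies the image pattern used later in this article to a small HTML string. It is written for Python 3, and the sample markup and the example.com URLs are made up for the example; real page markup may differ.

```python
import re

# Hypothetical HTML fragment, shaped like the markup the pattern expects.
sample_html = (
    '<img class="BDE_Image" src="http://example.com/a.jpg" pic_ext="jpeg">'
    '<img class="BDE_Image" src="http://example.com/b.jpg" pic_ext="jpeg">'
)

# Non-greedy capture of everything between src=" and the closing .jpg" pic_ext
img_pattern = re.compile(r'src="(.+?\.jpg)" pic_ext')
image_urls = img_pattern.findall(sample_html)
print(image_urls)
```

In Python 3 the body returned by `read()` is a `bytes` object, so it would need to be decoded (or matched with a bytes pattern) before a string pattern like this one can be applied.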

However, this approach may fail for some external websites when the script runs inside an office or development network.

For example, accessing http://tieba.baidu.com/p/2460150866 fails with error 10060, the Windows socket error for a connection attempt that timed out:

```python
#coding=utf-8
import urllib
import urllib2
import re

url = "http://tieba.baidu.com/p/2460150866"
# Build the request object and pass it to urlopen
request = urllib2.Request(url)
page = urllib2.urlopen(request)
html = page.read()
print html
```

Running this script fails with the 10060 error described above.
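
To see the underlying cause of such a failure instead of a bare traceback, the request can be wrapped in an exception handler. The following is a minimal Python 3 sketch, not the article's original code; `fetch` and its `opener` parameter are names introduced here for illustration.

```python
import urllib.error
import urllib.request


def fetch(url, opener=None, timeout=10):
    """Return (body, None) on success or (None, reason) if the request fails."""
    # Fall back to a default opener when none is supplied.
    opener = opener or urllib.request.build_opener()
    try:
        with opener.open(url, timeout=timeout) as resp:
            return resp.read(), None
    except urllib.error.URLError as exc:
        # exc.reason carries the underlying cause, e.g. a timed-out socket.
        return None, exc.reason
```

A call such as `body, reason = fetch("http://tieba.baidu.com/p/2460150866")` would then leave the socket error in `reason` when the connection cannot be established.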

To find the cause, I checked two things:

1. Entering the URL in a browser opens the site normally, so the site itself is reachable.

2. Running the same script on the company's experience network works fine, so the script itself is not at fault.

Together, these two checks suggest that the failure is caused by the company's access policy for external websites. I therefore looked up how to set a ProxyHandler proxy for urllib2 and modified the code as follows:

```python
#coding=utf-8
import urllib
import urllib2
import re

# The proxy address and port:
proxy_info = { 'host' : 'web-proxy.oa.com','port' : 8080 }

# We create a handler for the proxy
proxy_support = urllib2.ProxyHandler({"http" : "http://%(host)s:%(port)d" % proxy_info})

# We create an opener which uses this handler:
opener = urllib2.build_opener(proxy_support)

# Then we install this opener as the default opener for urllib2:
urllib2.install_opener(opener)

url = "http://tieba.baidu.com/p/2460150866"
request = urllib2.Request(url)
# urlopen now goes through the installed proxy opener
page = urllib2.urlopen(request)
html = page.read()
print html
```

After running the modified code, the desired HTML page can be obtained.
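
In Python 3, urllib2 was merged into `urllib.request`, so the same setup might look roughly like the sketch below. The helper name `build_proxy_opener` is introduced here for illustration, and routing `https` through the same proxy is an assumption that goes beyond the original code; the host and port are the ones shown above.

```python
import urllib.request

# Proxy host and port taken from the article; replace with your own.
proxy_info = {"host": "web-proxy.oa.com", "port": 8080}
proxy_url = "http://%(host)s:%(port)d" % proxy_info


def build_proxy_opener(proxy_url):
    """Return an opener that sends http and https requests via proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


opener = build_proxy_opener(proxy_url)
# opener.open(url) applies the proxy to this opener only;
# urllib.request.install_opener(opener) would make it the default for urlopen().
```

Keeping the opener as an explicit object, rather than installing it globally, makes it easier to see which requests actually go through the proxy.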

Is it over now? Not yet! The goal is to obtain various beautiful images from the forum and save them locally. Let's move on to the code:

```python
#coding=utf-8
import urllib
import urllib2
import re

# The proxy address and port:
proxy_info = { 'host' : 'web-proxy.oa.com','port' : 8080 }

# We create a handler for the proxy
proxy_support = urllib2.ProxyHandler({"http" : "http://%(host)s:%(port)d" % proxy_info})

# We create an opener which uses this handler:
opener = urllib2.build_opener(proxy_support)

# Then we install this opener as the default opener for urllib2:
urllib2.install_opener(opener)

url = "http://tieba.baidu.com/p/2460150866"
request = urllib2.Request(url)
page = urllib2.urlopen(request)
html = page.read()

# Regular expression matching
reg = r'src="(.+?\.jpg)" pic_ext'
imgre = re.compile(reg)
imglist = re.findall(imgre,html)
print 'start download pic'
x = 0
for imgurl in imglist:
    # Save each image into the "pic" directory, which must already exist
    urllib.urlretrieve(imgurl,'pic\\%s.jpg' % x)
    x = x+1
```

 

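After running the code again, an error still occurs! It's the 10060 error again. I've set the proxy for urllib2, so why is there still an error? The reason is that urllib.urlretrieve belongs to the older urllib module, which uses its own opener and ignores the one installed through urllib2.install_opener, so the image requests still bypass the proxy.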

So I kept looking for a solution, determined to obtain those images from the forum. Since the regular expression already yields the image URLs, why not call urllib2.urlopen on each URL directly, read the binary data of the image from the response, and write it to a local file? This led to the following code:

```python
#coding=utf-8
import urllib
import urllib2
import re

# The proxy address and port:
proxy_info = { 'host' : 'web-proxy.oa.com','port' : 8080 }

# We create a handler for the proxy
proxy_support = urllib2.ProxyHandler({"http" : "http://%(host)s:%(port)d" % proxy_info})

# We create an opener which uses this handler:
opener = urllib2.build_opener(proxy_support)

# Then we install this opener as the default opener for urllib2:
urllib2.install_opener(opener)

url = "http://tieba.baidu.com/p/2460150866"
request = urllib2.Request(url)
page = urllib2.urlopen(request)
html = page.read()

# Regular expression matching
reg = r'src="(.+?\.jpg)" pic_ext'
imgre = re.compile(reg)
imglist = re.findall(imgre,html)

x = 0
print 'start'
for imgurl in imglist:
    print imgurl
    # urllib2.urlopen goes through the installed proxy opener
    resp = urllib2.urlopen(imgurl)
    img_data = resp.read()
    # Write the raw image bytes in binary mode
    picFile = open('%s.jpg' % x, "wb")
    picFile.write(img_data)
    picFile.close()
    x = x+1
print 'done'
```

 

This time the image URLs are printed as expected and the images are saved locally.
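
The same download loop can be sketched in Python 3 as a small function that takes the opener as an argument, so every image request goes through whatever proxy that opener was built with. This is an illustrative sketch, not the original script; `save_images` and `IMG_PATTERN` are names chosen here, and the HTML is assumed to be an already decoded string.

```python
import os
import re

IMG_PATTERN = re.compile(r'src="(.+?\.jpg)" pic_ext')


def save_images(opener, html, out_dir):
    """Download every image URL found in html through opener; return saved paths."""
    saved = []
    for index, img_url in enumerate(IMG_PATTERN.findall(html)):
        path = os.path.join(out_dir, "%d.jpg" % index)
        # Fetch through the given opener so its proxy settings apply,
        # then write the raw bytes in binary mode.
        with opener.open(img_url) as resp, open(path, "wb") as pic_file:
            pic_file.write(resp.read())
        saved.append(path)
    return saved
```

With an opener created by `urllib.request.build_opener(urllib.request.ProxyHandler({...}))`, a call such as `save_images(opener, html, ".")` would write `0.jpg`, `1.jpg`, and so on into the current directory.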

At this point, the original goal has been achieved. I hope this summary is useful to others as well.
