Using the Python urllib library

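urllib is a package in Python's standard library for working with URLs and fetching web content over HTTP. It needs no separate installation and can be imported directly. The sections below walk through its main features and usage.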

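Main urllib modules

urllib contains four core modules:

urllib.request - opens and reads URLs
urllib.error - defines the exceptions raised by urllib.request
urllib.parse - parses and builds URLs
urllib.robotparser - parses a site's robots.txt file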

Basic page fetching

The urllib.request module makes it easy to fetch page content. The most basic function is urlopen():

from urllib.request import urlopen

# Open the page
response = urlopen("https://www.fly63.com/")
# Read the page content (returned as bytes)
html_content = response.read()
print(html_content)

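urlopen() takes several commonly used parameters:

url: the target address (a string or a Request object)
data: the data to send to the server, as bytes; defaults to None, and supplying it turns the request into a POST
timeout: the timeout for blocking operations, in seconds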
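
As a small offline illustration of the data parameter, the sketch below builds two Request objects without sending anything and checks which HTTP method urllib would use for each. The form field name q and its value are made up for the example.

```python
from urllib.parse import urlencode
from urllib.request import Request

url = "https://www.fly63.com/"

# No data: urllib would issue a GET request.
get_request = Request(url)

# With data (which must be bytes): urllib would issue a POST request.
# The field name "q" is a placeholder, not a real parameter of the site.
form_bytes = urlencode({"q": "python"}).encode("utf-8")
post_request = Request(url, data=form_bytes)

print(get_request.get_method())   # GET
print(post_request.get_method())  # POST
```

Either object can be passed to urlopen() in place of the URL string, together with timeout, for example urlopen(post_request, timeout=10).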

To read only part of the content, pass the number of bytes to read:

from urllib.request import urlopen

response = urlopen("https://www.fly63.com/")
# Read only the first 300 bytes
print(response.read(300))

Besides read(), the response offers other ways to read its content:

from urllib.request import urlopen

response = urlopen("https://www.fly63.com/")

# Read a single line
print(response.readline())

# Read the remaining lines; returns a list
lines = response.readlines()
for line in lines:
    print(line)
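
All of these methods return bytes, not str. The sketch below shows one way to decode them, using the charset declared in the response headers and falling back to UTF-8 as an assumption. So that it runs without network access, it opens a data: URL (which urlopen() also accepts) as a stand-in for a real page.

```python
from urllib.request import urlopen

# A data: URL stands in for a real page so the example works offline.
sample_url = "data:text/plain;charset=utf-8,first%20line%0Asecond%20line%0A"

with urlopen(sample_url) as response:
    # Charset declared in the Content-Type header; UTF-8 is an assumed fallback.
    charset = response.headers.get_content_charset() or "utf-8"
    first_line = response.readline().decode(charset)
    # readlines() returns only what has not been read yet.
    remaining = [line.decode(charset) for line in response.readlines()]

print(repr(first_line))  # 'first line\n'
print(remaining)         # ['second line\n']
```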

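Checking the page status

In practice, we need to check whether a page can be reached normally:

import urllib.error
import urllib.request

# A page that exists
response1 = urllib.request.urlopen("https://www.fly63.com/")
print(response1.getcode())  # 200 means the request succeeded

# A page that does not exist
try:
    response2 = urllib.request.urlopen("https://www.fly63.com/no-page.html")
except urllib.error.HTTPError as e:
    if e.code == 404:
        print("Page not found")  # urlopen() raises HTTPError for a 404 response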

Common HTTP status codes:

200: request succeeded
404: page not found
500: internal server error
403: access forbidden
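
The standard library's http.HTTPStatus enum already knows the official reason phrase for each of these codes, so a script does not need to hard-code the list. A minimal sketch (the helper name describe_status is made up for this example):

```python
from http import HTTPStatus

def describe_status(code):
    """Return 'code phrase' for a known HTTP status code."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        # Not a status code that HTTPStatus knows about.
        return f"{code} (unknown status)"
    return f"{status.value} {status.phrase}"

for code in (200, 403, 404, 500):
    print(describe_status(code))
```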

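Saving page content

You can save a fetched page to a local file:

from urllib.request import urlopen

response = urlopen("https://www.fly63.com/")
# Open the file in binary write mode
with open("fly63_homepage.html", "wb") as file:
    content = response.read()  # Read the page content
    file.write(content)  # Write it to the file

The with statement closes the file automatically, even if an error occurs, which makes it the safer choice.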
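
The response object is a context manager too, so it can share the with statement with the file. The sketch below also copies the body in chunks with shutil.copyfileobj() instead of reading it all into memory. To stay runnable offline it reads from a data: URL and writes to a temporary directory; both are stand-ins for a real page and a real output path.

```python
import os
import shutil
import tempfile
from urllib.request import urlopen

# Stand-ins so the example runs without network access.
sample_url = "data:text/html,%3Ch1%3EHello%3C%2Fh1%3E"
target_path = os.path.join(tempfile.mkdtemp(), "page.html")

# Both the response and the file are closed when the block exits.
with urlopen(sample_url) as response, open(target_path, "wb") as file:
    shutil.copyfileobj(response, file)  # copies chunk by chunk

with open(target_path, "rb") as file:
    saved = file.read()
print(saved)  # b'<h1>Hello</h1>'
```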

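URL encoding and decoding

When working with URLs, special characters often need to be encoded and decoded. The documented home of quote() and unquote() is urllib.parse:

import urllib.parse

original_url = "https://www.fly63.com/python tutorial"
# URL-encode (only "/" is left unescaped by default, so ":" becomes %3A)
encoded_url = urllib.parse.quote(original_url)
print(encoded_url)  # Output: https%3A//www.fly63.com/python%20tutorial

# URL-decode
decoded_url = urllib.parse.unquote(encoded_url)
print(decoded_url)  # Prints the original URL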
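
quote() escapes the colon after https because only "/" is treated as safe by default. The sketch below shows two common variations using documented urllib.parse functions: passing safe to keep the scheme separator intact, and quote_plus() for query-string values, where spaces become plus signs.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

original_url = "https://www.fly63.com/python tutorial"

# Keep ":" and "/" unescaped so the result is still a usable URL.
encoded_url = quote(original_url, safe=":/")
print(encoded_url)  # https://www.fly63.com/python%20tutorial

# For a query-string value, quote_plus() turns spaces into "+".
encoded_value = quote_plus("python tutorial")
print(encoded_value)  # python+tutorial

# Each decoder reverses its matching encoder.
print(unquote(encoded_url) == original_url)              # True
print(unquote_plus(encoded_value) == "python tutorial")  # True
```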

Setting request headers

Many sites inspect request headers, especially User-Agent. A request without suitable headers may be rejected:

import urllib.parse
import urllib.request

url = 'https://www.fly63.com/?s='
keyword = 'Python 教程'
encoded_keyword = urllib.parse.quote(keyword)  # Percent-encode the keyword
full_url = url + encoded_keyword

# Set headers that mimic a browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}

# Create a Request object that carries the headers
request = urllib.request.Request(full_url, headers=headers)
response = urllib.request.urlopen(request)
content = response.read()

# Save the search results
with open("search_result.html", "wb") as file:
    file.write(content)
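
A Request object can be inspected before it is sent, which helps confirm that the headers and query string are what you expect. The sketch below builds the same kind of search request without sending it. It uses urlencode() to assemble the query, and the header values (my-crawler/0.1 and the Accept-Language value) are placeholders chosen for the example.

```python
from urllib.parse import urlencode
from urllib.request import Request

base_url = "https://www.fly63.com/"
# urlencode() escapes each value and joins the pairs with "&".
query = urlencode({"s": "Python 教程"})
full_url = f"{base_url}?{query}"

# Placeholder header values, for illustration only.
request = Request(full_url, headers={"User-Agent": "my-crawler/0.1"})
# Headers can also be added after the object is created.
request.add_header("Accept-Language", "zh-CN,zh;q=0.9")

print(request.full_url)
# Request stores header names capitalized, e.g. "User-agent".
for name, value in request.header_items():
    print(f"{name}: {value}")
```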

Sending a POST request

Submitting data to a server calls for a POST request:

import urllib.parse
import urllib.request

# Target URL
url = 'https://www.fly63.com/login'
# Data to submit
post_data = {'username': 'testuser', 'password': 'testpass'}
# Request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Content-Type': 'application/x-www-form-urlencoded'
}

# Encode the data as a URL-encoded byte string
encoded_data = urllib.parse.urlencode(post_data).encode('utf-8')
# Create the Request object; supplying data makes it a POST
request = urllib.request.Request(url, data=encoded_data, headers=headers)
# Send the request
response = urllib.request.urlopen(request)
result = response.read()

print(f"Submitted, status code: {response.getcode()}")
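
Some endpoints expect a JSON body rather than form fields. Whether this site's login endpoint accepts JSON is not something the article states, so the sketch below only builds such a request and inspects it without sending it. The payload simply reuses the sample field names from above.

```python
import json
from urllib.request import Request

url = "https://www.fly63.com/login"
payload = {"username": "testuser", "password": "testpass"}

# The body must be bytes, so serialize to JSON and encode it.
body = json.dumps(payload).encode("utf-8")

request = Request(
    url,
    data=body,
    headers={"Content-Type": "application/json"},
    method="POST",  # explicit, although supplying data already implies POST
)

print(request.get_method())                # POST
print(request.get_header("Content-type"))  # application/json
print(json.loads(request.data) == payload)  # True
```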

Error handling

Network requests can fail in many ways, so exceptions need to be handled properly:

import urllib.request
import urllib.error

def safe_url_open(url):
    try:
        response = urllib.request.urlopen(url)
        print(f"Request succeeded, status code: {response.getcode()}")
        return response.read()
    except urllib.error.HTTPError as e:
        # HTTPError is a subclass of URLError, so it must be caught first
        print(f"HTTP error: {e.code} - {e.reason}")
    except urllib.error.URLError as e:
        print(f"URL error: {e.reason}")
    except Exception as e:
        print(f"Other error: {e}")
    return None

# Try the function
safe_url_open("https://www.fly63.com/")
safe_url_open("https://www.fly63.com/not-exist-page")
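
The order of the except clauses matters because HTTPError is a subclass of URLError: if URLError came first, it would catch HTTP errors as well. The sketch below demonstrates this offline by constructing the exceptions by hand instead of making real requests; the helper name describe_error is made up for the example.

```python
import urllib.error

def describe_error(exc):
    """Map a urllib exception to a short message, most specific type first."""
    if isinstance(exc, urllib.error.HTTPError):
        return f"HTTP error: {exc.code} - {exc.reason}"
    if isinstance(exc, urllib.error.URLError):
        return f"URL error: {exc.reason}"
    return f"Other error: {exc}"

# Hand-built exceptions stand in for failures from real requests.
not_found = urllib.error.HTTPError(
    "https://www.fly63.com/not-exist-page", 404, "Not Found", hdrs=None, fp=None
)
unreachable = urllib.error.URLError("connection refused")

print(issubclass(urllib.error.HTTPError, urllib.error.URLError))  # True
print(describe_error(not_found))    # HTTP error: 404 - Not Found
print(describe_error(unreachable))  # URL error: connection refused
```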

Parsing URLs

The urllib.parse module can split a URL into its components:

from urllib.parse import urlparse

result = urlparse("https://www.fly63.com/python/tutorial?page=1#intro")
print(f"Scheme: {result.scheme}")      # https
print(f"Host: {result.netloc}")        # www.fly63.com
print(f"Path: {result.path}")          # /python/tutorial
print(f"Query: {result.query}")        # page=1
print(f"Fragment: {result.fragment}")  # intro
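
Once a URL is split, the pieces can be read, changed, and put back together. The sketch below uses three more documented urllib.parse helpers on the same sample URL: parse_qs() to read the query, _replace() with geturl() to rebuild the URL with a new query, and urljoin() to resolve relative links. The relative links advanced and /java are invented for the example.

```python
from urllib.parse import parse_qs, urlencode, urljoin, urlparse

url = "https://www.fly63.com/python/tutorial?page=1#intro"
parts = urlparse(url)

# parse_qs() turns the query string into a dict of lists.
params = parse_qs(parts.query)
print(params)  # {'page': ['1']}

# Build the URL of the next page by replacing only the query component.
next_query = urlencode({"page": int(params["page"][0]) + 1})
next_url = parts._replace(query=next_query).geturl()
print(next_url)  # https://www.fly63.com/python/tutorial?page=2#intro

# urljoin() resolves a relative link against the page it was found on.
print(urljoin(url, "advanced"))  # https://www.fly63.com/python/advanced
print(urljoin(url, "/java"))     # https://www.fly63.com/java
```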

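Practical tips

Setting a timeout

import socket
import urllib.error
import urllib.request

try:
    # Give up after 5 seconds
    response = urllib.request.urlopen("https://www.fly63.com/", timeout=5)
    content = response.read()
except urllib.error.URLError as e:
    # Connection failures and connect timeouts arrive wrapped in URLError
    print(f"Request timed out or failed: {e.reason}")
except socket.timeout:
    # A timeout while reading the body is raised directly
    print("Timed out while reading the response")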
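
A timeout is often paired with a small number of retries. The helper below is a hypothetical sketch, not part of urllib: fetch_with_retry, its parameters, and the doubling delay are all assumptions made for illustration. The opener argument exists only so the logic can be exercised without network access.

```python
import socket
import time
import urllib.error
from urllib.request import urlopen

def fetch_with_retry(url, retries=3, timeout=5, delay=1.0, opener=urlopen):
    """Try to fetch url up to `retries` times; return bytes or None."""
    for attempt in range(1, retries + 1):
        try:
            with opener(url, timeout=timeout) as response:
                return response.read()
        except urllib.error.HTTPError:
            # The server answered with an error status; retrying rarely helps.
            raise
        except (urllib.error.URLError, socket.timeout) as exc:
            print(f"Attempt {attempt} failed: {exc}")
            if attempt < retries:
                time.sleep(delay)
                delay *= 2  # wait longer before the next attempt
    return None

# Offline demonstration: a data: URL succeeds on the first attempt.
print(fetch_with_retry("data:text/plain,ok", delay=0))  # b'ok'
```

HTTP error responses are re-raised rather than retried here, on the assumption that a 404 or 403 will not change on a second attempt; adjust that choice if the server is known to return transient errors.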

Batch downloading files

import os
import urllib.request

def download_files(url_list, save_path):
    # Create the target directory if it does not exist yet
    os.makedirs(save_path, exist_ok=True)
    for i, url in enumerate(url_list, start=1):
        try:
            response = urllib.request.urlopen(url)
            content = response.read()

            filename = os.path.join(save_path, f"file_{i}.html")
            with open(filename, "wb") as file:
                file.write(content)
            print(f"Downloaded: {filename}")
        except Exception as e:
            print(f"Download failed {url}: {e}")

# Example usage
urls = [
    "https://www.fly63.com/python",
    "https://www.fly63.com/java",
    "https://www.fly63.com/javascript"
]
download_files(urls, "./downloads")
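
Numbered names such as file_1.html lose the connection to the source URL. One alternative is to derive the file name from the last segment of the URL path. The helper below is an illustrative sketch; its name and its fallback rules (index.html when the path is empty, .html when the segment has no extension) are assumptions, not something prescribed above.

```python
import posixpath
from urllib.parse import unquote, urlparse

def filename_from_url(url, default="index.html"):
    """Return a file name based on the last segment of the URL path."""
    path = unquote(urlparse(url).path)
    name = posixpath.basename(path.rstrip("/"))
    if not name:
        return default
    # Add an extension when the segment has none, e.g. "python" -> "python.html".
    if not posixpath.splitext(name)[1]:
        name += ".html"
    return name

for url in (
    "https://www.fly63.com/python",
    "https://www.fly63.com/java/",
    "https://www.fly63.com/",
    "https://www.fly63.com/static/logo.png?v=2",
):
    print(filename_from_url(url))
```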

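Summary

urllib is a capable HTTP toolkit in Python's standard library that covers common page-fetching tasks. It provides the essential tools for sending requests, handling responses, handling errors, and parsing URLs.

For beginners, urllib is a good starting point for learning network programming. Once you are comfortable with it, you may want to look at higher-level libraries such as requests, a third-party package that offers a more concise interface (it is built on urllib3, not on the standard-library urllib).

In real use, remember to follow each site's robots.txt rules, respect its terms of use, and avoid putting excessive load on its servers.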
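
The urllib.robotparser module can check robots.txt rules for you. In a real crawler you would call set_url() with the site's robots.txt address and then read(); the sketch below instead feeds a made-up rule set to parse() so that it runs offline. The rules and the user-agent name my-crawler are invented for the example and do not reflect the actual robots.txt of any site.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration only.
sample_rules = [
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 2",
]

parser = RobotFileParser()
parser.parse(sample_rules)

user_agent = "my-crawler"  # hypothetical crawler name
print(parser.can_fetch(user_agent, "https://www.fly63.com/python"))          # True
print(parser.can_fetch(user_agent, "https://www.fly63.com/private/a.html"))  # False
print(parser.crawl_delay(user_agent))  # 2
```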

Link: https://fly63.com/course/36_2120
