Writing Python Crawler Scripts

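Introduction:

A Python crawler, also known as a web crawler or web spider, is a program that browses the internet automatically and extracts information from web pages. Python's concise syntax, rich library support (requests, BeautifulSoup, Scrapy, and others), and strong networking capabilities have made it one of the preferred languages for writing crawlers. The following briefly covers the basic concepts, uses, basic steps, and caveats of Python crawlers.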

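Uses:

Data collection: scraping data from websites, such as news, stock prices, and weather information, for use in data analysis, machine learning, and similar work.
Search engines: crawlers are one of the underlying technologies of search engines. They crawl pages across the internet and build the index database that users search.
Content aggregation: combining information from multiple sources in one place, as RSS readers and news aggregation sites do.
Automated testing: simulating user behavior to test a website automatically and check it for problems.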

I. Logging in with a Python POST request

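The login target here is a DVWA lab set up with phpstudy. I was stuck for a while because the form also requires a user_token and the Login action field. The script uses a Session from the requests module to keep the same session (cookies) across requests, and the BeautifulSoup module to extract the user_token. If the login succeeds, the script can fetch the HTML source of index.php.

The script: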

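import requests
from bs4 import BeautifulSoup

LOGIN_URL = 'http://192.168.1.9/dvwa-master/login.php'
INDEX_URL = 'http://192.168.1.9/dvwa-master/index.php'

# A Session keeps the cookies, so the token request and the login share one PHP session
session = requests.Session()
resp = session.get(LOGIN_URL)
soup = BeautifulSoup(resp.text, 'html.parser')
# The login form carries a hidden field named user_token
user_token = soup.find('input', {'name': 'user_token'})['value']

data = {
    'username': 'admin',
    'password': 'password',
    'Login': 'Login',
    'user_token': user_token
}

login_resp = session.post(LOGIN_URL, data=data)
# requests follows redirects, so a failed login also ends with status 200;
# DVWA sends failed logins back to login.php, so check the final URL as well
if login_resp.status_code == 200 and 'login.php' not in login_resp.url:
    print('Login succeeded')
    resp_index = session.get(INDEX_URL)
    print(resp_index.text)
else:
    print('Login failed')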
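
The token-extraction step can be checked without a running DVWA instance. The sketch below runs the same soup.find lookup against a hard-coded form. The HTML, the token value, and the helper name extract_user_token are made up for illustration; the real login page has more markup.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down login form; not the actual DVWA markup
SAMPLE_LOGIN_HTML = """
<form action="login.php" method="post">
  <input type="text" name="username">
  <input type="password" name="password">
  <input type="submit" name="Login" value="Login">
  <input type="hidden" name="user_token" value="0123456789abcdef">
</form>
"""

def extract_user_token(html):
    """Return the value of the hidden user_token input, or None if it is absent."""
    soup = BeautifulSoup(html, 'html.parser')
    field = soup.find('input', {'name': 'user_token'})
    if field is None:
        return None
    return field.get('value')

print(extract_user_token(SAMPLE_LOGIN_HTML))
print(extract_user_token('<form></form>'))
```

Returning None for a missing field avoids the TypeError that indexing ['value'] on a failed find would raise, for example when the page layout changes.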

II. A crawler built on regular expressions

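The idea is to find candidate links with a regular expression, join relative ones into absolute URLs by hand, and then recurse: each newly collected URL is fetched and scanned for links in turn.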

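Disadvantage: it cannot crawl every web page. This approach only works for specific pages, because the matching is tied to their HTML structure, and the filter rules have to be adjusted for each site.
Advantage: it uses a try/except structure, so the expected kinds of errors are handled without stopping the crawl.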
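
To make the disadvantage concrete, the sketch below applies the pattern used in this section to three anchor tags. The HTML snippets and the helper name find_links are invented for illustration; only the first snippet has the exact shape the pattern expects.

```python
import re

PATTERN = r'<a href="(.+?)"'  # same pattern as the crawler in this section

def find_links(html):
    """Return every href captured by PATTERN in the given HTML string."""
    return re.findall(PATTERN, html)

# Hypothetical snippets
samples = [
    '<a href="/about/">About</a>',
    '<a class="nav" href="/tags/">Tags</a>',  # href is not the first attribute
    "<a href='/archives/'>Archives</a>",       # single-quoted attribute value
]

for html in samples:
    print(find_links(html))
```

The second and third tags are valid HTML but produce no match, which is why the pattern has to be adapted to each site's markup.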

Test site: airunfive's blog

The code:

import re
import requests
from urllib.parse import urljoin

def crawler(url, Links=None):
    if Links is None:
        Links = set()
    try:
        new_links = []  # URLs first seen on this page, so the whole set is not rescanned
        session = requests.Session()

        resp = session.get(url, timeout=10)
        resp.raise_for_status()  # raises HTTPError for 4xx/5xx responses

        links = re.findall(r'<a href="(.+?)"', resp.text)
        for link in links:
            if link.startswith('#') or link == '/':
                continue

            if link.startswith('/'):
                link = urljoin(url, link)

            if link not in Links:
                new_links.append(link)  # record only the newly discovered URLs

            Links.add(link)  # a set discards duplicate URLs automatically

        for new_url in new_links:  # recurse only into the newly discovered URLs
            if not new_url.startswith('https://airunfive.github.io/'):
                continue
            crawler(new_url, Links)  # Links is shared, so results accumulate in place
    except requests.RequestException as e:
        print(f"Request error: {e}")
    except Exception as e:
        print(f"Unexpected error: {e}")

    return Links


if __name__ == '__main__':
    Links = crawler('https://airunfive.github.io/')

    for link in Links:
        print(link)

# Result: all of the blog's links can be collected (partial output shown):

https://airunfive.github.io/tags/web%E5%8E%9F%E7%90%86%E8%AE%B2%E8%A7%A3/
https://airunfive.github.io/2023/07/11/%E6%96%87%E4%BB%B6%E5%8C%85%E5%90%AB%E5%81%9A%E9%A2%98%E8%AE%B0%E5%BD%95/        
http://airunfive.github.io/2023/01/30/%E5%A0%86%E6%BC%8F%E6%B4%9Eoffbyone/
http://airunfive.github.io/2023/04/25/upload-labs%E9%80%9A%E5%85%B3%E8%A7%A3%E6%9E%90/
http://airunfive.github.io/2022/11/01/%E6%B1%87%E7%BC%96%E6%8C%87%E4%BB%A4%E5%B0%8F%E7%BB%93/
http://airunfive.github.io/2023/07/28/%E5%BA%8F%E5%88%97%E5%8C%96%E4%B9%8Bpop%E9%93%BE%E6%9E%84%E9%80%A0/
https://airunfive.github.io/tags/web%E5%81%9A%E9%A2%98%E8%AE%B0%E5%BD%95/
http://airunfive.github.io/2023/07/10/buuctf%E5%88%B7%E9%A2%98%E8%AE%B0%E5%BD%951/
https://airunfive.github.io/archives/2022/page/2/
https://airunfive.github.io/2022/10/27/%E6%B5%85%E6%9E%90jmp%E6%8C%87%E4%BB%A4%E7%9A%84%E5%8E%9F%E7%90%86/
https://airunfive.github.io/tags/pwn-write-up-pwn%E5%8E%9F%E7%90%86%E8%AE%B2%E8%A7%A3/
http://airunfive.github.io/2023/07/11/%E5%91%BD%E4%BB%A4%E6%B3%A8%E5%85%A5%E5%81%9A%E9%A2%98%E8%AE%B0%E5%BD%95/

......
    
https://airunfive.github.io/tags/%E6%96%87%E4%BB%B6%E5%8C%85%E5%90%AB/
https://airunfive.github.io/2022/12/27/CGfsb-wp/
http://airunfive.github.io/2023/03/17/%E6%A0%BC%E5%BC%8F%E5%8C%96%E5%AD%97%E7%AC%A6%E4%B8%B2%E6%BC%8F%E6%B4%9E%E5%81%8F%E7%A7%BB%E7%9A%84%E4%BE%BF%E6%8D%B7%E7%AE%97%E6%B3%95/
https://airunfive.github.io/2022/09/25/%E7%99%BE%E5%BA%A6%E6%98%AF%E4%B8%AA%E5%A5%BD%E4%B8%9C%E8%A5%BF/
https://airunfive.github.io/2023/07/15/%E5%BA%8F%E5%88%97%E5%8C%96%E4%B8%8E%E5%8F%8D%E5%BA%8F%E5%88%97%E5%8C%96/        
http://airunfive.github.io/2023/07/11/%E6%96%87%E4%BB%B6%E5%8C%85%E5%90%AB%E5%81%9A%E9%A2%98%E8%AE%B0%E5%BD%95/
http://airunfive.github.io/2023/01/05/%E6%89%93%E5%8D%A1%EF%BC%9A%E7%9B%B2%E6%89%93%E9%A2%98-warm-up/
http://airunfive.github.io/2023/08/24/phar%E5%8F%8D%E5%BA%8F%E5%88%97%E5%8C%96/
http://airunfive.github.io/2023/03/25/sqli-labs-1-4/
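
The listing above contains the same article under both http:// and https://, because the set compares URLs as plain strings. One possible way to merge such entries is to normalize each URL before adding it to the set. The helper below is a sketch and not part of the original script; it assumes the site serves identical content over both schemes.

```python
from urllib.parse import urlsplit, urlunsplit

def normalize(url, scheme='https'):
    """Force one scheme, lowercase the host, and drop the fragment.

    Assumption: the http and https variants of a URL point to the same page.
    """
    parts = urlsplit(url.strip())
    return urlunsplit((scheme, parts.netloc.lower(), parts.path, parts.query, ''))

a = normalize('http://airunfive.github.io/2023/03/25/sqli-labs-1-4/')
b = normalize('https://airunfive.github.io/2023/03/25/sqli-labs-1-4/#top')
print(a)
print(a == b)
```

Calling normalize on each link before the membership check would also let the https prefix filter follow links that the page wrote with http://.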

III. A crawler built on the BeautifulSoup module

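Advantage: BeautifulSoup builds an HTML tree, so links can be matched and extracted more precisely.
Disadvantage: it takes longer, because BeautifulSoup has to build the HTML tree for every page it fetches.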
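
As a small illustration of the precision claim, the sketch below extracts href values from tags that a pattern such as <a href="(.+?)" would miss. The HTML and the helper name extract_hrefs are invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: attribute order and quoting vary, and one <a> has no href
HTML = """
<a href="/about/">About</a>
<a class="nav" href="/tags/">Tags</a>
<a href='/archives/'>Archives</a>
<a name="top">No href here</a>
"""

def extract_hrefs(html):
    """Return the href of every <a> tag that actually has one, in document order."""
    soup = BeautifulSoup(html, 'html.parser')
    return [a['href'] for a in soup.find_all('a', href=True)]

print(extract_hrefs(HTML))
```

Passing href=True to find_all keeps only anchors that carry the attribute, so no separate None check is needed afterwards.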

1. Downloading the images on the current page

from bs4 import BeautifulSoup
import requests
import time

def crawler_png():
    session = requests.Session()
    resp = session.get('https://airunfive.github.io/')
    http_tree = BeautifulSoup(resp.text, 'html.parser')
    images = http_tree.find_all('img')

    for img in images:
        value = img.get('src')  # not every <img> tag has a src attribute
        if value and value.startswith('/'):
            value = 'https://airunfive.github.io' + value
            # Prefix a timestamp to the original file name
            filename = time.strftime('%Y%m%d_%H%M%S_') + value.split('/')[-1]
            # Download first, so a failed request does not leave an empty file behind
            img_resp = session.get(value)
            if img_resp.status_code != 200:
                continue
            # The target directory must already exist
            with open("D:/vscode/security_python/crawler_test/images/" + filename, mode='wb') as f:
                f.write(img_resp.content)

if __name__ == '__main__':
    crawler_png()
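
The script above only handles src values that start with a single '/'. A more general way to build the download URL and file name might look like the sketch below, which relies on urljoin instead of string concatenation. The helper name image_target and the example paths are hypothetical.

```python
import posixpath
from urllib.parse import urljoin, urlsplit

def image_target(page_url, src):
    """Return (absolute_url, filename) for an <img src> value found on page_url."""
    absolute = urljoin(page_url, src)
    # Take the last path segment only, so a query string does not end up in the name
    filename = posixpath.basename(urlsplit(absolute).path)
    return absolute, filename

# Hypothetical src values: root-relative, page-relative, and protocol-relative
print(image_target('https://airunfive.github.io/', '/img/avatar.png'))
print(image_target('https://airunfive.github.io/2022/12/27/CGfsb-wp/', 'pic.png?v=2'))
print(image_target('https://airunfive.github.io/', '//cdn.example.com/a/b.jpg'))
```

The third case matters because a protocol-relative src also starts with '/', so prepending the site root to it would produce a broken URL.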

2. Crawling links recursively

from bs4 import BeautifulSoup
import requests
from urllib.parse import urljoin

def crawler(url, Links=None):
    if Links is None:
        Links = set()
    try:
        new_links = []
        session = requests.Session()
        resp = session.get(url, timeout=10)  # fetch the page passed in, not a fixed URL
        resp.raise_for_status()  # raises HTTPError for 4xx/5xx responses
        http_tree = BeautifulSoup(resp.text, 'html.parser')
        links = http_tree.find_all('a')

        for link in links:
            value = link.get('href')  # not every <a> tag has an href, so check for None

            if not value or value.startswith('#') or value == '/':
                continue

            if value.startswith('/'):
                value = urljoin(url, value)

            if value not in Links:
                new_links.append(value)  # record only the newly discovered URLs

            Links.add(value)  # a set discards duplicate URLs automatically

        # Recurse once the whole page has been scanned, and only into the new URLs
        for new_url in new_links:
            if not new_url.startswith('https://airunfive.github.io/'):
                continue
            crawler(new_url, Links)  # Links is shared, so results accumulate in place

    except requests.RequestException as e:
        print(f"Request error: {e}")
    except Exception as e:
        print(f"Unexpected error: {e}")
    return Links

if __name__ == '__main__':
    Links = crawler('https://airunfive.github.io')

    for link in Links:
        print(link)
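
On a larger site, recursing once per page could run into Python's recursion limit. One possible alternative is a breadth-first loop driven by a queue, sketched below. The function crawl, its parameters, and the fake three-page site are assumptions made for illustration; get_links stands in for the fetch-and-parse step of either crawler above, so the sketch runs without network access.

```python
from collections import deque
from urllib.parse import urljoin

def crawl(start_url, get_links, prefix, max_pages=100):
    """Breadth-first crawl without recursion.

    get_links(url) is a caller-supplied function that returns the raw href
    values found on that page (for example via requests + BeautifulSoup).
    """
    seen = {start_url}
    queue = deque([start_url])
    visited = 0
    while queue and visited < max_pages:
        url = queue.popleft()
        visited += 1
        for href in get_links(url):
            link = urljoin(url, href)
            if link in seen:
                continue
            seen.add(link)
            if link.startswith(prefix):  # only follow links on the target site
                queue.append(link)
    return seen

# Hypothetical three-page site used in place of real HTTP requests
FAKE_SITE = {
    'https://example.com/': ['/a/', '/b/', 'https://other.example.org/'],
    'https://example.com/a/': ['/', '/b/'],
    'https://example.com/b/': ['/a/', '/c/'],
}

result = crawl('https://example.com/', lambda u: FAKE_SITE.get(u, []), 'https://example.com/')
print(sorted(result))
```

External links are recorded but never fetched, which matches the prefix check in the recursive versions, and max_pages puts an upper bound on the number of pages requested.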