Writing Python Crawler Scripts
Introduction:
A Python crawler, also called a web crawler or web spider, is a program that automatically browses the internet and scrapes information from web pages. Python's concise syntax, rich library support (requests, BeautifulSoup, Scrapy, and so on), and strong networking capabilities make it one of the preferred languages for writing crawlers. The rest of this post briefly introduces the basic concepts and uses of Python crawlers and then walks through several implementation examples; a minimal fetch-and-parse sketch follows the list of uses below.
Uses:
Data collection: scraping data such as news, stock prices, or weather information from websites, for data analysis, machine learning, and similar tasks.
Search engines: crawling is one of the core technologies behind search engines, which crawl pages across the internet and build the index databases that users search.
Content aggregation: pulling information from multiple sources into one place, as RSS readers and news aggregation sites do.
Automated testing: simulating user behavior to test a website automatically and check whether it has problems.
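Before the specific examples, here is the minimal fetch-and-parse pattern that everything below builds on: requests downloads a page and BeautifulSoup turns the HTML into a searchable tree. This is only a sketch, and the URL is a placeholder rather than one of the targets used later.
import requests
from bs4 import BeautifulSoup

# Placeholder URL; replace it with the page you actually want to fetch.
resp = requests.get('https://example.com/', timeout=10)
resp.raise_for_status()  # fail loudly on 4xx/5xx responses
soup = BeautifulSoup(resp.text, 'html.parser')
print(soup.title.string if soup.title else 'no <title> found')
for a in soup.find_all('a'):  # print every link on the page
    print(a.get('href'))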
I. Implementing a POST login with Python
This example logs in to a DVWA target built with phpStudy. It took a little while to get right, because the request also has to carry the user_token field and the Login submit value. The script uses the requests module's Session object to keep the same session (and cookies) across requests, and BeautifulSoup to pull user_token out of the login page. On success, the script can retrieve the HTML source of index.php.
The script is as follows:
import requests
from bs4 import BeautifulSoup

session = requests.Session()
resp = session.get('http://192.168.1.9/dvwa-master/login.php')
soup = BeautifulSoup(resp.text, 'html.parser')
# the login form carries a CSRF token that has to be posted back with the credentials
user_token = soup.find('input', {'name': 'user_token'})['value']
data = {
    'username': 'admin',
    'password': 'password',
    'Login': 'Login',
    'user_token': user_token
}
login_resp = session.post('http://192.168.1.9/dvwa-master/login.php', data)
if login_resp.status_code == 200:
    print('Login successful')
    resp_index = session.get('http://192.168.1.9/dvwa-master/index.php')
    print(resp_index.text)
else:
    print('Login failed')
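Note that the status-code check above is only a rough indicator: DVWA usually answers both a successful and a failed login with a redirect that ends in a 200 response. A stricter check is sketched below; it assumes that a successful login lands on index.php while a failed one is redirected back to login.php, which matches the usual DVWA behavior but should be verified against your own instance.
# Sketch of a stricter success check: look at where the POST finally landed.
login_resp = session.post('http://192.168.1.9/dvwa-master/login.php', data)
if 'index.php' in login_resp.url:      # assumption: success redirects to index.php
    print('Login successful')
elif 'login.php' in login_resp.url:    # assumption: failure redirects back to login.php
    print('Login failed (sent back to the login page)')
else:
    print('Unexpected final URL:', login_resp.url)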
II. Implementing a crawler with regular expressions
The idea of this code is to find candidate links with a regular expression, join relative paths into absolute URLs by hand, and then recursively repeat the link extraction on each newly collected URL.
Disadvantage: it cannot crawl arbitrary pages. This style of crawler only works against a specific site, because the extraction depends on that site's HTML structure and the filtering rules have to be adjusted for each target.
Advantage: it uses a try/except structure, so errors within the expected range are handled on their own without stopping the crawl.
Test site: airunfive's blog
The code is as follows:
import re
import requests
from urllib.parse import urljoin

def crawler(url, Links=None):
    if Links is None:
        Links = set()
    try:
        new_links = []  # record only the URLs added in this call, so we don't re-traverse everything
        session = requests.Session()
        resp = session.get(url, timeout=10)
        resp.raise_for_status()  # raise HTTPError if the status code is not 200
        links = re.findall('<a href="(.+?)"', resp.text)
        for link in links:
            if link.startswith('#') or link == '/':
                continue
            if link.startswith('/'):
                link = urljoin(url, link)
            if link not in Links:
                new_links.append(link)  # track the URLs that are new in this recursion step
                Links.add(link)  # add them to the overall set; the set removes duplicates automatically
        length = len(new_links)
        for i in range(length):  # recurse only into the newly added URLs
            url = new_links[i]
            if not url.startswith('https://airunfive.github.io/'):
                continue
            crawler(url, Links)  # Links accumulates every URL found so far
    except requests.RequestException as e:
        print(f"Request error: {e}")
    except Exception as e:
        print(f"Unexpected error: {e}")
    return Links

if __name__ == '__main__':
    Links = crawler('https://airunfive.github.io/')
    for link in Links:
        print(link)
The output is shown below; the crawler can pull every link on the blog (partial listing):
https://airunfive.github.io/tags/web%E5%8E%9F%E7%90%86%E8%AE%B2%E8%A7%A3/
https://airunfive.github.io/2023/07/11/%E6%96%87%E4%BB%B6%E5%8C%85%E5%90%AB%E5%81%9A%E9%A2%98%E8%AE%B0%E5%BD%95/
http://airunfive.github.io/2023/01/30/%E5%A0%86%E6%BC%8F%E6%B4%9Eoffbyone/
http://airunfive.github.io/2023/04/25/upload-labs%E9%80%9A%E5%85%B3%E8%A7%A3%E6%9E%90/
http://airunfive.github.io/2022/11/01/%E6%B1%87%E7%BC%96%E6%8C%87%E4%BB%A4%E5%B0%8F%E7%BB%93/
http://airunfive.github.io/2023/07/28/%E5%BA%8F%E5%88%97%E5%8C%96%E4%B9%8Bpop%E9%93%BE%E6%9E%84%E9%80%A0/
https://airunfive.github.io/tags/web%E5%81%9A%E9%A2%98%E8%AE%B0%E5%BD%95/
http://airunfive.github.io/2023/07/10/buuctf%E5%88%B7%E9%A2%98%E8%AE%B0%E5%BD%951/
https://airunfive.github.io/archives/2022/page/2/
https://airunfive.github.io/2022/10/27/%E6%B5%85%E6%9E%90jmp%E6%8C%87%E4%BB%A4%E7%9A%84%E5%8E%9F%E7%90%86/
https://airunfive.github.io/tags/pwn-write-up-pwn%E5%8E%9F%E7%90%86%E8%AE%B2%E8%A7%A3/
http://airunfive.github.io/2023/07/11/%E5%91%BD%E4%BB%A4%E6%B3%A8%E5%85%A5%E5%81%9A%E9%A2%98%E8%AE%B0%E5%BD%95/
......
https://airunfive.github.io/tags/%E6%96%87%E4%BB%B6%E5%8C%85%E5%90%AB/
https://airunfive.github.io/2022/12/27/CGfsb-wp/
http://airunfive.github.io/2023/03/17/%E6%A0%BC%E5%BC%8F%E5%8C%96%E5%AD%97%E7%AC%A6%E4%B8%B2%E6%BC%8F%E6%B4%9E%E5%81%8F%E7%A7%BB%E7%9A%84%E4%BE%BF%E6%8D%B7%E7%AE%97%E6%B3%95/
https://airunfive.github.io/2022/09/25/%E7%99%BE%E5%BA%A6%E6%98%AF%E4%B8%AA%E5%A5%BD%E4%B8%9C%E8%A5%BF/
https://airunfive.github.io/2023/07/15/%E5%BA%8F%E5%88%97%E5%8C%96%E4%B8%8E%E5%8F%8D%E5%BA%8F%E5%88%97%E5%8C%96/
http://airunfive.github.io/2023/07/11/%E6%96%87%E4%BB%B6%E5%8C%85%E5%90%AB%E5%81%9A%E9%A2%98%E8%AE%B0%E5%BD%95/
http://airunfive.github.io/2023/01/05/%E6%89%93%E5%8D%A1%EF%BC%9A%E7%9B%B2%E6%89%93%E9%A2%98-warm-up/
http://airunfive.github.io/2023/08/24/phar%E5%8F%8D%E5%BA%8F%E5%88%97%E5%8C%96/
http://airunfive.github.io/2023/03/25/sqli-labs-1-4/
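One more caveat about the recursive version: every batch of newly discovered links triggers another recursive call, so a large enough site could run into Python's default recursion limit (roughly 1000 frames). The sketch below is an equivalent iterative, queue-based variant under the same assumptions about the target site; it is an alternative formulation, not part of the original script.
import re
from collections import deque
from urllib.parse import urljoin

import requests

def crawler_iterative(start_url):
    # breadth-first variant of the regex crawler above; avoids deep recursion
    links = set()
    queue = deque([start_url])
    session = requests.Session()
    while queue:
        url = queue.popleft()
        try:
            resp = session.get(url, timeout=10)
            resp.raise_for_status()
        except requests.RequestException as e:
            print(f"Request error: {e}")
            continue
        for link in re.findall('<a href="(.+?)"', resp.text):
            if link.startswith('#') or link == '/':
                continue
            if link.startswith('/'):
                link = urljoin(url, link)
            if link not in links:
                links.add(link)
                # only crawl further within the blog, same rule as the recursive version
                if link.startswith('https://airunfive.github.io/'):
                    queue.append(link)
    return links

if __name__ == '__main__':
    for link in crawler_iterative('https://airunfive.github.io/'):
        print(link)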
III. Implementing a crawler with the BeautifulSoup module
Advantage: BeautifulSoup builds an HTML tree from the page, so links can be matched and extracted much more precisely.
Disadvantage: it takes noticeably longer, because BeautifulSoup first has to parse every fetched page into a tree.
1. Crawling images from the current page
from bs4 import BeautifulSoup
import requests
import time

def crawler_png():
    session = requests.Session()
    resp = session.get('https://airunfive.github.io/')
    http_tree = BeautifulSoup(resp.text, 'html.parser')
    links = http_tree.find_all('img')
    for link in links:
        value = link.get('src')
        if value and value.startswith('/'):
            value = 'https://airunfive.github.io' + value
            # timestamp prefix keeps file names unique across runs
            filename = time.strftime('%Y%m%d_%H%M%S_') + value.split('/')[-1]
            with open("D:/vscode/security_python/crawler_test/images/" + filename, mode='wb') as f:
                resp = session.get(value)
                f.write(resp.content)

if __name__ == '__main__':
    crawler_png()
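The script writes into a hard-coded Windows directory that has to exist before the first run; otherwise open() raises FileNotFoundError. A small sketch that creates the directory up front (the path is just the example used above and should be adapted to your environment):
from pathlib import Path

# example path from the script above; adjust to your own setup
save_dir = Path("D:/vscode/security_python/crawler_test/images")
save_dir.mkdir(parents=True, exist_ok=True)  # create the folder (and any parents) if missing
# then open files relative to it: open(save_dir / filename, mode='wb')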
2. Recursively crawling links
from bs4 import BeautifulSoup
import requests
from urllib.parse import urljoin

def crawler(url, Links=None):
    if Links is None:
        Links = set()
    try:
        new_links = []  # record only the URLs added in this call, so we don't re-traverse everything
        session = requests.Session()
        resp = session.get(url, timeout=10)  # fetch the URL passed in, not just the homepage
        http_tree = BeautifulSoup(resp.text, 'html.parser')
        links = http_tree.find_all('a')
        for link in links:
            value = link.get('href')  # fetch first, then test: not every <a> tag has an href attribute
            if value and value.startswith('/'):
                value = urljoin(url, value)
            if value and value not in Links:
                new_links.append(value)  # track the URLs that are new in this recursion step
                Links.add(value)  # add them to the overall set; duplicates are removed automatically
        length = len(new_links)
        for i in range(length):  # recurse only into the newly added URLs
            url = new_links[i]
            if not url.startswith('https://airunfive.github.io/'):
                continue
            crawler(url, Links)  # Links accumulates every URL found so far
    except requests.RequestException as e:
        print(f"Request error: {e}")
    except Exception as e:
        print(f"Unexpected error: {e}")
    return Links

if __name__ == '__main__':
    Links = crawler('https://airunfive.github.io')
    for link in Links:
        print(link)
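As the sample output in section II shows, the crawl can collect the same article twice, once with http:// and once with https://. A small sketch that normalizes the scheme and trailing slash before deduplicating, using urllib.parse; the normalize helper is my own illustration, not part of the original script:
from urllib.parse import urlparse, urlunparse

def normalize(url):
    # collapse http/https and trailing-slash variants of the same page into one form (sketch)
    parts = urlparse(url)
    path = parts.path.rstrip('/') or '/'
    return urlunparse(('https', parts.netloc, path, '', parts.query, ''))

# Calling normalize(value) before the `value not in Links` check keeps only the canonical form:
print(normalize('http://airunfive.github.io/2023/03/25/sqli-labs-1-4/'))
print(normalize('https://airunfive.github.io/2023/03/25/sqli-labs-1-4'))
# both print: https://airunfive.github.io/2023/03/25/sqli-labs-1-4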