
Web Crawler in Practice 1

<h1 id="re正则">Regular expressions with re:</h1>

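re.findall("pattern", text_to_parse, flags) returns every match of the pattern in the text. The flag used here is re.S, which lets . match line breaks as well.

.*? ----> filter: a non-greedy skip over whatever lies between two landmarks
(.*?) ----> extract: a non-greedy capture of the content between two landmarks

<h1 id="json的使用">Using json:</h1>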
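As a minimal sketch of the json module in use, the round trip between Python data and JSON text might look like the following. The record and its field names are hypothetical, chosen only to resemble one row of scraped output, and `io.StringIO` stands in for a real file.

```python
import io
import json

# Hypothetical record (illustrative values, not real scraped data).
movie = {"name": "Example Movie", "point": "9.0", "count": "12345"}

# Python data -> JSON string.
text = json.dumps(movie, ensure_ascii=False)

# JSON string -> Python data.
restored = json.loads(text)

# The file-based variants: json.dump writes to a file object,
# json.load reads from one. StringIO is used here instead of a file on disk.
buffer = io.StringIO()
json.dump(movie, buffer, ensure_ascii=False)
buffer.seek(0)
from_file = json.load(buffer)

print(text)
print(restored["name"], from_file["point"])
```

Passing `ensure_ascii=False` keeps non-ASCII characters such as Chinese titles readable in the output instead of escaping them.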

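json ----> a language-independent data interchange format, handled in Python by the json module

json.dumps() / json.dump() ----> convert Python data to JSON (dumps returns a string, dump writes to a file object)
json.loads() / json.load() ----> convert JSON data to Python data (loads parses a string, load reads from a file object)

<h1 id="爬取多网页">Crawling multiple pages:</h1>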
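The Top 250 list is paged through the `start` query parameter, which advances by 25 per page (0, 25, 50, 75, ...). A minimal sketch of building the ten page URLs up front, where `page_urls` and `BASE_URL` are hypothetical names and not part of the original script:

```python
BASE_URL = "https://movie.douban.com/top250?start={start}&filter="


def page_urls(pages=10, page_size=25):
    """Return the list-page URLs for offsets 0, 25, 50, ..."""
    return [BASE_URL.format(start=i * page_size) for i in range(pages)]


urls = page_urls()
print(urls[0])
print(urls[-1])
```

Deriving each offset from the page index avoids keeping a separate counter that could be overwritten by accident inside the loop.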

    import re
    import requests

    # Assumption (not in the original post): Douban may reject the default
    # requests User-Agent, so a browser-like header is sent with each request.
    headers = {'User-Agent': 'Mozilla/5.0'}

    # 0. Build the URL of every list page: start = 0, 25, 50, 75, ...
    num = 0
    for page in range(10):
        url = f'https://movie.douban.com/top250?start={num}&filter='
        num += 25

        # 1. Send the request
        response = requests.get(url=url, headers=headers)

        # 2. Extract the detail-page URL, title, rating and number of ratings
        # movie_name = re.findall('<div class="item">.*?<a href="(.*?)">.*?<span class="title">(.*?)</span>', response.text, re.S)
        movie_list = re.findall(
            '<div class="item">.*?<a href="(.*?)">.*?<span class="title">(.*?)</span>.*?<span class="rating_num" property="v:average">(.*?)</span>.*?<span>(.*?)人评价</span>',
            response.text, re.S)

        # 3. Append one line per movie to the output file
        with open('douban.txt', 'a', encoding='utf-8') as f:
            for line in movie_list:
                movie_url = line[0]
                movie_name = line[1]
                movie_point = line[2]
                movie_count = line[3]
                f.write(movie_url + '---' + movie_name + '---' + movie_point + '---' + movie_count + '\n')

    print('Data written successfully; crawler finished...')
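The pattern passed to `re.findall` above alternates `.*?` skips with `(.*?)` captures, so each match comes back as a tuple of four strings. A minimal offline sketch of that behaviour follows. `SAMPLE_HTML` is a hypothetical fragment that only imitates the structure the pattern expects; it is not real Douban markup or data.

```python
import re

# Hypothetical HTML fragment (illustrative only).
SAMPLE_HTML = """
<div class="item">
  <a href="https://example.com/movie/1">
    <span class="title">Example Movie A</span>
  </a>
  <span class="rating_num" property="v:average">9.0</span>
  <span>1000人评价</span>
</div>
<div class="item">
  <a href="https://example.com/movie/2">
    <span class="title">Example Movie B</span>
  </a>
  <span class="rating_num" property="v:average">8.5</span>
  <span>200人评价</span>
</div>
"""

# Same pattern as the script, split across lines for readability.
PATTERN = (
    '<div class="item">.*?<a href="(.*?)">.*?<span class="title">(.*?)</span>'
    '.*?<span class="rating_num" property="v:average">(.*?)</span>'
    '.*?<span>(.*?)人评价</span>'
)

# With re.S, "." also matches line breaks, so each multi-line item matches.
rows = re.findall(PATTERN, SAMPLE_HTML, re.S)
for movie_url, movie_name, movie_point, movie_count in rows:
    print(movie_url, movie_name, movie_point, movie_count)

# Without re.S, "." stops at line breaks and nothing in this fragment matches.
rows_without_flag = re.findall(PATTERN, SAMPLE_HTML)
print(len(rows_without_flag))
```

Because the pattern has four groups, `re.findall` returns a list of 4-tuples, which is why the script can index each result with `line[0]` through `line[3]`.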

Source: cnblogs (博客园)

Author: shaozheng

Link: https://www.cnblogs.com/shaozheng/p/11426076.html
