Four small crawler examples, all built on `requests`.

```python
# example-1: download a single image from a website
import requests

if __name__ == "__main__":
    url1 = "https://www.tommonkey.cn/img/ctfhub_rce/rce_1/1-1.PNG"
    data_pic = requests.get(url=url1).content  # the image is binary, so use .content
    with open('./picture.jpg', 'wb') as fb:
        fb.write(data_pic)
    print("Task is complete!")
```
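For a single small image `.content` is fine, but it buffers the whole response in memory. For larger files, requests can stream the body in chunks via its documented `stream=True` / `iter_content` API; a minimal sketch (the helper name `download_binary` is mine):

```python
import requests

def download_binary(url, path, chunk_size=8192):
    # Stream the response so large files are never held in memory at once.
    with requests.get(url, stream=True, timeout=10) as resp:
        resp.raise_for_status()  # fail loudly on 4xx/5xx instead of saving an error page
        with open(path, "wb") as fd:
            for chunk in resp.iter_content(chunk_size=chunk_size):
                fd.write(chunk)

# usage: download_binary("https://www.tommonkey.cn/img/ctfhub_rce/rce_1/1-1.PNG", "./picture.jpg")
```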
```python
# example-2: crawl https://www.qiushibaike.com/imgrank/ and cut the image URLs
# out of the page with a regular expression
import os
import re
import requests

if __name__ == "__main__":
    if not os.path.exists("./picture"):
        os.mkdir("./picture")

    url1 = "https://www.qiushibaike.com/imgrank/page/"
    header1 = {
        'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36'
    }
    # regex that captures the url part of each <img> tag
    match_pic = '<div class="thumb">.*?<img src="(.*?)" alt=.*?</div>'

    for i in range(1, 2):  # choose how many pages to download
        new_url = url1 + "{}/".format(i)
        data_pic = requests.get(url=new_url, headers=header1).text
        data_pic_list = re.findall(match_pic, data_pic, re.S)
        print("Downloading the images of page {}!".format(i))

        n = 1
        for p in data_pic_list:
            pic_url = "https:" + p
            download_pic = requests.get(url=pic_url, headers=header1).content
            filename = p.split("/")[-1]
            path = "./picture/" + filename
            with open(path, 'wb') as fd:
                fd.write(download_pic)
            print("Image {} downloaded!".format(n))
            n = n + 1
        print("Page {} finished.......".format(i))
    print("The task is complete!")
```
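The site serves protocol-relative `src` values, which is why the loop prepends `"https:"` before requesting each one. A small helper that bundles the same regex and that normalization, so the download loop stays focused on I/O (a sketch; `extract_img_urls` is a name I made up):

```python
import re

# same pattern as example-2; re.S lets .*? span line breaks
IMG_PATTERN = re.compile(r'<div class="thumb">.*?<img src="(.*?)" alt=.*?</div>', re.S)

def extract_img_urls(html):
    # Normalize protocol-relative URLs like //pic.qiushibaike.com/... ;
    # leave absolute URLs untouched.
    return ["https:" + src if src.startswith("//") else src
            for src in IMG_PATTERN.findall(html)]
```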
```python
# example-3: xpath parsing example - scrape second-hand house titles from 58.com
import requests
from lxml import etree

if __name__ == "__main__":
    url1 = "https://hf.58.com/ershoufang/p1/"
    header1 = {
        'User-agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36"
    }
    response_page = requests.get(url=url1, headers=header1).text
    obj_page = etree.HTML(response_page)
    tag_content = obj_page.xpath('//div[@class="property-content-title"]/h3/text()')
    print(tag_content)
    with open('./info.txt', "w", encoding="utf-8") as fd:  # with closes the file; the original never did
        for n in tag_content:
            fd.write(n + "\n")
    print("over!")
```
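The `/ershoufang/p1/` path encodes the page number, so walking several pages is just a loop over `p1, p2, ...`. A sketch under that assumption (58.com may throttle or captcha rapid requests; `scrape_titles` is a hypothetical helper and the UA string is abbreviated):

```python
import requests
from lxml import etree

def scrape_titles(pages=3):
    # Collect titles across the p1/p2/... page pattern used in example-3.
    header1 = {"User-agent": "Mozilla/5.0"}  # abbreviated; use a full UA in practice
    titles = []
    for n in range(1, pages + 1):
        url = "https://hf.58.com/ershoufang/p{}/".format(n)
        page = requests.get(url=url, headers=header1).text
        tree = etree.HTML(page)
        titles.extend(tree.xpath('//div[@class="property-content-title"]/h3/text()'))
    return titles
```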
```python
# example-4: crawl images from www.bing.com
import os
import requests
from lxml import etree

if __name__ == "__main__":
    if not os.path.exists("./picture_down"):
        os.mkdir("./picture_down")
    url1 = "https://cn.bing.com/images/search?"
    # param1 can be modified to request more pages
    param1 = {
        "q": "机甲",  # search term: "mecha"
        "first": "1",
        "tsc": "ImageBasicHover"
    }
    header1 = {
        "User-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36"
    }
    content = requests.get(url=url1, params=param1, headers=header1).text  # request the html page
    obj_content = etree.HTML(content)
    pic_content = obj_content.xpath('//div[@id="mmComponent_images_1"]/ul/li//a[@class="iusc"]/@href')  # returned as a list
    pic_name = obj_content.xpath('//div[@id="mmComponent_images_1"]/ul/li//a[@class="iusc"]/@h')  # returned as a list
    num = 0
    for i in pic_content:
        pic_name_fix = pic_name[num].split(".")[-2]
        path_name = "./picture_down/" + pic_name_fix + ".jpg"  # full save path; num indexes the matching name
        new_url = "https://cn.bing.com" + i  # join into the image url
        down_pic = requests.get(url=new_url, headers=header1).content  # request the response as binary
        with open(path_name, "wb") as fd:
            fd.write(down_pic)
        num = num + 1
        print("Image {} downloaded!".format(num))

    print("The task is complete!")
    # only one page is downloaded here; to fetch more pages, wrap this in an
    # outer for loop and adjust the param1 values
```

The images that example-4 downloads cannot be opened; I have not found the cause yet. To be resolved.
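A plausible explanation, offered as an assumption rather than a verified fix: the `href` of each `a.iusc` anchor points at Bing's image *detail page*, so the bytes saved above are HTML, not JPEG data. At the time of writing those anchors also carried an `m` attribute holding JSON whose `murl` field is the direct image URL; a sketch based on that assumption (`bing_image_urls` is a made-up helper, and the attribute layout may have changed since):

```python
import json
from lxml import etree

def bing_image_urls(html):
    # Assumption: each a.iusc anchor has an "m" attribute containing JSON
    # with a "murl" key that points at the full-resolution image.
    tree = etree.HTML(html)
    urls = []
    for meta in tree.xpath('//a[@class="iusc"]/@m'):
        try:
            urls.append(json.loads(meta)["murl"])
        except (ValueError, KeyError):
            continue  # skip anchors whose metadata does not parse
    return urls
```

Downloading from those `murl` values instead of the joined `href` should yield real image bytes, if the assumption holds.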