Scraping Images of Beautiful Women with Java: Pretty Impressive

Author: Victor.Chang

Original: blog.csdn.net/qq_35402412/article/details/113627625

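Goal

Crawl about a thousand images of beautiful women from Sogou Images and download them to the local disk.

Preparation

Target URL: https://pic.sogou.com/pics?query=%E7%BE%8E%E5%A5%B3

Analysis

Open the URL above, press F12 to open the developer tools, and go to Network > XHR. Scroll down the page and a request like the following appears in the XHR list:

Request URL: https://pic.sogou.com/napi/pc/searchList?mode=1&start=48&xml_len=48&query=%E7%BE%8E%E5%A5%B3

The main parameters of this request URL are:

start=48: start retrieving from image 48

xml_len=48: fetch 48 images, counting from image 48

query=?: the search keyword (for example 美女, "beautiful women"; the browser percent-encodes it automatically, which does not affect our use)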
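
To make the three parameters concrete, here is a minimal sketch that assembles the same request URL by hand. The class name SearchUrlSketch and the method buildUrl are made up for illustration; only the endpoint and the parameter names come from the request observed above. The keyword is written with Unicode escapes so the source file stays ASCII-only.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class SearchUrlSketch {
    // Endpoint copied from the request URL seen in the browser's XHR list.
    static final String ENDPOINT = "https://pic.sogou.com/napi/pc/searchList";

    /** Builds the paged search URL; the keyword is percent-encoded as UTF-8. */
    static String buildUrl(int start, int xmlLen, String keyword) {
        try {
            String encoded = URLEncoder.encode(keyword, "UTF-8");
            return ENDPOINT + "?mode=1&start=" + start
                    + "&xml_len=" + xmlLen + "&query=" + encoded;
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is mandatory on every JVM, so this should never happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The two-character keyword from the article, as Unicode escapes.
        System.out.println(buildUrl(48, 48, "\u7f8e\u5973"));
    }
}
```

Encoding the keyword this way reproduces the %E7%BE%8E%E5%A5%B3 seen in the browser's request.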

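Click Response and paste the body into a JSON formatter to make it easier to read.

JSON formatter: https://www.bejson.com/

Looking at the Response, we can see that the image addresses we want are stored in the picUrl field.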
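
As a rough illustration of pulling picUrl values out of such a response, the sketch below uses only the JDK and a regular expression. The sample JSON is invented: it keeps just the data > items > picUrl nesting and uses placeholder example.com addresses, and the names PicUrlSketch and extractPicUrls are hypothetical. The real code later in this article uses fastjson instead, which is the more robust choice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PicUrlSketch {
    // Matches "picUrl":"..." and captures the value.
    // Assumption: the value contains no escaped double quotes.
    private static final Pattern PIC_URL =
            Pattern.compile("\"picUrl\"\\s*:\\s*\"([^\"]*)\"");

    static List<String> extractPicUrls(String json) {
        List<String> urls = new ArrayList<>();
        Matcher m = PIC_URL.matcher(json);
        while (m.find()) {
            // JSON may escape slashes as \/ ; undo that.
            urls.add(m.group(1).replace("\\/", "/"));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Invented sample that only mimics the nesting of the real response.
        String sample = "{\"data\":{\"items\":["
                + "{\"picUrl\":\"https://example.com/a.jpg\"},"
                + "{\"picUrl\":\"https://example.com/b.png\"}]}}";
        System.out.println(extractPicUrls(sample));
    }
}
```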

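Approach

With the analysis above, the download method is not hard to implement. The approach is:

  1. Set the URL request parameters

  2. Send the request and extract the image addresses

  3. Store the image addresses in a List

  4. Iterate over the List and download to the local disk with a thread pool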
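
The four steps can be sketched in miniature before looking at the full classes. This is only an outline under assumptions: CrawlPlanSketch, pageStarts and runAll are made-up names, the download is replaced by a stand-in task, and a fixed-size pool is used here purely to keep the example bounded.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

class CrawlPlanSketch {
    /** Step 1: values for the start parameter: start, start+size, ... below start+limit. */
    static List<Integer> pageStarts(int start, int size, int limit) {
        List<Integer> starts = new ArrayList<>();
        for (int i = start; i < start + limit; i += size) {
            starts.add(i);
        }
        return starts;
    }

    /** Step 4: runs one task per URL on a bounded pool; returns how many finished without throwing. */
    static int runAll(List<String> urls, Consumer<String> task, int threads) {
        AtomicInteger ok = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String url : urls) {
            pool.execute(() -> {
                try {
                    task.accept(url);
                    ok.incrementAndGet();
                } catch (RuntimeException e) {
                    // A failed task is simply not counted.
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(60, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // keep the interrupt flag set
        }
        return ok.get();
    }

    public static void main(String[] args) {
        System.out.println(pageStarts(0, 50, 200)); // prints [0, 50, 100, 150]
        // Steps 2 and 3 would fill this list from the responses; placeholders here.
        List<String> urls = Arrays.asList("u1", "u2", "u3");
        System.out.println(runAll(urls, u -> { /* stand-in for a download */ }, 2)); // prints 3
    }
}
```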

Code

SougouImgProcessor.java: the crawler class

import com.alibaba.fastjson.JSONObject;
import us.codecraft.webmagic.utils.HttpClientUtils;
import victor.chang.crawler.pipeline.SougouImgPipeline;

import java.util.ArrayList;
import java.util.List;

/**
 * A simple PageProcessor.
 * @author code4crafter@gmail.com
 * @since 0.1.0
 */
public class SougouImgProcessor {

    private String url;
    private SougouImgPipeline pipeline;
    private List<JSONObject> dataList;
    private List<String> urlList;
    private String word;

    public SougouImgProcessor(String url, String word) {
        this.url = url;
        this.word = word;
        this.pipeline = new SougouImgPipeline();
        this.dataList = new ArrayList<>();
        this.urlList = new ArrayList<>();
    }

    // Fetch one page of results and collect the picUrl of every item
    @SuppressWarnings("unchecked")
    public void process(int idx, int size) {
        String res = HttpClientUtils.get(String.format(this.url, idx, size, this.word));
        JSONObject object = JSONObject.parseObject(res);
        if (object == null) { // the request failed and returned an empty body
            return;
        }
        List<JSONObject> items = (List<JSONObject>) ((JSONObject) object.get("data")).get("items");
        for (JSONObject item : items) {
            this.urlList.add(item.getString("picUrl"));
        }
        this.dataList.addAll(items);
    }

    // Download
    public void pipelineData() {
        // Multithreaded
        pipeline.processSync(this.urlList, this.word);
    }

    public static void main(String[] args) {
        String url = "https://pic.sogou.com/napi/pc/searchList?mode=1&start=%s&xml_len=%s&query=%s";
        SougouImgProcessor processor = new SougouImgProcessor(url, "美女");

        int start = 0, size = 50, limit = 1000; // start index, images per request, total images to crawl

        for (int i = start; i < start + limit; i += size)
            processor.process(i, size);

        processor.pipelineData();

    }

}

SougouImgPipeline.java: the image download class

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.util.List;
import java.util.Objects;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Store results in files.
 * @author code4crafter@gmail.com
 * @since 0.1.0
 */
public class SougouImgPipeline {

    private String extension = ".jpg";
    private String path;

    private volatile AtomicInteger suc;
    private volatile AtomicInteger fails;

    public SougouImgPipeline() {
        setPath("E:/pipeline/sougou");
        suc = new AtomicInteger();
        fails = new AtomicInteger();
    }

    public SougouImgPipeline(String path) {
        setPath(path);
        suc = new AtomicInteger();
        fails = new AtomicInteger();
    }

    public SougouImgPipeline(String path, String extension) {
        setPath(path);
        this.extension = extension;
        suc = new AtomicInteger();
        fails = new AtomicInteger();
    }

    public void setPath(String path) {
        this.path = path;
    }

    /**
     * Download one image.
     * @param url  image address
     * @param cate category, used as the sub-directory name
     * @param name file name without extension
     * @throws Exception if the download fails
     */
    private void downloadImg(String url, String cate, String name) throws Exception {
        String path = this.path + "/" + cate + "/";
        File dir = new File(path);
        if (!dir.exists()) { // create the directory if it does not exist
            dir.mkdirs();
        }
        // Take the extension from the URL; fall back to the default if the URL has no usable one
        int dot = url.lastIndexOf(".");
        String realExt = dot < 0 ? extension : url.substring(dot);
        if (realExt.contains("/") || realExt.contains("?")) {
            realExt = extension;
        }
        String fileName = name + realExt;
        fileName = fileName.replace("-", "");
        String filePath = path + fileName;
        File img = new File(filePath);
        if (img.exists()) { // skip files that were downloaded earlier
            System.out.println(String.format("File %s already exists in the local directory", fileName));
            return;
        }

        URLConnection con = new URL(url).openConnection();
        con.setConnectTimeout(5000);
        con.setReadTimeout(5000);
        byte[] bs = new byte[1024];
        // Read and write; try-with-resources closes both streams even on failure
        try (InputStream inputStream = con.getInputStream();
             FileOutputStream os = new FileOutputStream(img)) {
            int len;
            while ((len = inputStream.read(bs)) != -1) {
                os.write(bs, 0, len);
            }
        } catch (IOException e) {
            img.delete(); // remove a partial file so that a later run retries it
            throw e;
        }
        System.out.println("picUrl: " + url);
        System.out.println(String.format("Downloaded image #%s", suc.incrementAndGet()));
    }

    /**
     * Single-threaded processing
     *
     * @param data
     * @param word
     */
    public void process(List<String> data, String word) {
        long start = System.currentTimeMillis();
        for (String picUrl : data) {
            if (picUrl == null)
                continue;
            try {
                // A raw URL is not a valid file name, so its hash is used as the name instead
                downloadImg(picUrl, word, Integer.toHexString(picUrl.hashCode()));
            } catch (Exception e) {
                fails.incrementAndGet();
            }
        }
        System.out.println("Succeeded: " + suc.get());
        System.out.println("Failed: " + fails.get());
        long end = System.currentTimeMillis();
        System.out.println("Elapsed: " + (end - start) / 1000 + " s");
    }

    /**
     * Multithreaded processing
     *
     * @param data
     * @param word
     */
    public void processSync(List<String> data, String word) {
        long start = System.currentTimeMillis();
        ExecutorService executorService = Executors.newCachedThreadPool(); // create a cached thread pool
        for (int i = 0; i < data.size(); i++) {
            String picUrl = data.get(i);
            if (picUrl == null)
                continue;
            // Zero-padded index, e.g. 0007; also stays unique once i reaches 1000
            String finalName = String.format("%04d", i);
            executorService.execute(() -> {
                try {
                    downloadImg(picUrl, word, finalName);
                } catch (Exception e) {
                    fails.incrementAndGet();
                }
            });
        }
        executorService.shutdown();
        try {
            if (!executorService.awaitTermination(60, TimeUnit.SECONDS)) {
                // On timeout, shutdownNow() would interrupt every thread in the pool.
                // executorService.shutdownNow();
            }
            System.out.println("AwaitTermination Finished");
            System.out.println("Total URLs: " + data.size());
            System.out.println("Succeeded: " + suc);
            System.out.println("Failed: " + fails);

            File dir = new File(this.path + "/" + word + "/");
            int len = Objects.requireNonNull(dir.list()).length;
            System.out.println("Files now in the directory: " + len);

            long end = System.currentTimeMillis();
            System.out.println("Elapsed: " + (end - start) / 1000.0 + " s");
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // keep the interrupt flag set
            e.printStackTrace();
        }

    }

    /**
     * Multithreaded processing in segments
     *
     * @param data
     * @param word
     * @param threadNum
     */
    public void processSync2(List<String> data, final String word, int threadNum) {
        if (data.size() < threadNum) {
            process(data, word);
        } else {
            ExecutorService executorService = Executors.newCachedThreadPool();
            int num = data.size() / threadNum; // number of items per segment
            for (int i = 0; i < threadNum; i++) {
                int start = i * num;
                int end = (i + 1) * num;
                if (i == threadNum - 1) {
                    end = data.size();
                }
                final List<String> cutList = data.subList(start, end);
                executorService.execute(() -> process(cutList, word));
            }
            executorService.shutdown();
        }
    }

}
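
The segmenting in processSync2 is easier to follow in isolation. The sketch below reproduces just the index arithmetic: every segment gets size / threadNum items and the last one absorbs the remainder. SegmentSketch and split are made-up names, and the sketch assumes the list has at least as many items as segments, mirroring the guard in processSync2.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SegmentSketch {
    /** Splits data into `parts` consecutive sublists; the last one takes the remainder. */
    static <T> List<List<T>> split(List<T> data, int parts) {
        List<List<T>> segments = new ArrayList<>();
        int num = data.size() / parts; // items per segment
        for (int i = 0; i < parts; i++) {
            int start = i * num;
            int end = (i == parts - 1) ? data.size() : (i + 1) * num;
            segments.add(data.subList(start, end));
        }
        return segments;
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
        // prints [[1, 2, 3], [4, 5, 6], [7, 8, 9, 10]]
        System.out.println(split(data, 3));
    }
}
```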

HttpClientUtils.java: the HTTP request utility class

import org.apache.http.Header;
import org.apache.http.HttpEntity;
import org.apache.http.NameValuePair;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.client.methods.HttpUriRequest;
import org.apache.http.conn.ssl.SSLConnectionSocketFactory;
import org.apache.http.conn.ssl.TrustStrategy;
import org.apache.http.entity.StringEntity;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.ssl.SSLContextBuilder;
import org.apache.http.util.EntityUtils;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import javax.net.ssl.HostnameVerifier;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLSession;
import java.io.IOException;
import java.io.UnsupportedEncodingException;
import java.security.GeneralSecurityException;
import java.security.cert.CertificateException;
import java.security.cert.X509Certificate;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * @author code4crafter@gmail.com
 * Date: 17/3/27
 */
public abstract class HttpClientUtils {

    public static Map<String, List<String>> convertHeaders(Header[] headers) {
        Map<String, List<String>> results = new HashMap<>();
        for (Header header : headers) {
            List<String> list = results.get(header.getName());
            if (list == null) {
                list = new ArrayList<>();
                results.put(header.getName(), list);
            }
            list.add(header.getValue());
        }
        return results;
    }

    /**
     * HTTP GET request
     * @param url
     */
    public static String get(String url) {
        return get(url, "UTF-8");
    }

    public static Logger logger = LoggerFactory.getLogger(HttpClientUtils.class);

    /**
     * HTTP GET request
     * @param url
     */
    public static String get(String url, String charset) {
        HttpGet httpGet = new HttpGet(url);
        return executeRequest(httpGet, charset);
    }

    /**
     * HTTP GET request with the AJAX request header added
     * @param url
     */
    public static String ajaxGet(String url) {
        return ajaxGet(url, "UTF-8");
    }

    /**
     * HTTP GET request with the AJAX request header added
     *
     * @param url
     */
    public static String ajaxGet(String url, String charset) {
        HttpGet httpGet = new HttpGet(url);
        httpGet.setHeader("X-Requested-With", "XMLHttpRequest");
        return executeRequest(httpGet, charset);
    }

    /**
     * @param url
     * @return
     */
    public static String ajaxGet(CloseableHttpClient httpclient, String url) {
        HttpGet httpGet = new HttpGet(url);
        httpGet.setHeader("X-Requested-With", "XMLHttpRequest");
        return executeRequest(httpclient, httpGet, "UTF-8");
    }

    /**
     * HTTP POST request with parameters passed as a map
     */
    public static String post(String url, Map<String, String> dataMap) {
        return post(url, dataMap, "UTF-8");
    }

    /**
     * HTTP POST request with parameters passed as a map
     */
    public static String post(String url, Map<String, String> dataMap, String charset) {
        HttpPost httpPost = new HttpPost(url);
        try {
            if (dataMap != null) {
                List<NameValuePair> nvps = new ArrayList<>();
                for (Map.Entry<String, String> entry : dataMap.entrySet()) {
                    nvps.add(new BasicNameValuePair(entry.getKey(), entry.getValue()));
                }
                UrlEncodedFormEntity formEntity = new UrlEncodedFormEntity(nvps, charset);
                formEntity.setContentEncoding(charset);
                httpPost.setEntity(formEntity);
            }
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        }
        return executeRequest(httpPost, charset);
    }

    /**
     * HTTP POST request with the AJAX request header added and parameters passed as a map
     */
    public static String ajaxPost(String url, Map<String, String> dataMap) {
        return ajaxPost(url, dataMap, "UTF-8");
    }

    /**
     * HTTP POST request with the AJAX request header added and parameters passed as a map
     */
    public static String ajaxPost(String url, Map<String, String> dataMap, String charset) {
        HttpPost httpPost = new HttpPost(url);
        httpPost.setHeader("X-Requested-With", "XMLHttpRequest");
        try {
            if (dataMap != null) {
                List<NameValuePair> nvps = new ArrayList<>();
                for (Map.Entry<String, String> entry : dataMap.entrySet()) {
                    nvps.add(new BasicNameValuePair(entry.getKey(), entry.getValue()));
                }
                UrlEncodedFormEntity formEntity = new UrlEncodedFormEntity(nvps, charset);
                formEntity.setContentEncoding(charset);
                httpPost.setEntity(formEntity);
            }
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        }
        return executeRequest(httpPost, charset);
    }

    /**
     * HTTP POST request with the AJAX request header added and a JSON body
     */
    public static String ajaxPostJson(String url, String jsonString) {
        return ajaxPostJson(url, jsonString, "UTF-8");
    }

    /**
     * HTTP POST request with the AJAX request header added and a JSON body
     */
    public static String ajaxPostJson(String url, String jsonString, String charset) {
        HttpPost httpPost = new HttpPost(url);
        httpPost.setHeader("X-Requested-With", "XMLHttpRequest");

        StringEntity stringEntity = new StringEntity(jsonString, charset); // avoids garbled Chinese characters
        stringEntity.setContentEncoding(charset);
        stringEntity.setContentType("application/json");
        httpPost.setEntity(stringEntity);
        return executeRequest(httpPost, charset);
    }

    /**
     * Execute an HTTP request; accepts an HttpGet or HttpPost
     */
    public static String executeRequest(HttpUriRequest httpRequest) {
        return executeRequest(httpRequest, "UTF-8");
    }

    /**
     * Execute an HTTP request; accepts an HttpGet or HttpPost
     */
    public static String executeRequest(HttpUriRequest httpRequest, String charset) {
        CloseableHttpClient httpclient;
        if ("https".equals(httpRequest.getURI().getScheme())) {
            httpclient = createSSLInsecureClient();
        } else {
            httpclient = HttpClients.createDefault();
        }
        String result = "";
        try {
            try {
                CloseableHttpResponse response = httpclient.execute(httpRequest);
                HttpEntity entity = null;
                try {
                    entity = response.getEntity();
                    result = EntityUtils.toString(entity, charset);
                } finally {
                    EntityUtils.consume(entity);
                    response.close();
                }
            } finally {
                httpclient.close();
            }
        } catch (IOException ex) {
            ex.printStackTrace();
        }
        return result;
    }

    public static String executeRequest(CloseableHttpClient httpclient, HttpUriRequest httpRequest, String charset) {
        String result = "";
        try {
            try {
                CloseableHttpResponse response = httpclient.execute(httpRequest);
                HttpEntity entity = null;
                try {
                    entity = response.getEntity();
                    result = EntityUtils.toString(entity, charset);
                } finally {
                    EntityUtils.consume(entity);
                    response.close();
                }
            } finally {
                httpclient.close();
            }
        } catch (IOException ex) {
            ex.printStackTrace();
        }
        return result;
    }

    /**
     * Create an SSL client that trusts every certificate and skips hostname verification
     */
    public static CloseableHttpClient createSSLInsecureClient() {
        try {
            SSLContext sslContext = new SSLContextBuilder().loadTrustMaterial(new TrustStrategy() {
                @Override
                public boolean isTrusted(X509Certificate[] chain, String authType) throws CertificateException {
                    return true;
                }
            }).build();
            SSLConnectionSocketFactory sslsf = new SSLConnectionSocketFactory(sslContext, new HostnameVerifier() {
                @Override
                public boolean verify(String hostname, SSLSession session) {
                    return true;
                }
            });
            return HttpClients.custom().setSSLSocketFactory(sslsf).build();
        } catch (GeneralSecurityException ex) {
            throw new RuntimeException(ex);
        }
    }
}
Running it

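Because of network conditions and other factors, not every image downloads successfully. Running the program several times helps, though, and gives a fairly high overall success rate.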
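
Instead of rerunning the whole program by hand, a failed download could also be retried in place. The helper below is one possible sketch and is not part of the original code: RetrySketch and retry are made-up names, and the doubling delay between attempts is an assumption rather than anything the article prescribes.

```java
import java.util.concurrent.Callable;

class RetrySketch {
    /**
     * Calls the action up to maxAttempts times, doubling the wait after each failure.
     * Returns the first successful result, or throws once every attempt has failed.
     */
    static <T> T retry(Callable<T> action, int maxAttempts, long initialDelayMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        long delay = initialDelayMillis;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                last = e;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delay);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt(); // keep the interrupt flag set
                    throw new IllegalStateException("interrupted while waiting to retry", ie);
                }
                delay *= 2;
            }
        }
        throw new IllegalStateException("failed after " + maxAttempts + " attempts", last);
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Stand-in for a flaky download: fails twice, then succeeds.
        String result = retry(() -> {
            if (++calls[0] < 3) {
                throw new java.io.IOException("simulated network error");
            }
            return "ok";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " calls"); // prints ok after 3 calls
    }
}
```

In the pipeline, the lambda passed to the thread pool would be the natural place to wrap a download call like this.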
