Scraping Academic Research Data with CAPTCHA Solving

Academic databases and journal portals use CAPTCHAs to limit automated access. Researchers conducting literature reviews, bibliometric analyses, and meta-analyses need to collect data from these sources at scale. CaptchaAI handles the CAPTCHA challenges automatically.

Academic Sources and CAPTCHAs

Source	CAPTCHA type	Trigger	Data
Google Scholar	reCAPTCHA v3	High-volume queries	Citations, papers
PubMed	reCAPTCHA v2	Repeated searches	Biomedical literature
Web of Science	Cloudflare Turnstile	Bulk downloads	Citation metrics
Scopus	reCAPTCHA v2	Export operations	Bibliographic data
IEEE Xplore	reCAPTCHA v2	Search + download	Engineering papers
JSTOR	reCAPTCHA v2	Page access	Humanities/social sciences

Citation Data Scraper

import requests
import time
import re
from bs4 import BeautifulSoup
import csv

CAPTCHAAI_KEY = "YOUR_API_KEY"
CAPTCHAAI_URL = "https://ocr.captchaai.com"

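def solve_captcha(method, sitekey, pageurl, **kwargs):
    """Submit a CAPTCHA task to CaptchaAI and poll until it is solved."""
    # Assumption: Turnstile tasks expect "sitekey" while reCAPTCHA tasks
    # expect "googlekey"; check the CaptchaAI docs for your task type.
    key_field = "sitekey" if method == "turnstile" else "googlekey"
    data = {
        "key": CAPTCHAAI_KEY, "method": method,
        key_field: sitekey, "pageurl": pageurl, "json": 1,
    }
    data.update(kwargs)
    resp = requests.post(f"{CAPTCHAAI_URL}/in.php", data=data, timeout=30)
    submitted = resp.json()
    if submitted.get("status") != 1:
        raise RuntimeError(f"CAPTCHA submit failed: {submitted.get('request')}")
    task_id = submitted["request"]
    for _ in range(60):  # Poll every 5 seconds, up to 5 minutes
        time.sleep(5)
        result = requests.get(f"{CAPTCHAAI_URL}/res.php", params={
            "key": CAPTCHAAI_KEY, "action": "get",
            "id": task_id, "json": 1,
        }, timeout=30)
        r = result.json()
        if r["request"] == "CAPCHA_NOT_READY":
            continue
        if r.get("status") != 1:
            # Any other non-success reply is an error code, not a token
            raise RuntimeError(f"CAPTCHA solve failed: {r['request']}")
        return r["request"]
    raise TimeoutError("CAPTCHA was not solved within 5 minutes")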

class AcademicScraper:
    def __init__(self, proxy=None):
        self.session = requests.Session()
        if proxy:
            self.session.proxies = {"http": proxy, "https": proxy}
        self.session.headers.update({
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 Chrome/126.0.0.0 Safari/537.36",
            "Accept-Language": "en-US,en;q=0.9",
        })

    def search_papers(self, search_url, query, max_pages=10):
        """Search academic database for papers matching query."""
        all_papers = []

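        for page in range(max_pages):
            # Let requests URL-encode the query (it may contain spaces)
            params = {"q": query, "start": page * 10}
            resp = self.session.get(search_url, params=params, timeout=30)
            url = resp.url  # Final URL, reused if a CAPTCHA must be solved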

            # Handle CAPTCHA
            if self._has_captcha(resp.text):
                resp = self._solve_and_retry(resp.text, url)

            papers = self._parse_results(resp.text)
            if not papers:
                break  # No more results

            all_papers.extend(papers)
            print(f"Page {page + 1}: {len(papers)} papers")
            time.sleep(5)  # Respectful delay

        return all_papers

    def get_paper_details(self, paper_url):
        """Get detailed metadata for a single paper."""
        resp = self.session.get(paper_url, timeout=30)

        if self._has_captcha(resp.text):
            resp = self._solve_and_retry(resp.text, paper_url)

        soup = BeautifulSoup(resp.text, "html.parser")
        return {
            "title": self._safe_text(soup, "h1, .article-title"),
            "authors": self._safe_text(soup, ".authors, .author-list"),
            "abstract": self._safe_text(soup, ".abstract, #abstract"),
            "doi": self._safe_text(soup, ".doi, [data-doi]"),
            "journal": self._safe_text(soup, ".journal-name, .publication"),
            "year": self._safe_text(soup, ".pub-date, .year"),
            "citations": self._safe_text(soup, ".citation-count, .cited-by"),
        }

    def export_to_csv(self, papers, filename):
        """Export collected papers to CSV."""
        if not papers:
            return
        keys = papers[0].keys()
        with open(filename, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=keys)
            writer.writeheader()
            writer.writerows(papers)
        print(f"Exported {len(papers)} papers to {filename}")

    def _has_captcha(self, html):
        return any(tag in html.lower() for tag in [
            'data-sitekey', 'g-recaptcha', 'cf-turnstile',
        ])

    def _solve_and_retry(self, html, url):
        # Pull the site key out of the challenge page
        match = re.search(r'data-sitekey=["\']([^"\']+)["\']', html)
        if not match:
            return self.session.get(url, timeout=30)

        sitekey = match.group(1)
        # Posting the token back to the same URL is a simplification;
        # real sites may expect a different endpoint or extra form fields.
        if 'cf-turnstile' in html:
            token = solve_captcha("turnstile", sitekey, url)
            field = "cf-turnstile-response"
        else:
            token = solve_captcha("userrecaptcha", sitekey, url)
            field = "g-recaptcha-response"
        return self.session.post(url, data={field: token}, timeout=30)

    def _parse_results(self, html):
        soup = BeautifulSoup(html, "html.parser")
        papers = []
        for item in soup.select(".gs_r, .search-result, article.result"):
            title_el = item.select_one("h3 a, .result-title a")
            if title_el:
                papers.append({
                    "title": title_el.get_text(strip=True),
                    "url": title_el.get("href", ""),
                    "snippet": self._safe_text(item, ".gs_rs, .abstract-snippet"),
                    "authors": self._safe_text(item, ".gs_a, .author-info"),
                })
        return papers

    def _safe_text(self, soup, selector):
        el = soup.select_one(selector)
        return el.get_text(strip=True) if el else ""

# Usage — Literature review
scraper = AcademicScraper(
    proxy="http://user:pass@residential.proxy.com:5000"
)

papers = scraper.search_papers(
    "https://scholar.example.com/scholar",
    query="machine learning CAPTCHA solving",
    max_pages=5,
)

# Get details for top papers
detailed = []
for paper in papers[:20]:
    if paper["url"]:
        detail = scraper.get_paper_details(paper["url"])
        detailed.append(detail)
        time.sleep(3)

scraper.export_to_csv(detailed, "literature_review.csv")

Bibliometric Analysis

def bibliometric_analysis(scraper, seed_papers, depth=2):
    """Follow citations to build a citation network."""
    visited = set()
    network = []

    def _crawl(paper_url, current_depth):
        if current_depth > depth or paper_url in visited:
            return
        visited.add(paper_url)

        try:
            details = scraper.get_paper_details(paper_url)
            network.append(details)

            # Follow "cited by" links
            resp = scraper.session.get(f"{paper_url}/citations", timeout=30)
            if scraper._has_captcha(resp.text):
                resp = scraper._solve_and_retry(resp.text, f"{paper_url}/citations")

            citations = scraper._parse_results(resp.text)
            for cite in citations[:5]:  # Limit breadth
                if cite["url"]:
                    _crawl(cite["url"], current_depth + 1)
                    time.sleep(3)

        except Exception as e:
            print(f"Error crawling {paper_url}: {e}")

    for paper in seed_papers:
        _crawl(paper["url"], 0)

    return network
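
The crawl above collects paper details but does not record which paper cites which. A minimal sketch of one way to keep the edges as well, assuming a hypothetical `fetch_citing` callable that takes a paper URL and returns the URLs of papers citing it (for example, a wrapper around `scraper.session.get` and `scraper._parse_results`). Here `depth` counts hops from the seed papers, which is slightly stricter than the recursive version.

```python
from collections import deque


def crawl_citation_edges(seed_urls, fetch_citing, depth=2, breadth=5):
    """Breadth-first crawl that returns (citing_url, cited_url) pairs.

    fetch_citing is a hypothetical callable: given a paper URL, it
    returns a list of URLs of papers that cite it.
    """
    visited = set()
    queue = deque()
    for url in seed_urls:
        if url not in visited:
            visited.add(url)
            queue.append((url, 0))

    edges = []
    while queue:
        url, level = queue.popleft()
        if level >= depth:
            continue  # Papers at the depth limit are not expanded
        for citing in fetch_citing(url)[:breadth]:  # Limit breadth
            edges.append((citing, url))
            if citing not in visited:
                visited.add(citing)
                queue.append((citing, level + 1))
    return edges
```

An iterative queue also avoids deep recursion on long citation chains. Any delay between requests would go inside `fetch_citing`.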

Rate Limits for Academic Sites

Source	Recommended delay	Max pages/hour
Google Scholar	10-15 seconds	40-50
PubMed	3-5 seconds	100
Web of Science	5-10 seconds	60
Scopus	5-10 seconds	60
IEEE	3-5 seconds	100
JSTOR	5-10 seconds	60

Academic sites are quick to ban IP addresses. Use conservative delays.
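
One way to apply these limits in code is a small throttle that sleeps a random time within the recommended range and also enforces the hourly cap. The sketch below is illustrative only: the `Throttle` class, the `LIMITS` dictionary, and its key names are assumptions, and the numbers are copied from the table above (using the lower bound where the table gives a range for the cap).

```python
import random
import time

# Delay ranges (seconds) and hourly caps taken from the table above.
LIMITS = {
    "google_scholar": {"delay": (10, 15), "max_per_hour": 40},
    "pubmed": {"delay": (3, 5), "max_per_hour": 100},
    "web_of_science": {"delay": (5, 10), "max_per_hour": 60},
    "scopus": {"delay": (5, 10), "max_per_hour": 60},
    "ieee": {"delay": (3, 5), "max_per_hour": 100},
    "jstor": {"delay": (5, 10), "max_per_hour": 60},
}


class Throttle:
    """Hypothetical helper: random delay plus a rolling one-hour cap."""

    def __init__(self, source, clock=time.monotonic, sleep=time.sleep):
        self.limits = LIMITS[source]
        self.clock = clock
        self.sleep = sleep
        self.stamps = []  # Times of requests made within the last hour

    def wait(self):
        """Block until the next request is allowed, then record it."""
        if self.stamps:
            low, high = self.limits["delay"]
            self.sleep(random.uniform(low, high))
        now = self.clock()
        # Forget requests that are more than an hour old
        self.stamps = [t for t in self.stamps if now - t < 3600]
        if len(self.stamps) >= self.limits["max_per_hour"]:
            # Wait until the oldest request leaves the one-hour window
            self.sleep(self.stamps[0] + 3600 - now)
            self.stamps.pop(0)
            now = self.clock()
        self.stamps.append(now)
```

Calling `throttle.wait()` before each `session.get` would replace the fixed `time.sleep` calls. The `clock` and `sleep` parameters exist only so the behavior can be tested without real waiting.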

Troubleshooting

Issue	Cause	Fix
CAPTCHA on every search	IP address flagged by the academic site	Switch proxies, increase delay to more than 15 seconds
No results returned	A CAPTCHA page was returned instead	Check for a CAPTCHA before parsing
Abstract missing	Behind a paywall	Use an institutional proxy or open-access sources
Scholar blocks IP	Rate limit exceeded	Wait 30 minutes, use a different IP
Export limited	Site caps bulk downloads	Download in smaller batches
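
The "switch proxies" and "wait 30 minutes" fixes can be combined in a small rotation helper. This is a sketch under assumptions: `ProxyPool` and its method names are hypothetical, and the 30-minute cool-down comes from the table above rather than from any documented site policy.

```python
import time


class ProxyPool:
    """Hypothetical helper: rotate proxies and rest blocked ones."""

    def __init__(self, proxies, cooldown=30 * 60, clock=time.monotonic):
        self.proxies = list(proxies)
        self.cooldown = cooldown   # Seconds a blocked proxy must rest
        self.clock = clock
        self.blocked_until = {}    # Proxy URL -> time it may be reused
        self.index = 0

    def mark_blocked(self, proxy):
        """Record that a proxy was blocked or hit a CAPTCHA every time."""
        self.blocked_until[proxy] = self.clock() + self.cooldown

    def next_proxy(self):
        """Return the next usable proxy, or None if all are resting."""
        now = self.clock()
        for _ in range(len(self.proxies)):
            proxy = self.proxies[self.index % len(self.proxies)]
            self.index += 1
            if self.blocked_until.get(proxy, 0) <= now:
                return proxy
        return None
```

With the scraper above, the returned value could be assigned as `scraper.session.proxies = {"http": proxy, "https": proxy}` before the next request. A `None` result means every proxy is resting and the crawl should pause.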

Frequently Asked Questions

Is scraping academic databases allowed?

Public metadata (titles, authors, abstracts) is generally accessible. Full-text access depends on licensing. PubMed explicitly supports programmatic access through its E-utilities API. Always prefer official APIs when available.
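
As an illustration of the official route, the sketch below queries the NCBI E-utilities ESearch endpoint for PubMed IDs using only the standard library. The helper names (`build_esearch_url`, `parse_esearch`, `search_pubmed`) are assumptions made for this example. NCBI publishes its own usage limits and offers API keys, so check its guidelines before running queries in bulk.

```python
import json
import urllib.parse
import urllib.request

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(term, retmax=20, retstart=0):
    """Build an ESearch URL that returns PubMed IDs as JSON."""
    params = {
        "db": "pubmed",
        "term": term,
        "retmode": "json",
        "retmax": retmax,
        "retstart": retstart,
    }
    return f"{EUTILS_BASE}/esearch.fcgi?{urllib.parse.urlencode(params)}"


def parse_esearch(payload):
    """Extract the total hit count and the PMID list from an ESearch reply."""
    result = payload["esearchresult"]
    return int(result["count"]), result["idlist"]


def search_pubmed(term, retmax=20):
    """Run one ESearch query (this makes a network request)."""
    url = build_esearch_url(term, retmax)
    with urllib.request.urlopen(url, timeout=30) as resp:
        return parse_esearch(json.load(resp))
```

A call such as `search_pubmed("machine learning CAPTCHA")` returns the total hit count and a list of PMIDs, with no CAPTCHA handling involved.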

How do I avoid getting blocked on Google Scholar?

Use delays of 10-15 seconds between requests, rotate across varied request origins, and stay under 50 queries per hour. Scholar is very aggressive about blocking automated access.