Initial commit
skills/citation-management/references/google_scholar_search.md
# Google Scholar Search Guide

A guide to searching Google Scholar for academic papers, covering advanced search operators, filtering strategies, and metadata extraction.

## Overview

Google Scholar offers some of the broadest coverage of academic literature of any free search engine:
- **Coverage**: 100+ million scholarly documents (Google does not publish an official index size)
- **Scope**: All academic disciplines
- **Content types**: Journal articles, books, theses, conference papers, preprints, patents, court opinions
- **Citation tracking**: "Cited by" links for forward citation tracking
- **Accessibility**: Free to use; no account required for searching

## Basic Search

### Simple Keyword Search

Search for papers containing specific terms anywhere in the document (title, abstract, full text):

```
CRISPR gene editing
machine learning protein folding
climate change impact agriculture
quantum computing algorithms
```

**Tips**:
- Use specific technical terms
- Include key acronyms and abbreviations
- Start broad, then refine
- Check spelling of technical terms

### Exact Phrase Search

Use quotation marks to search for exact phrases:

```
"deep learning"
"CRISPR-Cas9"
"systematic review"
"randomized controlled trial"
```

**When to use**:
- Technical terms that must appear together
- Proper names
- Specific methodologies
- Exact titles

## Advanced Search Operators

### Author Search

Find papers by specific authors:

```
author:LeCun
author:"Geoffrey Hinton"
author:Church synthetic biology
```

**Variations**:
- Single last name: `author:Smith`
- Full name in quotes: `author:"Jane Smith"`
- Author + topic: `author:Doudna CRISPR`

**Tips**:
- Authors may publish under different name variations
- Try with and without middle initials
- Consider name changes (marriage, etc.)
- Use quotation marks for full names

### Title Search

Search only in article titles:

```
intitle:transformer
intitle:"attention mechanism"
intitle:review climate change
```

`intitle:` applies only to the term or quoted phrase that immediately follows it: in the last example only `review` must appear in the title, while `climate change` may match anywhere. Use `allintitle:` to require every term in the title.

**Use cases**:
- Finding papers specifically about a topic
- More precise than full-text search
- Reduces irrelevant results
- Good for finding reviews or methods

### Source (Journal) Search

Search within specific journals or conferences:

```
source:Nature
source:"Nature Communications"
source:NeurIPS
source:"Journal of Machine Learning Research"
```

**Applications**:
- Track publications in top-tier venues
- Find papers in specialized journals
- Identify conference-specific work
- Verify publication venue

### Exclusion Operator

Exclude results that contain a term:

```
machine learning -survey
CRISPR -patent
climate change -news
deep learning -tutorial -review
```

**Common exclusions**:
- `-survey`: Drop results mentioning "survey" (mostly survey papers)
- `-review`: Drop results mentioning "review" (mostly review articles)
- `-patent`: Drop results mentioning "patent"
- `-book`: Drop results mentioning "book"
- `-news`: Drop results mentioning "news"
- `-tutorial`: Drop results mentioning "tutorial"

The `-` operator matches words, not document types, so it can also remove ordinary papers that happen to use the excluded word.

### OR Operator

Search for papers containing any of multiple terms:

```
"machine learning" OR "deep learning"
CRISPR OR "gene editing"
"climate change" OR "global warming"
```

**Best practices**:
- OR must be uppercase
- Combine synonyms
- Include acronyms and spelled-out versions
- Use with exact phrases

### Wildcard Search

Use an asterisk (`*`) inside a quoted phrase as a placeholder for unknown whole words:

```
"machine * learning"
"CRISPR * editing"
"* neural network"
```

**Note**: Wildcard support in Google Scholar is limited compared to other databases; do not rely on word truncation such as `edit*`.
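
The operators in this section can be composed programmatically. The sketch below is a minimal illustration, not part of the repository's scripts: `build_query` and its parameter names are hypothetical, and it only assembles the query string.

```python
def build_query(terms="", *, author=None, intitle=None, source=None, exclude=()):
    """Compose a Google Scholar query string from the operators above.

    Hypothetical helper for illustration; it builds text only and sends nothing.
    """

    def quote(value):
        # Multi-word values are quoted so the operator covers the whole phrase
        return f'"{value}"' if " " in value else value

    parts = []
    if author:
        parts.append(f"author:{quote(author)}")
    if intitle:
        parts.append(f"intitle:{quote(intitle)}")
    if source:
        parts.append(f"source:{quote(source)}")
    if terms:
        parts.append(terms)
    # Each excluded term gets its own leading minus sign
    parts.extend(f"-{quote(term)}" for term in exclude)
    return " ".join(parts)


print(build_query("CRISPR", author="Doudna", exclude=["patent"]))
```

Keeping the free-text terms unquoted leaves phrase quoting to the caller, which matches how the examples above mix quoted and unquoted terms.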

## Advanced Filtering

### Year Range

Filter by publication year:

**Using interface**:
- Click "Since [year]" on left sidebar
- Select custom range

**Using search operators**:
```
# The query syntax has no year operator
# Use the interface or the as_ylo / as_yhi URL parameters
```

**In script**:
```bash
python scripts/search_google_scholar.py "quantum computing" \
    --year-start 2020 \
    --year-end 2024
```
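
As a rough illustration of the URL-parameter route, the sketch below builds a results URL with a year range using the `as_ylo` and `as_yhi` parameters. The helper name `scholar_url` is an assumption made for this example; it only constructs the URL and does not fetch anything.

```python
from urllib.parse import urlencode

SCHOLAR_URL = "https://scholar.google.com/scholar"


def scholar_url(query, year_start=None, year_end=None):
    """Build a Google Scholar results URL with an optional year range."""
    params = {"q": query}
    if year_start is not None:
        params["as_ylo"] = year_start  # earliest publication year
    if year_end is not None:
        params["as_yhi"] = year_end  # latest publication year
    return f"{SCHOLAR_URL}?{urlencode(params)}"


print(scholar_url("quantum computing", 2020, 2024))
```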

### Sorting Options

**By relevance** (default):
- Google's algorithm determines relevance
- Considers citations, author reputation, publication venue
- Generally good for most searches

**By date**:
- Most recent papers first
- Good for fast-moving fields
- Shows only recently added articles, so it misses highly cited older papers
- Click "Sort by date" in interface

**By citation count** (via script; the web interface sorts only by relevance or date):
```bash
python scripts/search_google_scholar.py "transformers" \
    --sort-by citations \
    --limit 50
```
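
If results have already been saved, a citation sort can also be done after the fact. This is a minimal sketch that assumes each result is a dict with a `citations` field; that field name is hypothetical and may differ from what the script actually writes.

```python
def sort_by_citations(results, limit=None):
    """Return results ordered by citation count, most cited first.

    Assumes a hypothetical "citations" key; missing or None counts as 0.
    """
    ranked = sorted(results, key=lambda r: r.get("citations") or 0, reverse=True)
    return ranked[:limit] if limit is not None else ranked


papers = [
    {"title": "Paper A", "citations": 12},
    {"title": "Paper B", "citations": None},
    {"title": "Paper C", "citations": 340},
]
print([p["title"] for p in sort_by_citations(papers, limit=2)])
```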

### Language Filtering

**In interface**:
- Settings → Languages
- Select preferred languages

**Default**: English and papers with English abstracts

## Search Strategies

### Finding Seminal Papers

Identify highly influential papers in a field:

1. **Search by topic** with broad terms
2. **Sort by citations** (most cited first; use the script, since the web interface sorts only by relevance or date)
3. **Look for review articles** for comprehensive overviews
4. **Check publication dates** for foundational vs recent work

**Example**:
```
"generative adversarial networks"
# Sort by citations
# Top results: original GAN paper (Goodfellow et al., 2014), key variants
```

### Finding Recent Work

Stay current with latest research:

1. **Search by topic**
2. **Filter to recent years** (last 1-2 years)
3. **Sort by date** for newest first
4. **Set up alerts** for ongoing tracking

**Example**:
```bash
python scripts/search_google_scholar.py "AlphaFold protein structure" \
    --year-start 2023 \
    --year-end 2024 \
    --limit 50
```

### Finding Review Articles

Get comprehensive overviews of a field:

```
intitle:review "machine learning"
"systematic review" CRISPR
intitle:survey "natural language processing"
```

**Indicators**:
- "review", "survey", "perspective" in title
- Often highly cited
- Published in review journals (Nature Reviews, Trends, etc.)
- Comprehensive reference lists

### Citation Chain Search

**Forward citations** (papers citing a key paper):
1. Find seminal paper
2. Click "Cited by X"
3. See all papers that cite it
4. Identify how field has developed

**Backward citations** (references in a key paper):
1. Find recent review or important paper
2. Check its reference list
3. Identify foundational work
4. Trace development of ideas

**Example workflow**:
```
# Find original transformer paper
"Attention is all you need" author:Vaswani

# Check "Cited by 120,000+"
# See evolution: BERT, GPT, T5, etc.

# Check references in original paper
# Find RNN, LSTM, attention mechanism origins
```

### Comprehensive Literature Search

For thorough coverage (e.g., systematic reviews):

1. **Generate synonym list**:
   - Main terms + alternatives
   - Acronyms + spelled out
   - US vs UK spelling

2. **Use OR operators**:
   ```
   ("machine learning" OR "deep learning" OR "neural networks")
   ```

3. **Combine multiple concepts**:
   ```
   ("machine learning" OR "deep learning") ("drug discovery" OR "drug development")
   ```

4. **Search without date filters** initially:
   - Get total landscape
   - Filter later if too many results

5. **Export results** for systematic analysis:
   ```bash
   python scripts/search_google_scholar.py \
       '("machine learning" OR "deep learning") ("drug discovery" OR "drug development")' \
       --limit 500 \
       --output comprehensive_search.json
   ```
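
Steps 1 to 3 above can be sketched in code: each concept is a list of synonyms, synonyms are joined with `OR`, and the concept groups are placed side by side. The function names `or_group` and `combine_concepts` are illustrative assumptions, not part of the repository's scripts.

```python
def or_group(synonyms):
    """Join synonyms with uppercase OR, quoting multi-word phrases."""
    quoted = [f'"{s}"' if " " in s else s for s in synonyms]
    if len(quoted) == 1:
        return quoted[0]
    return "(" + " OR ".join(quoted) + ")"


def combine_concepts(*concepts):
    """Place one OR-group per concept side by side, skipping empty lists."""
    return " ".join(or_group(c) for c in concepts if c)


query = combine_concepts(
    ["machine learning", "deep learning"],
    ["drug discovery", "drug development"],
)
print(query)
```

Spelling variants fit the same pattern, for example `or_group(["tumor", "tumour"])`.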

## Extracting Citation Information

### From Google Scholar Results Page

Each result shows:
- **Title**: Paper title (linked to full text if available)
- **Authors**: Author list (often truncated)
- **Source**: Journal/conference, year, publisher
- **Cited by**: Number of citations + link to citing papers
- **Related articles**: Link to similar papers
- **All versions**: Different versions of the same paper

### Export Options

**Manual export**:
1. Click "Cite" under paper
2. Select BibTeX format
3. Copy citation

**Limitations**:
- One paper at a time
- Manual process
- Time-consuming for many papers

**Automated export** (using script):
```bash
# Search and export to BibTeX
python scripts/search_google_scholar.py "quantum computing" \
    --limit 50 \
    --format bibtex \
    --output quantum_papers.bib
```

### Metadata Available

From Google Scholar you can typically extract:
- Title
- Authors (may be incomplete)
- Year
- Source (journal/conference)
- Citation count
- Link to full text (when available)
- Link to PDF (when available)

**Note**: Metadata quality varies:
- Some fields may be missing
- Author names may be incomplete
- Need to verify with DOI lookup for accuracy
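
One way to do the DOI lookup mentioned above is the public Crossref REST API (`https://api.crossref.org/works/<DOI>`). The sketch below is an assumption about how such a check might look, not the repository's `extract_metadata.py`; the helper names are hypothetical, and `lookup_doi` performs a network request.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def crossref_url(doi):
    """URL of the Crossref works record for a DOI."""
    return "https://api.crossref.org/works/" + quote(doi, safe="/")


def parse_crossref(message):
    """Pull core citation fields out of a Crossref `message` object."""
    authors = [
        " ".join(part for part in (a.get("given"), a.get("family")) if part)
        for a in message.get("author", [])
    ]
    date_parts = message.get("issued", {}).get("date-parts", [[None]])
    return {
        "title": (message.get("title") or [None])[0],
        "authors": authors,
        "year": date_parts[0][0] if date_parts and date_parts[0] else None,
        "venue": (message.get("container-title") or [None])[0],
    }


def lookup_doi(doi, timeout=10):
    """Fetch and parse one record (network call; not run here)."""
    with urlopen(crossref_url(doi), timeout=timeout) as response:
        return parse_crossref(json.load(response)["message"])
```

Comparing the returned title, authors, and year with the Google Scholar record highlights truncated author lists and missing fields.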

## Rate Limiting and Access

### Rate Limits

Google Scholar rate-limits clients to deter automated scraping:

**Symptoms of rate limiting**:
- CAPTCHA challenges
- Temporary IP blocks
- 429 "Too Many Requests" errors

**Best practices**:
1. **Add delays between requests**: 2-5 seconds minimum
2. **Limit query volume**: Don't search hundreds of queries rapidly
3. **Use the `scholarly` library**: Wraps Scholar requests and supports proxy configuration; you still need to pace your queries
4. **Rotate User-Agents**: Appear as different browsers
5. **Consider proxies**: For large-scale searches (use ethically)

**In our scripts**:
```python
import random
import time
# Automatic rate limiting built in
time.sleep(random.uniform(3, 7))  # Random delay 3-7 seconds
```
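
Beyond a fixed random delay, a script can slow down further when it sees a rate-limit signal. The following is a sketch of exponential backoff with jitter under stated assumptions: `RateLimited` and `fetch_with_backoff` are hypothetical names, and the caller's `fetch` function is expected to raise `RateLimited` on a CAPTCHA page or HTTP 429.

```python
import random
import time


class RateLimited(Exception):
    """Raised by the caller's fetch function on a CAPTCHA or HTTP 429."""


def fetch_with_backoff(fetch, max_retries=4, base_delay=5.0, sleep=time.sleep):
    """Call fetch(), waiting longer after each rate-limit signal."""
    for attempt in range(max_retries + 1):
        try:
            return fetch()
        except RateLimited:
            if attempt == max_retries:
                raise  # give up after the final retry
            # Exponential backoff plus up to 1 s of jitter: ~5 s, ~10 s, ~20 s, ...
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            sleep(delay)
```

The `sleep` parameter exists so the waiting can be replaced in tests; in normal use the default `time.sleep` applies.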

### Ethical Considerations

**DO**:
- Respect rate limits
- Use reasonable delays
- Cache results (don't re-query)
- Use official APIs when available (Google Scholar has none; Crossref and PubMed do)
- Attribute data properly

**DON'T**:
- Scrape aggressively
- Use multiple IPs to bypass limits
- Violate terms of service
- Burden servers unnecessarily
- Use data commercially without permission

### Institutional Access

**Benefits of institutional access**:
- Access to full-text PDFs through library subscriptions
- Better download capabilities
- Integration with library systems
- Link resolver to full text

**Setup**:
- Google Scholar → Settings → Library links
- Add your institution
- Links appear in search results

## Tips and Best Practices

### Search Optimization

1. **Start simple, then refine**:
   ```
   # Too specific initially (with a 2023-2024 year filter on top)
   intitle:"deep learning" intitle:review source:Nature

   # Better approach
   deep learning review
   # Review results
   # Add intitle:, source:, year filters as needed
   ```

2. **Use multiple search strategies**:
   - Keyword search
   - Author search for known experts
   - Citation chaining from key papers
   - Source search in top journals

3. **Check spelling and variations**:
   - Color vs colour
   - Optimization vs optimisation
   - Tumor vs tumour
   - Try common misspellings if few results

4. **Combine operators strategically**:
   ```
   # Good combination
   author:Church intitle:"synthetic biology"

   # Finds papers by a specific author with the topic in the title
   # Restrict to 2015-2024 with the sidebar year filter or --year-start / --year-end
   ```

### Result Evaluation

1. **Check citation counts**:
   - High citations indicate influence
   - Recent papers may have low citations but be important
   - Citation counts vary by field

2. **Verify publication venue**:
   - Peer-reviewed journals vs preprints
   - Conference proceedings
   - Book chapters
   - Technical reports

3. **Check for full text access**:
   - [PDF] link on right side
   - "All X versions" may have open access version
   - Check institutional access
   - Try author's website or ResearchGate

4. **Look for review articles**:
   - Comprehensive overviews
   - Good starting point for new topics
   - Extensive reference lists
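
Because recent papers have had less time to accumulate citations, a simple normalization is citations per year since publication. This is one possible heuristic, offered as a sketch; the function name and the choice to count the publication year as a full year are assumptions.

```python
from datetime import date


def citations_per_year(citations, pub_year, current_year=None):
    """Average citations per year, counting the publication year as year 1."""
    if current_year is None:
        current_year = date.today().year
    age = max(current_year - pub_year + 1, 1)  # never divide by zero
    return citations / age


print(citations_per_year(100, 2015, current_year=2024))
```

The metric does not correct for field differences, so it is best used to compare papers within one topic.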

### Managing Results

1. **Use citation manager integration**:
   - Export to BibTeX
   - Import to Zotero, Mendeley, EndNote
   - Maintain organized library

2. **Set up alerts** for ongoing research:
   - Google Scholar → Alerts
   - Get emails for new papers matching query
   - Track specific authors or topics

3. **Create collections**:
   - Save papers to Google Scholar Library
   - Organize by project or topic
   - Add labels and notes

4. **Export systematically**:
   ```bash
   # Save search results for later analysis
   python scripts/search_google_scholar.py "your topic" \
       --output topic_papers.json

   # Can re-process later without re-searching
   python scripts/extract_metadata.py \
       --input topic_papers.json \
       --output topic_refs.bib
   ```

## Advanced Techniques

### Boolean Logic Combinations

Combine multiple operators for precise searches:

```
# Reviews on a specific topic by known authors (apply a 2020-2024 year filter separately)
intitle:review "machine learning" ("drug discovery" OR "drug development") (author:Horvath OR author:Bengio)

# Method papers excluding reviews
intitle:method "protein folding" -review -survey

# Papers in top journals only (apply a 2022-2024 year filter separately)
# Note: source:Nature may also match titles such as Nature Communications
(source:Nature OR source:Science OR source:Cell) CRISPR
```

### Finding Open Access Papers

```
# Search with generic terms
machine learning

# Filter by "All versions" which often includes preprints
# Look for green [PDF] links (often open access)
# Check arXiv, bioRxiv versions
```

**In script**:
```bash
python scripts/search_google_scholar.py "topic" \
    --open-access-only \
    --output open_access_papers.json
```

### Tracking Research Impact

**For a specific paper**:
1. Find the paper
2. Click "Cited by X"
3. Analyze citing papers:
   - How is it being used?
   - What fields cite it?
   - Recent vs older citations?

**For an author**:
1. Search `author:LastName`
2. Open the author's Google Scholar profile (linked from their name in results when one exists) and check h-index and i10-index
3. View citation history graph
4. Identify most influential papers

**For a topic**:
1. Search topic
2. Sort by citations
3. Identify seminal papers (highly cited, older)
4. Check recent highly-cited papers (emerging important work)
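
For reference, the two author metrics mentioned above have simple definitions: the h-index is the largest h such that h papers have at least h citations each, and the i10-index is the number of papers with at least 10 citations. A minimal sketch that computes both from a list of per-paper citation counts:

```python
def h_index(citation_counts):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def i10_index(citation_counts):
    """Number of papers with at least 10 citations."""
    return sum(1 for count in citation_counts if count >= 10)


counts = [25, 8, 5, 3, 3]
print(h_index(counts), i10_index(counts))
```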

### Finding Preprints and Early Work

```
# arXiv papers
source:arxiv "deep learning"

# bioRxiv papers
source:biorxiv CRISPR

# All preprint servers
(source:arxiv OR source:biorxiv OR source:medrxiv) your topic
```

**Note**: Preprints are not peer-reviewed. Always check whether a published version exists.

## Common Issues and Solutions

### Too Many Results

**Problem**: Search returns 100,000+ results, overwhelming.

**Solutions**:
1. Add more specific terms
2. Use `intitle:` to search only titles
3. Filter by recent years
4. Add exclusions (e.g., `-review`)
5. Search within specific journals

### Too Few Results

**Problem**: Search returns 0-10 results, suspiciously few.

**Solutions**:
1. Remove restrictive operators
2. Try synonyms and related terms
3. Check spelling
4. Broaden year range
5. Use OR for alternative terms

### Irrelevant Results

**Problem**: Results don't match intent.

**Solutions**:
1. Use exact phrases with quotes
2. Add more specific context terms
3. Use `intitle:` for title-only search
4. Exclude common irrelevant terms
5. Combine multiple specific terms

### CAPTCHA or Rate Limiting

**Problem**: Google Scholar shows CAPTCHA or blocks access.

**Solutions**:
1. Wait several minutes before continuing
2. Reduce query frequency
3. Use longer delays in scripts (5-10 seconds)
4. Switch to different IP/network
5. Consider using institutional access

### Missing Metadata

**Problem**: Author names, year, or venue missing from results.

**Solutions**:
1. Click through to see full details
2. Check "All versions" for better metadata
3. Look up by DOI if available
4. Extract metadata from CrossRef/PubMed instead
5. Manually verify from paper PDF

### Duplicate Results

**Problem**: Same paper appears multiple times.

**Solutions**:
1. Click "All X versions" to see consolidated view
2. Choose version with best metadata
3. Use deduplication in post-processing:
   ```bash
   python scripts/format_bibtex.py results.bib \
       --deduplicate \
       --output clean_results.bib
   ```
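
To illustrate what deduplication can involve, the sketch below groups records by a normalized title and keeps the record with the most filled-in fields. It is an assumed approach for illustration only; `format_bibtex.py --deduplicate` may use different rules, and the record layout (dicts with a `title` key) is hypothetical.

```python
import re


def normalize_title(title):
    """Lowercase and collapse punctuation and whitespace for comparison."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def deduplicate(results):
    """Keep one record per normalized title, preferring richer metadata."""
    best = {}
    for record in results:
        key = normalize_title(record.get("title", ""))
        filled = sum(1 for value in record.values() if value)
        if key not in best or filled > best[key][0]:
            best[key] = (filled, record)
    # Dict order follows the first appearance of each title
    return [record for _, record in best.values()]


records = [
    {"title": "Deep Learning.", "year": None},
    {"title": "deep learning", "year": 2015},
    {"title": "Other Paper", "year": 2020},
]
print(deduplicate(records))
```

Title matching alone can merge distinct papers with identical titles, so comparing DOIs first is safer when they are available.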

## Integration with Scripts

### search_google_scholar.py Usage

**Basic search**:
```bash
python scripts/search_google_scholar.py "machine learning drug discovery"
```

**With year filter**:
```bash
python scripts/search_google_scholar.py "CRISPR" \
    --year-start 2020 \
    --year-end 2024 \
    --limit 100
```

**Sort by citations**:
```bash
python scripts/search_google_scholar.py "transformers" \
    --sort-by citations \
    --limit 50
```

**Export to BibTeX**:
```bash
python scripts/search_google_scholar.py "quantum computing" \
    --format bibtex \
    --output quantum.bib
```

**Export to JSON for later processing**:
```bash
python scripts/search_google_scholar.py "topic" \
    --format json \
    --output results.json

# Later: extract full metadata
python scripts/extract_metadata.py \
    --input results.json \
    --output references.bib
```

### Batch Searching

For multiple topics:

```bash
# Create file with search queries (queries.txt)
# One query per line

# Search each query
while IFS= read -r query; do
    [ -z "$query" ] && continue  # Skip blank lines
    python scripts/search_google_scholar.py "$query" \
        --limit 50 \
        --output "${query// /_}.json"
    sleep 10  # Delay between queries
done < queries.txt
```
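
The loop above only replaces spaces in the output filename, so queries containing quotes or colons (for example `author:"Jane Smith"`) produce awkward names. A small sketch of a safer mapping, should the batch be driven from Python instead; `query_to_filename` is a hypothetical helper, not part of the repository's scripts.

```python
import re


def query_to_filename(query, suffix=".json", max_length=80):
    """Turn a search query into a filesystem-friendly filename."""
    # Collapse every run of non-alphanumeric characters into one underscore
    slug = re.sub(r"[^A-Za-z0-9]+", "_", query).strip("_").lower()
    return (slug[:max_length] or "query") + suffix


print(query_to_filename('"machine learning" OR "deep learning"'))
```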

## Summary

Google Scholar is one of the most comprehensive academic search engines, providing:

✓ **Broad coverage**: All disciplines, 100M+ documents
✓ **Free access**: No account or subscription required
✓ **Citation tracking**: "Cited by" for impact analysis
✓ **Multiple formats**: Articles, books, theses, patents
✓ **Full-text search**: Not just abstracts

Key strategies:
- Use advanced operators for precision
- Combine author, title, source searches
- Track citations for impact
- Export systematically to citation manager
- Respect rate limits and access policies
- Verify metadata with CrossRef/PubMed

For biomedical research, complement with PubMed for MeSH terms and curated metadata.