`skills/scholar-evaluation/SKILL.md`

# Scholar Evaluation

## Overview

Apply the ScholarEval framework to systematically evaluate scholarly and research work. This skill provides a structured evaluation methodology based on peer-reviewed research assessment criteria, enabling comprehensive analysis of academic papers, research proposals, literature reviews, and scholarly writing across multiple quality dimensions.

## When to Use This Skill

Use this skill when:

- Evaluating research papers for quality and rigor
- Assessing literature review comprehensiveness and quality
- Reviewing research methodology design
- Scoring data analysis approaches
- Evaluating scholarly writing and presentation
- Providing structured feedback on academic work
- Benchmarking research quality against established criteria
- Assessing publication readiness for target venues
- Providing quantitative evaluation to complement qualitative peer review

## Visual Enhancement with Scientific Schematics

**When creating documents with this skill, always consider adding scientific diagrams and schematics to enhance visual communication.**

If your document does not already contain schematics or diagrams:
- Use the **scientific-schematics** skill to generate AI-powered publication-quality diagrams
- Simply describe your desired diagram in natural language
- Nano Banana Pro will automatically generate, review, and refine the schematic

**For new documents:** Scientific schematics should be generated by default to visually represent key concepts, workflows, architectures, or relationships described in the text.

**How to generate schematics:**
```bash
python scripts/generate_schematic.py "your diagram description" -o figures/output.png
```

The AI will automatically:
- Create publication-quality images with proper formatting
- Review and refine through multiple iterations
- Ensure accessibility (colorblind-friendly, high contrast)
- Save outputs in the figures/ directory

**When to add schematics:**
- Evaluation framework diagrams
- Quality assessment criteria decision trees
- Scholarly workflow visualizations
- Assessment methodology flowcharts
- Scoring rubric visualizations
- Evaluation process diagrams
- Any complex concept that benefits from visualization

For detailed guidance on creating schematics, refer to the scientific-schematics skill documentation.

---

## Evaluation Workflow

### Step 1: Initial Assessment and Scope Definition

Begin by identifying the type of scholarly work being evaluated and the evaluation scope:

**Work Types:**
- Full research paper (empirical, theoretical, or review)
- Research proposal or protocol
- Literature review (systematic, narrative, or scoping)
- Thesis or dissertation chapter
- Conference abstract or short paper

**Evaluation Scope:**
- Comprehensive (all dimensions)
- Targeted (specific aspects like methodology or writing)
- Comparative (benchmarking against other work)

Ask the user to clarify if the scope is ambiguous.

### Step 2: Dimension-Based Evaluation

Systematically evaluate the work across the ScholarEval dimensions. For each applicable dimension, assess quality, identify strengths and weaknesses, and provide scores where appropriate.

Refer to `references/evaluation_framework.md` for detailed criteria and rubrics for each dimension.

**Core Evaluation Dimensions:**

1. **Problem Formulation & Research Questions**
   - Clarity and specificity of research questions
   - Theoretical or practical significance
   - Feasibility and scope appropriateness
   - Novelty and contribution potential

2. **Literature Review**
   - Comprehensiveness of coverage
   - Critical synthesis vs. mere summarization
   - Identification of research gaps
   - Currency and relevance of sources
   - Proper contextualization

3. **Methodology & Research Design**
   - Appropriateness for research questions
   - Rigor and validity
   - Reproducibility and transparency
   - Ethical considerations
   - Limitations acknowledgment

4. **Data Collection & Sources**
   - Quality and appropriateness of data
   - Sample size and representativeness
   - Data collection procedures
   - Source credibility and reliability

5. **Analysis & Interpretation**
   - Appropriateness of analytical methods
   - Rigor of analysis
   - Logical coherence
   - Alternative explanations considered
   - Results-claims alignment

6. **Results & Findings**
   - Clarity of presentation
   - Statistical or qualitative rigor
   - Visualization quality
   - Interpretation accuracy
   - Implications discussion

7. **Scholarly Writing & Presentation**
   - Clarity and organization
   - Academic tone and style
   - Grammar and mechanics
   - Logical flow
   - Accessibility to target audience

8. **Citations & References**
   - Citation completeness
   - Source quality and appropriateness
   - Citation accuracy
   - Balance of perspectives
   - Adherence to citation standards

### Step 3: Scoring and Rating

For each evaluated dimension, provide:

**Qualitative Assessment:**
- Key strengths (2-3 specific points)
- Areas for improvement (2-3 specific points)
- Critical issues (if any)

**Quantitative Scoring (Optional):**
Use a 5-point scale where applicable:
- 5: Excellent - Exemplary quality, publishable in top venues
- 4: Good - Strong quality with minor improvements needed
- 3: Adequate - Acceptable quality with notable areas for improvement
- 2: Needs Improvement - Significant revisions required
- 1: Poor - Fundamental issues requiring major revision

To calculate aggregate scores programmatically, use `scripts/calculate_scores.py`.
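
The internals of `scripts/calculate_scores.py` are not reproduced in this skill, so the following is only a minimal sketch of the aggregation step it describes, assuming dimension names map to 1-5 ratings with optional per-dimension weights; the dimension names and weight values here are illustrative, not the script's actual schema.

```python
# Sketch of weighted aggregation over 1-5 dimension ratings.
# Dimension names and weights are illustrative, not the actual
# schema used by scripts/calculate_scores.py.

def aggregate_score(scores, weights=None):
    """Weighted mean of dimension ratings; unweighted dimensions default to 1.0."""
    if weights is None:
        weights = {}
    total_weight = sum(weights.get(dim, 1.0) for dim in scores)
    weighted_sum = sum(rating * weights.get(dim, 1.0) for dim, rating in scores.items())
    return weighted_sum / total_weight

scores = {"problem_formulation": 4, "literature_review": 3, "methodology": 5}
weights = {"methodology": 2.0}  # emphasize methodological rigor
print(round(aggregate_score(scores, weights), 2))  # 4.25
```

Weighting lets a targeted evaluation (Step 1) emphasize the dimensions in scope without discarding the others.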

### Step 4: Synthesize Overall Assessment

Provide an integrated evaluation summary:

1. **Overall Quality Assessment** - Holistic judgment of the work's scholarly merit
2. **Major Strengths** - 3-5 key strengths across dimensions
3. **Critical Weaknesses** - 3-5 primary areas requiring attention
4. **Priority Recommendations** - Ranked list of improvements by impact
5. **Publication Readiness** (if applicable) - Assessment of suitability for target venues

### Step 5: Provide Actionable Feedback

Transform evaluation findings into constructive, actionable feedback:

**Feedback Structure:**
- **Specific** - Reference exact sections, paragraphs, or page numbers
- **Actionable** - Provide concrete suggestions for improvement
- **Prioritized** - Rank recommendations by importance and feasibility
- **Balanced** - Acknowledge strengths while addressing weaknesses
- **Evidence-based** - Ground feedback in evaluation criteria

**Feedback Format Options:**
- Structured report with dimension-by-dimension analysis
- Annotated comments mapped to specific document sections
- Executive summary with key findings and recommendations
- Comparative analysis against benchmark standards

### Step 6: Contextual Considerations

Adjust the evaluation approach based on:

**Stage of Development:**
- Early draft: Focus on conceptual and structural issues
- Advanced draft: Focus on refinement and polish
- Final submission: Comprehensive quality check

**Purpose and Venue:**
- Journal article: High standards for rigor and contribution
- Conference paper: Balance novelty with presentation clarity
- Student work: Educational feedback with developmental focus
- Grant proposal: Emphasis on feasibility and impact

**Discipline-Specific Norms:**
- STEM fields: Emphasis on reproducibility and statistical rigor
- Social sciences: Balance quantitative and qualitative standards
- Humanities: Focus on argumentation and scholarly interpretation

## Resources

### references/evaluation_framework.md

Detailed evaluation criteria, rubrics, and quality indicators for each ScholarEval dimension. Load this reference when conducting evaluations to access specific assessment guidelines and scoring rubrics.

Search patterns for quick access:
- "Problem Formulation criteria"
- "Literature Review rubric"
- "Methodology assessment"
- "Data quality indicators"
- "Analysis rigor standards"
- "Writing quality checklist"

### scripts/calculate_scores.py

Python script for calculating aggregate evaluation scores from dimension-level ratings. Supports weighted averaging, threshold analysis, and score visualization.

Usage:
```bash
python scripts/calculate_scores.py --scores <dimension_scores.json> --output <report.txt>
```
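
The expected shape of `dimension_scores.json` is not specified in this skill; one plausible layout, assuming the script accepts per-dimension 1-5 ratings with optional weights, might look like the following (all field names are hypothetical):

```json
{
  "scores": {
    "problem_formulation": 4,
    "literature_review": 3,
    "methodology": 5
  },
  "weights": {
    "methodology": 2.0
  }
}
```

Check the script's `--help` output or source for the actual schema before relying on this layout.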

## Best Practices

1. **Maintain Objectivity** - Base evaluations on established criteria, not personal preferences
2. **Be Comprehensive** - Evaluate all applicable dimensions systematically
3. **Provide Evidence** - Support assessments with specific examples from the work
4. **Stay Constructive** - Frame weaknesses as opportunities for improvement
5. **Consider Context** - Adjust expectations based on work stage and purpose
6. **Document Rationale** - Explain the reasoning behind assessments and scores
7. **Encourage Strengths** - Explicitly acknowledge what the work does well
8. **Prioritize Feedback** - Focus on high-impact improvements first

## Example Evaluation Workflow

**User Request:** "Evaluate this research paper on machine learning for drug discovery"

**Response Process:**
1. Identify work type (empirical research paper) and scope (comprehensive evaluation)
2. Load `references/evaluation_framework.md` for detailed criteria
3. Systematically assess each dimension:
   - Problem formulation: Clear research question about ML model performance
   - Literature review: Comprehensive coverage of recent ML and drug discovery work
   - Methodology: Appropriate deep learning architecture with validation procedures
   - [Continue through all dimensions...]
4. Calculate dimension scores and overall assessment
5. Synthesize findings into a structured report highlighting:
   - Strong methodology and reproducible code
   - Need for more diverse dataset evaluation
   - Clarity issues in the results section
6. Provide prioritized recommendations with specific suggestions

## Integration with Scientific Writer

This skill integrates seamlessly with the scientific writer workflow:

**After Paper Generation:**
- Use Scholar Evaluation as an alternative or complement to peer review
- Generate `SCHOLAR_EVALUATION.md` alongside `PEER_REVIEW.md`
- Provide quantitative scores to track improvement across revisions

**During Revision:**
- Re-evaluate specific dimensions after addressing feedback
- Track score improvements over multiple versions
- Identify persistent weaknesses requiring attention

**Publication Preparation:**
- Assess readiness for the target journal/conference
- Identify gaps before submission
- Benchmark against publication standards

## Notes

- Evaluation rigor should match the work's purpose and stage
- Some dimensions may not apply to all work types (e.g., data collection for purely theoretical papers)
- Cultural and disciplinary differences in scholarly norms should be considered
- This framework complements, rather than replaces, domain-specific expertise
- Use in combination with the peer-review skill for comprehensive assessment

## Citation

This skill is based on the ScholarEval framework introduced in:

**Moussa, H. N., Da Silva, P. Q., Adu-Ampratwum, D., East, A., Lu, Z., Puccetti, N., Xue, M., Sun, H., Majumder, B. P., & Kumar, S. (2025).** _ScholarEval: Research Idea Evaluation Grounded in Literature_. arXiv preprint arXiv:2510.16234. [https://arxiv.org/abs/2510.16234](https://arxiv.org/abs/2510.16234)

**Abstract:** ScholarEval is a retrieval-augmented evaluation framework that assesses research ideas based on two fundamental criteria: soundness (the empirical validity of proposed methods based on existing literature) and contribution (the degree of advancement made by the idea across different dimensions relative to prior research). The framework achieves significantly higher coverage of expert-annotated evaluation points and is consistently preferred over baseline systems in terms of evaluation actionability, depth, and evidence support.

---

`skills/scholar-evaluation/references/evaluation_framework.md`

# ScholarEval Evaluation Framework

## Overview

This document provides detailed evaluation criteria, rubrics, and quality indicators for each dimension of the ScholarEval framework. Use these standards when conducting systematic evaluations of scholarly work.

---

## Dimension 1: Problem Formulation & Research Questions

### Quality Indicators

**Excellent (5):**
- Research question is specific, measurable, and clearly articulated
- Problem addresses a significant gap in the literature with high impact potential
- Scope is appropriate and feasible within constraints
- Novel contribution is clearly differentiated from existing work
- Theoretical or practical significance is compellingly justified

**Good (4):**
- Research question is clear with minor ambiguities
- Problem is relevant with moderate impact potential
- Scope is generally appropriate with minor feasibility concerns
- Contribution is identifiable though not groundbreaking
- Significance is adequately justified

**Adequate (3):**
- Research question is present but lacks specificity
- Problem relevance is unclear or incremental
- Scope may be too broad or narrow
- Contribution is unclear or overlaps heavily with existing work
- Significance justification is weak

**Needs Improvement (2):**
- Research question is vague or poorly defined
- Problem lacks clear relevance or significance
- Scope is inappropriate or infeasible
- Contribution is not articulated
- No clear justification for significance

**Poor (1):**
- No clear research question
- Problem is trivial or irrelevant
- Scope is fundamentally flawed
- No identifiable contribution
- No significance justification

### Assessment Checklist

- [ ] Is the research question clearly stated?
- [ ] Can the question be answered with the proposed approach?
- [ ] Is the problem significant to the field?
- [ ] Is the scope feasible within resource constraints?
- [ ] Is the novelty/contribution clearly articulated?
- [ ] Are key assumptions explicitly stated?
- [ ] Are success criteria or expected outcomes defined?

---

## Dimension 2: Literature Review

### Quality Indicators

**Excellent (5):**
- Comprehensive coverage of relevant literature across key areas
- Critical synthesis identifying patterns, contradictions, and gaps
- Literature is current (majority from last 3-5 years for rapidly evolving fields)
- Sources are authoritative and peer-reviewed
- Clear positioning of current work within the scholarly conversation
- Identifies genuine research gaps that the work addresses

**Good (4):**
- Good coverage with minor gaps in key areas
- Mostly synthesis with some description
- Literature is mostly current with some older foundational works
- Sources are generally authoritative
- Work positioning is present but could be stronger
- Research gaps are identified but may not be critical

**Adequate (3):**
- Partial coverage with notable gaps
- More descriptive summarization than synthesis
- Literature is a mix of current and dated sources
- Mix of authoritative and less rigorous sources
- Weak positioning within existing literature
- Research gaps are vague or questionable

**Needs Improvement (2):**
- Minimal coverage with major gaps
- Purely descriptive without synthesis
- Literature is largely outdated
- Sources lack authority or rigor
- Little to no positioning of current work
- No clear research gaps identified

**Poor (1):**
- Inadequate or absent literature review
- No synthesis
- Outdated or inappropriate sources
- No engagement with the scholarly conversation
- No gap identification

### Assessment Checklist

- [ ] Does the review cover all major relevant areas?
- [ ] Is literature synthesized rather than just summarized?
- [ ] Are sources current and authoritative?
- [ ] Are contrasting viewpoints presented?
- [ ] Are research gaps clearly identified?
- [ ] Is the current work positioned within existing literature?
- [ ] Is citation balance appropriate (not over-relying on a few authors)?
- [ ] Are seminal/foundational works included?

### Common Issues

- **Insufficient coverage**: Missing key papers or research streams
- **Descriptive listing**: Summarizing papers sequentially without synthesis
- **Outdated sources**: Relying on literature more than 5-10 years old
- **Cherry-picking**: Only citing work that supports the hypothesis
- **Poor organization**: Lack of thematic or conceptual structure
- **Weak gap identification**: Gaps are trivial or not actually gaps

---

## Dimension 3: Methodology & Research Design

### Quality Indicators

**Excellent (5):**
- Research design perfectly aligned with research questions
- Methods are rigorous, valid, and reliable
- Procedures are detailed enough for replication
- Controls, randomization, or triangulation are appropriate
- Potential biases acknowledged and mitigated
- Ethical considerations addressed comprehensively
- Limitations are explicitly discussed

**Good (4):**
- Design is appropriate with minor alignment issues
- Methods are sound with small validity concerns
- Procedures are mostly replicable
- Some controls or validation present
- Major biases addressed
- Ethical considerations mentioned
- Some limitations discussed

**Adequate (3):**
- Design partially appropriate for questions
- Methods have notable validity concerns
- Procedures lack detail for full replication
- Limited controls or validation
- Bias mitigation is minimal
- Ethics addressed superficially
- Limitations minimally discussed

**Needs Improvement (2):**
- Design poorly aligned with research questions
- Methods have serious validity issues
- Procedures too vague to replicate
- No controls or validation
- Biases not addressed
- Ethical concerns not addressed
- No limitation discussion

**Poor (1):**
- Inappropriate or absent methodology
- Methods fundamentally flawed
- Not replicable
- No validity considerations
- No ethical considerations
- No acknowledgment of limitations

### Assessment Checklist

- [ ] Is the methodology appropriate for the research questions?
- [ ] Are procedures described in sufficient detail?
- [ ] Can the study be replicated from the description?
- [ ] Are validity and reliability addressed?
- [ ] Are potential biases identified and mitigated?
- [ ] Are ethical considerations discussed?
- [ ] Are limitations acknowledged?
- [ ] Is sample size justified (for quantitative work)?
- [ ] Are qualitative methods rigorous (if applicable)?

### Design-Specific Considerations

**Quantitative Studies:**
- Sample size with power analysis
- Control groups and randomization
- Measurement validity and reliability
- Statistical assumptions checking

**Qualitative Studies:**
- Sampling strategy and saturation
- Data collection procedures
- Coding and analysis framework
- Trustworthiness criteria (credibility, transferability, etc.)

**Mixed Methods:**
- Integration rationale
- Sequencing justification
- Data convergence strategy

---

## Dimension 4: Data Collection & Sources

### Quality Indicators

**Excellent (5):**
- Data sources are highly credible and appropriate
- Sample size is sufficient and well-justified
- Data collection procedures are rigorous and systematic
- Data quality controls are in place
- Sampling strategy ensures representativeness
- Missing data is minimal and handled appropriately

**Good (4):**
- Data sources are credible with minor concerns
- Sample size is adequate
- Collection procedures are systematic
- Some quality controls present
- Sampling is reasonable
- Missing data is addressed

**Adequate (3):**
- Data sources are acceptable but not optimal
- Sample size is marginal
- Collection procedures lack some rigor
- Limited quality controls
- Sampling may have bias concerns
- Missing data handling is basic

**Needs Improvement (2):**
- Data sources have credibility issues
- Sample size is insufficient
- Collection procedures are ad hoc
- No quality controls
- Sampling is clearly biased
- Missing data not addressed

**Poor (1):**
- Data sources are inappropriate or unreliable
- Sample size is inadequate
- Collection is unsystematic
- No quality considerations
- Sampling is fundamentally flawed
- Excessive missing data

### Assessment Checklist

- [ ] Are data sources credible and appropriate?
- [ ] Is sample size sufficient for the conclusions?
- [ ] Is the sampling strategy clearly described?
- [ ] Is the sample representative of the target population?
- [ ] Are data collection procedures systematic?
- [ ] Are data quality controls described?
- [ ] Is missing data addressed?
- [ ] Are potential data biases discussed?

---

## Dimension 5: Analysis & Interpretation

### Quality Indicators

**Excellent (5):**
- Analytical methods perfectly suited to data and questions
- Analysis is rigorous with appropriate techniques
- Results interpretation is logical and well-supported
- Alternative explanations are considered
- Claims are proportionate to evidence
- Assumptions are validated
- Analysis is transparent and reproducible

**Good (4):**
- Methods are appropriate with minor issues
- Analysis is sound
- Interpretation is mostly logical
- Some alternatives considered
- Claims generally match evidence
- Key assumptions checked
- Analysis is mostly transparent

**Adequate (3):**
- Methods are acceptable but not optimal
- Analysis has some technical issues
- Interpretation has logical gaps
- Alternatives not thoroughly explored
- Some claims exceed evidence
- Assumptions not fully validated
- Analysis transparency is limited

**Needs Improvement (2):**
- Methods are questionable for data/questions
- Analysis has significant technical flaws
- Interpretation is poorly supported
- No alternative explanations
- Claims significantly exceed evidence
- Assumptions not checked
- Analysis is not transparent

**Poor (1):**
- Methods are inappropriate
- Analysis is fundamentally flawed
- Interpretation is illogical
- No consideration of alternatives
- Claims unsupported by evidence
- No assumption validation
- Analysis is opaque

### Assessment Checklist

- [ ] Are analytical methods appropriate?
- [ ] Are statistical tests/qualitative methods properly applied?
- [ ] Are assumptions tested?
- [ ] Is interpretation logical and well-supported?
- [ ] Are alternative explanations considered?
- [ ] Do claims align with evidence strength?
- [ ] Is analysis reproducible from the description?
- [ ] Are uncertainties acknowledged?

### Quantitative Analysis

- Appropriate statistical tests
- Assumptions checked (normality, homogeneity, etc.)
- Effect sizes reported
- Confidence intervals provided
- Multiple testing corrections (if applicable)
- Model diagnostics performed
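
To make the effect-size and confidence-interval items above concrete, here is a self-contained sketch using standard formulas (Cohen's d with a pooled standard deviation, and a normal-approximation 95% CI for the mean difference); the sample data is illustrative only, not taken from any paper under review.

```python
import statistics

def cohens_d(group_a, group_b):
    """Cohen's d for two independent samples, using the pooled standard deviation."""
    na, nb = len(group_a), len(group_b)
    va, vb = statistics.variance(group_a), statistics.variance(group_b)
    pooled_sd = (((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)) ** 0.5
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd

def mean_diff_ci95(group_a, group_b):
    """Normal-approximation 95% CI for the difference in means (z = 1.96)."""
    na, nb = len(group_a), len(group_b)
    se = (statistics.variance(group_a) / na + statistics.variance(group_b) / nb) ** 0.5
    diff = statistics.mean(group_a) - statistics.mean(group_b)
    return diff - 1.96 * se, diff + 1.96 * se

a = [5.1, 4.9, 5.4, 5.0, 5.2]  # illustrative measurements, group A
b = [4.6, 4.4, 4.8, 4.5, 4.7]  # illustrative measurements, group B
print(f"d = {cohens_d(a, b):.2f}")
lo, hi = mean_diff_ci95(a, b)
print(f"95% CI for mean difference: [{lo:.2f}, {hi:.2f}]")
```

A paper that reports only p-values can still be checked against this standard: if the summary statistics are given, the effect size and interval are recoverable.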

### Qualitative Analysis

- Coding framework is clear
- Inter-rater reliability (if applicable)
- Saturation discussed
- Negative cases examined
- Member checking or validation
- Clear audit trail

---

## Dimension 6: Results & Findings

### Quality Indicators

**Excellent (5):**
- Results are clearly and comprehensively presented
- Visualizations are effective and appropriate
- Statistical or qualitative rigor is evident
- Key findings are highlighted effectively
- Results directly address research questions
- Patterns and relationships are clearly shown
- Negative and null results are reported

**Good (4):**
- Results are clear with minor presentation issues
- Visualizations are generally effective
- Rigor is present
- Main findings are identifiable
- Results mostly address questions
- Patterns are shown
- Some negative results included

**Adequate (3):**
- Results presentation is adequate but could be clearer
- Visualizations are basic or have issues
- Rigor is questionable in places
- Findings are present but not emphasized
- Partial alignment with questions
- Patterns are unclear
- Negative results may be omitted

**Needs Improvement (2):**
- Results presentation is unclear or confusing
- Visualizations are poor or misleading
- Lack of rigor
- Findings are difficult to identify
- Weak alignment with questions
- No clear patterns
- Only positive results shown

**Poor (1):**
- Results are poorly presented or absent
- Visualizations are inappropriate or missing
- No evidence of rigor
- Findings are unclear
- Results don't address questions
- No identifiable patterns
- Results appear selective

### Assessment Checklist

- [ ] Are results clearly presented?
- [ ] Do results directly address research questions?
- [ ] Are visualizations appropriate and effective?
- [ ] Are key findings highlighted?
- [ ] Are negative/null results reported?
- [ ] Is appropriate precision reported (p-values, CIs, effect sizes)?
- [ ] Are qualitative findings supported by data excerpts?
- [ ] Is there evidence of selective reporting?

### Presentation Quality

**Tables:**
- Clear labels and captions
- Appropriate precision
- Organized logically
- Not overly complex

**Figures:**
- Clear axes and legends
- Appropriate chart type
- Professional appearance
- Accessible (color-blind friendly)

**Text:**
- Highlights key findings
- Avoids redundancy with tables/figures
- Uses appropriate statistical language

---

## Dimension 7: Scholarly Writing & Presentation

### Quality Indicators

**Excellent (5):**
- Writing is clear, concise, and precise
- Organization is logical with excellent flow
- Academic tone is appropriate and consistent
- Grammar and mechanics are flawless
- Technical terms are used correctly
- Accessible to target audience
- Abstract/summary is comprehensive and accurate

**Good (4):**
- Writing is clear with minor awkwardness
- Organization is logical with good flow
- Tone is mostly appropriate
- Few grammar/mechanical errors
- Technical terms mostly correct
- Generally accessible
- Abstract is adequate

**Adequate (3):**
- Writing is understandable but has clarity issues
- Organization has some logical gaps
- Tone inconsistencies
- Noticeable grammar/mechanical errors
- Some technical term misuse
- Accessibility issues for target audience
- Abstract is incomplete or vague

**Needs Improvement (2):**
- Writing is often unclear or verbose
- Poor organization and flow
- Tone is frequently inappropriate
- Frequent grammar/mechanical errors
- Technical terminology problems
- Not accessible to target audience
- Abstract is poor or missing

**Poor (1):**
- Writing is unclear and difficult to follow
- No clear organization
- Tone is inappropriate throughout
- Pervasive grammar/mechanical errors
- Incorrect technical terminology
- Inaccessible
- No adequate abstract

### Assessment Checklist

- [ ] Is writing clear and concise?
- [ ] Is organization logical?
- [ ] Is tone appropriate for academic writing?
- [ ] Are grammar and mechanics correct?
- [ ] Are technical terms used appropriately?
- [ ] Is jargon explained when necessary?
- [ ] Does the abstract accurately summarize the work?
- [ ] Are transitions between sections smooth?
- [ ] Is the target audience clear?

### Common Writing Issues

- **Wordiness**: Unnecessarily complex or lengthy prose
- **Passive voice overuse**: Reduces clarity and directness
- **Paragraph structure**: Lack of topic sentences or coherence
- **Redundancy**: Repeating information unnecessarily
- **Logical flow**: Poor transitions between ideas
- **Precision**: Vague or ambiguous language
- **Accessibility**: Too technical or not technical enough

---
## Dimension 8: Citations & References
|
||||
|
||||
### Quality Indicators

**Excellent (5):**
- All claims are appropriately cited
- Sources are authoritative and current
- Citations are accurate and complete
- Diverse perspectives are represented
- Citation format is consistent and correct
- Balance between self-citation and citation of others
- Primary sources used appropriately

**Good (4):**
- Most claims are cited
- Sources are generally authoritative
- Few citation errors
- Reasonable diversity of sources
- Format is mostly consistent
- Citation balance is good
- Mix of primary and secondary sources

**Adequate (3):**
- Some claims lack citations
- Source quality is mixed
- Several citation errors
- Limited source diversity
- Format inconsistencies
- Citation balance issues
- Over-reliance on secondary sources

**Needs Improvement (2):**
- Many claims are uncited
- Sources are questionable
- Numerous citation errors
- Narrow source base
- Format is inconsistent
- Excessive self-citation or narrow citing
- Inappropriate sources (e.g., only secondary)

**Poor (1):**
- Citations are inadequate
- Sources are unreliable
- Pervasive citation errors
- Minimal source diversity
- No consistent format
- Severe citation imbalance
- Inappropriate source types
### Assessment Checklist

- [ ] Are all factual claims cited?
- [ ] Do citations point to primary sources when appropriate?
- [ ] Are sources authoritative and peer-reviewed?
- [ ] Is there balance in the perspectives cited?
- [ ] Are citations accurate (authors, dates, pages)?
- [ ] Is citation format consistent?
- [ ] Are self-citations appropriate (typically <20%)?
- [ ] Are sources current (for time-sensitive topics)?
- [ ] Are classic/seminal works included where relevant?
### Citation Quality Assessment

**Source Types (in order of preference for most academic work):**

1. Peer-reviewed journal articles
2. Academic books from reputable publishers
3. Conference proceedings (field-dependent)
4. Technical reports from reputable institutions
5. Dissertations/theses
6. Preprints (with caution, field-dependent)
7. Grey literature (limited use)
8. Websites (rarely appropriate, except for factual data)
**Red Flags:**

- Wikipedia as a primary source
- Excessive self-citation (>30%)
- Citing only papers that support the hypothesis
- Outdated sources when current ones exist
- Missing key papers in the field
- Citing abstracts when full papers are available
- Inconsistent or incorrect citation format
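Some of these red flags can be checked mechanically. Below is a minimal sketch of a self-citation check, assuming references have already been parsed into (author surnames, year) pairs; real bibliography parsing (e.g., from BibTeX) is out of scope, and `self_citation_ratio` is a hypothetical helper, not part of this skill's scripts:

```python
# Hypothetical helper: flag excessive self-citation in a parsed reference list.
# `references` is a list of (author_surnames, year) tuples; the 30% threshold
# mirrors the red-flag rubric above.

def self_citation_ratio(references, author_surname):
    """Return the fraction of cited works that include the evaluated author."""
    if not references:
        return 0.0
    own = sum(
        1 for authors, _year in references
        if author_surname.lower() in (a.lower() for a in authors)
    )
    return own / len(references)

refs = [
    (["Smith", "Jones"], 2021),
    (["Lee"], 2019),
    (["Smith"], 2020),
    (["Chen", "Park"], 2022),
]
ratio = self_citation_ratio(refs, "Smith")
print(f"Self-citation: {ratio:.0%}")        # Self-citation: 50%
print("Red flag" if ratio > 0.30 else "OK")  # Red flag
```

A check like this supplements, but does not replace, reading the reference list: it cannot tell a legitimately self-referential research program from citation padding.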
---

## Cross-Cutting Considerations

### Reproducibility

Assess across dimensions:
- Are methods detailed enough to replicate?
- Are data and code available (or is their unavailability explained)?
- Are analysis steps transparent?
- Are materials/instruments specified?

### Ethics

Consider:
- IRB approval (for human subjects research)
- Informed consent
- Privacy and confidentiality
- Conflicts of interest
- Research integrity
- Data sharing ethics

### Bias and Limitations

Evaluate whether:
- Potential biases are acknowledged
- Limitations are discussed honestly
- Boundary conditions are specified
- Generalizability is appropriately claimed

### Impact and Significance

Consider:
- Theoretical contribution
- Practical implications
- Policy relevance
- Methodological innovation
- Field advancement
---

## Scoring Guidelines

### Dimension Weighting (Suggested, Adjust by Context)

- Problem Formulation: 15%
- Literature Review: 15%
- Methodology: 20%
- Data Collection: 10%
- Analysis: 15%
- Results: 10%
- Writing: 10%
- Citations: 5%
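The weighted overall score is simply the sum of each dimension score times its weight. A minimal sketch, using the suggested default weights and hypothetical dimension scores:

```python
# Suggested default weights (sum to 1.0) and hypothetical example scores (1-5).
weights = {
    "problem_formulation": 0.15, "literature_review": 0.15,
    "methodology": 0.20, "data_collection": 0.10,
    "analysis": 0.15, "results": 0.10,
    "writing": 0.10, "citations": 0.05,
}
scores = {
    "problem_formulation": 4.5, "literature_review": 4.0,
    "methodology": 3.5, "data_collection": 4.0,
    "analysis": 3.5, "results": 4.0,
    "writing": 4.5, "citations": 4.0,
}

# Weighted sum; with full coverage the result already lies on the 1-5 scale.
overall = sum(scores[d] * w for d, w in weights.items())
print(f"Overall: {overall:.2f} / 5.00")  # Overall: 3.95 / 5.00
```

If only a subset of dimensions is scored, divide the weighted sum by the sum of the weights actually used so the result stays on the 1-5 scale.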
### Overall Assessment Thresholds

- **Exceptional (4.5-5.0)**: Ready for top-tier publication
- **Strong (4.0-4.4)**: Publication-ready with minor revisions
- **Good (3.5-3.9)**: Major revisions required, promising work
- **Acceptable (3.0-3.4)**: Significant revisions needed
- **Weak (2.0-2.9)**: Fundamental issues, major rework required
- **Poor (<2.0)**: Not suitable for publication without complete revision
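For automation, these thresholds reduce to a first-match lookup from the highest band down. A minimal sketch (one reasonable reading: scores falling in the small gaps between bands, e.g. 3.95, map to the band below):

```python
# Assessment bands from the thresholds above, highest lower bound first.
THRESHOLDS = [
    (4.5, "Exceptional"), (4.0, "Strong"), (3.5, "Good"),
    (3.0, "Acceptable"), (2.0, "Weak"), (0.0, "Poor"),
]

def assessment_band(score):
    """Map an overall 0-5 score to its assessment label."""
    return next(label for low, label in THRESHOLDS if score >= low)

print(assessment_band(4.2))   # Strong
print(assessment_band(3.95))  # Good
print(assessment_band(1.5))   # Poor
```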
### Contextual Adjustments

Adjust standards based on:
- **Stage**: Proposal < Draft < Final submission
- **Venue**: Student thesis < Conference < Journal < Top-tier journal
- **Type**: Theoretical < Empirical < Meta-analysis
- **Field**: Standards vary by discipline
- **Purpose**: Educational < Professional < Publication
---

## Using This Framework

1. **Read the work thoroughly** before beginning evaluation
2. **Score each dimension** using the 5-point scale
3. **Document evidence** for each score with specific examples
4. **Consider context** and adjust expectations appropriately
5. **Synthesize findings** across dimensions
6. **Provide actionable feedback** prioritized by impact
7. **Balance criticism with recognition** of strengths

This framework is a guide, not a rigid checklist. Professional judgment should always be applied in context.
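To pair the qualitative review with a quantitative summary, dimension scores can be recorded in the JSON format consumed by the bundled `scripts/calculate_scores.py` (the scores shown are hypothetical):

```json
{
  "problem_formulation": 4.5,
  "literature_review": 4.0,
  "methodology": 3.5,
  "data_collection": 4.0,
  "analysis": 3.5,
  "results": 4.0,
  "writing": 4.5,
  "citations": 4.0
}
```

Running `python scripts/calculate_scores.py --scores my_scores.json` then produces the weighted overall score, quality level, and a per-dimension breakdown.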
378
skills/scholar-evaluation/scripts/calculate_scores.py
Normal file
@@ -0,0 +1,378 @@
#!/usr/bin/env python3
"""
ScholarEval Score Calculator

Calculate aggregate evaluation scores from dimension-level ratings.
Supports weighted averaging, threshold analysis, and score visualization.

Usage:
    python calculate_scores.py --scores <dimension_scores.json> --output <report.txt>
    python calculate_scores.py --scores <dimension_scores.json> --weights <weights.json>
    python calculate_scores.py --interactive

Author: ScholarEval Framework
License: MIT
"""

import json
import argparse
import sys
from typing import Dict, Optional
from pathlib import Path


# Default dimension weights (sum to 1.0)
DEFAULT_WEIGHTS = {
    "problem_formulation": 0.15,
    "literature_review": 0.15,
    "methodology": 0.20,
    "data_collection": 0.10,
    "analysis": 0.15,
    "results": 0.10,
    "writing": 0.10,
    "citations": 0.05
}

# Quality level definitions: (low, high) score band -> (level, description)
QUALITY_LEVELS = {
    (4.5, 5.0): ("Exceptional", "Ready for top-tier publication"),
    (4.0, 4.4): ("Strong", "Publication-ready with minor revisions"),
    (3.5, 3.9): ("Good", "Major revisions required, promising work"),
    (3.0, 3.4): ("Acceptable", "Significant revisions needed"),
    (2.0, 2.9): ("Weak", "Fundamental issues, major rework required"),
    (0.0, 1.9): ("Poor", "Not suitable without complete revision")
}

def load_scores(filepath: Path) -> Dict[str, float]:
    """Load dimension scores from a JSON file."""
    try:
        with open(filepath, 'r') as f:
            scores = json.load(f)

        # Validate scores
        for dim, score in scores.items():
            if not 1 <= score <= 5:
                raise ValueError(f"Score for {dim} must be between 1 and 5, got {score}")

        return scores
    except FileNotFoundError:
        print(f"Error: File not found: {filepath}")
        sys.exit(1)
    except json.JSONDecodeError:
        print(f"Error: Invalid JSON in {filepath}")
        sys.exit(1)
    except ValueError as e:
        print(f"Error: {e}")
        sys.exit(1)

def load_weights(filepath: Optional[Path] = None) -> Dict[str, float]:
    """Load dimension weights from a JSON file, or return the defaults."""
    if filepath is None:
        return DEFAULT_WEIGHTS

    try:
        with open(filepath, 'r') as f:
            weights = json.load(f)

        # Validate that weights sum to 1.0
        total = sum(weights.values())
        if not 0.99 <= total <= 1.01:  # Allow small floating point errors
            raise ValueError(f"Weights must sum to 1.0, got {total}")

        return weights
    except FileNotFoundError:
        print(f"Error: File not found: {filepath}")
        sys.exit(1)
    except json.JSONDecodeError:
        print(f"Error: Invalid JSON in {filepath}")
        sys.exit(1)
    except ValueError as e:
        print(f"Error: {e}")
        sys.exit(1)

def calculate_weighted_average(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Calculate the weighted average score, renormalizing over scored dimensions."""
    total_score = 0.0
    total_weight = 0.0

    for dimension, score in scores.items():
        # Handle dimension name variations (e.g., "problem_formulation" vs "problem-formulation")
        dim_key = dimension.replace('-', '_').lower()
        weight = weights.get(dim_key, 0.0)

        total_score += score * weight
        total_weight += weight

    # Renormalize so partial evaluations still land on the 1-5 scale
    if total_weight > 0:
        return total_score / total_weight
    return 0.0

def get_quality_level(score: float) -> tuple:
    """Get the quality level description for a given score."""
    # Check bands from highest to lowest so scores falling between band
    # boundaries (e.g. 4.45) map to the band below rather than "Unknown"
    for (low, _high), (level, description) in sorted(QUALITY_LEVELS.items(), reverse=True):
        if score >= low:
            return level, description
    return "Unknown", "Score out of expected range"

def generate_bar_chart(scores: Dict[str, float], max_width: int = 50) -> str:
    """Generate an ASCII bar chart of dimension scores."""
    lines = []
    max_name_len = max(len(name) for name in scores.keys())

    for dimension, score in sorted(scores.items(), key=lambda x: x[1], reverse=True):
        bar_length = int((score / 5.0) * max_width)
        bar = '█' * bar_length
        padding = ' ' * (max_name_len - len(dimension))
        lines.append(f"  {dimension}{padding} │ {bar} {score:.2f}")

    return '\n'.join(lines)


def identify_strengths_weaknesses(scores: Dict[str, float]) -> tuple:
    """Identify top strengths and areas for improvement."""
    sorted_scores = sorted(scores.items(), key=lambda x: x[1], reverse=True)

    strengths = [dim for dim, score in sorted_scores[:3] if score >= 4.0]
    weaknesses = [dim for dim, score in sorted_scores[-3:] if score < 3.5]

    return strengths, weaknesses

def generate_report(scores: Dict[str, float], weights: Dict[str, float],
                    output_file: Optional[Path] = None) -> str:
    """Generate a comprehensive evaluation report."""
    overall_score = calculate_weighted_average(scores, weights)
    quality_level, quality_desc = get_quality_level(overall_score)
    strengths, weaknesses = identify_strengths_weaknesses(scores)

    report_lines = [
        "="*70,
        "SCHOLAREVAL SCORE REPORT",
        "="*70,
        "",
        f"Overall Score: {overall_score:.2f} / 5.00",
        f"Quality Level: {quality_level}",
        f"Assessment: {quality_desc}",
        "",
        "="*70,
        "DIMENSION SCORES",
        "="*70,
        "",
        generate_bar_chart(scores),
        "",
        "="*70,
        "DETAILED BREAKDOWN",
        "="*70,
        ""
    ]

    # Add detailed scores with weights
    for dimension, score in sorted(scores.items()):
        dim_key = dimension.replace('-', '_').lower()
        weight = weights.get(dim_key, 0.0)
        weighted_contribution = score * weight
        percentage = weight * 100

        report_lines.append(
            f"  {dimension:25s} {score:.2f}/5.00 "
            f"(weight: {percentage:4.1f}%, contribution: {weighted_contribution:.3f})"
        )

    report_lines.extend([
        "",
        "="*70,
        "ASSESSMENT SUMMARY",
        "="*70,
        ""
    ])

    if strengths:
        report_lines.append("Top Strengths:")
        for dim in strengths:
            report_lines.append(f"  • {dim}: {scores[dim]:.2f}/5.00")
        report_lines.append("")

    if weaknesses:
        report_lines.append("Areas for Improvement:")
        for dim in weaknesses:
            report_lines.append(f"  • {dim}: {scores[dim]:.2f}/5.00")
        report_lines.append("")

    # Add recommendations based on score
    report_lines.extend([
        "="*70,
        "RECOMMENDATIONS",
        "="*70,
        ""
    ])

    if overall_score >= 4.5:
        report_lines.append("  Excellent work! Ready for submission to top-tier venues.")
    elif overall_score >= 4.0:
        report_lines.append("  Strong work. Address minor issues identified in weaknesses.")
    elif overall_score >= 3.5:
        report_lines.append("  Good foundation. Focus on major revisions in weak dimensions.")
    elif overall_score >= 3.0:
        report_lines.append("  Significant revisions needed. Prioritize weakest dimensions.")
    elif overall_score >= 2.0:
        report_lines.append("  Major rework required. Consider restructuring approach.")
    else:
        report_lines.append("  Fundamental revision needed across multiple dimensions.")

    report_lines.append("")
    report_lines.append("="*70)

    report = '\n'.join(report_lines)

    # Write to file if specified
    if output_file:
        try:
            with open(output_file, 'w') as f:
                f.write(report)
            print(f"\nReport saved to: {output_file}")
        except IOError as e:
            print(f"Error writing to {output_file}: {e}")

    return report

def interactive_mode():
    """Run interactive score entry mode."""
    print("ScholarEval Interactive Score Calculator")
    print("="*50)
    print("\nEnter scores for each dimension (1-5):")
    print("(Press Enter to skip a dimension)\n")

    scores = {}
    dimensions = [
        "problem_formulation",
        "literature_review",
        "methodology",
        "data_collection",
        "analysis",
        "results",
        "writing",
        "citations"
    ]

    for dim in dimensions:
        while True:
            dim_display = dim.replace('_', ' ').title()
            user_input = input(f"{dim_display}: ").strip()

            if not user_input:
                break

            try:
                score = float(user_input)
                if 1 <= score <= 5:
                    scores[dim] = score
                    break
                else:
                    print("  Score must be between 1 and 5")
            except ValueError:
                print("  Invalid input. Please enter a number between 1 and 5")

    if not scores:
        print("\nNo scores entered. Exiting.")
        return

    print("\n" + "="*50)
    print("SCORES ENTERED:")
    for dim, score in scores.items():
        print(f"  {dim.replace('_', ' ').title()}: {score}")

    print("\nCalculating overall assessment...\n")

    report = generate_report(scores, DEFAULT_WEIGHTS)
    print(report)

    # Ask if the user wants to save the report
    save = input("\nSave report to file? (y/n): ").strip().lower()
    if save == 'y':
        filename = input("Enter filename [scholareval_report.txt]: ").strip()
        if not filename:
            filename = "scholareval_report.txt"
        generate_report(scores, DEFAULT_WEIGHTS, Path(filename))
def main():
    parser = argparse.ArgumentParser(
        description="Calculate aggregate ScholarEval scores from dimension ratings",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  # Calculate from JSON file
  python calculate_scores.py --scores my_scores.json

  # Calculate with custom weights
  python calculate_scores.py --scores my_scores.json --weights custom_weights.json

  # Save report to file
  python calculate_scores.py --scores my_scores.json --output report.txt

  # Interactive mode
  python calculate_scores.py --interactive

Score JSON Format:
  {
    "problem_formulation": 4.5,
    "literature_review": 4.0,
    "methodology": 3.5,
    "data_collection": 4.0,
    "analysis": 3.5,
    "results": 4.0,
    "writing": 4.5,
    "citations": 4.0
  }

Weights JSON Format:
  {
    "problem_formulation": 0.15,
    "literature_review": 0.15,
    "methodology": 0.20,
    "data_collection": 0.10,
    "analysis": 0.15,
    "results": 0.10,
    "writing": 0.10,
    "citations": 0.05
  }
"""
    )

    parser.add_argument('--scores', type=Path, help='Path to JSON file with dimension scores')
    parser.add_argument('--weights', type=Path, help='Path to JSON file with dimension weights (optional)')
    parser.add_argument('--output', type=Path, help='Path to output report file (optional)')
    parser.add_argument('--interactive', '-i', action='store_true', help='Run in interactive mode')

    args = parser.parse_args()

    # Interactive mode
    if args.interactive:
        interactive_mode()
        return

    # File mode
    if not args.scores:
        parser.print_help()
        print("\nError: --scores is required (or use --interactive)")
        sys.exit(1)

    scores = load_scores(args.scores)
    weights = load_weights(args.weights)

    report = generate_report(scores, weights, args.output)

    # Print to stdout if no output file specified
    if not args.output:
        print(report)


if __name__ == '__main__':
    main()