Xu Peiyang's blog http://blog.sciencenet.cn/u/xupeiyang Tracking the international frontier, serving domestic research


Information Analysis (Literature Subject Analysis): Combining Controlled Language with Natural Language

2999 views 2013-1-8 14:38 | Personal category: Information analysis | System category: Research notes

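      Information analysis often requires quantitative analysis of the content of the literature, that is, subject analysis. This is best done by combining controlled language with natural language.
      Analyzing literature content with a single language, whether subject headings alone or keywords (free-text words) alone, has limitations. The choice of subject headings, the number of headings, the indexing depth, and the indexing rules are all set by people, and both manual indexing and automatic computer indexing produce wrong and missed index terms. This affects the accuracy of subject-based bibliometric results. Keywords, for their part, are not standardized: they include many synonyms, near-synonyms, short forms, abbreviations, and spelling variants, and some are conceptually ambiguous, so they also affect the completeness and accuracy of bibliometric results.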
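The combined approach can be pictured with a minimal Python sketch. It assumes a small, hypothetical synonym table (`SYNONYMS`) and made-up records; the names `normalize` and `count_topics` are illustrative only and do not come from any platform mentioned in this post. In practice the synonym table would be built from a real thesaurus.

```python
from collections import Counter

# Hypothetical synonym table: free-text variant -> preferred (controlled) term.
SYNONYMS = {
    "tumor": "Neoplasms",
    "tumour": "Neoplasms",
    "cancer": "Neoplasms",
    "neoplasm": "Neoplasms",
    "high blood pressure": "Hypertension",
    "htn": "Hypertension",
}
# Let each preferred term also map to itself, case-insensitively.
LOOKUP = {**SYNONYMS, **{v.lower(): v for v in SYNONYMS.values()}}


def normalize(term: str) -> str:
    """Map a free-text keyword to its preferred term; unknown terms are kept, lowercased."""
    key = " ".join(term.lower().split())  # lowercase and collapse whitespace
    return LOOKUP.get(key, key)


def count_topics(records):
    """Count each topic once per record, merging indexed headings with normalized keywords."""
    counts = Counter()
    for rec in records:
        topics = set(rec.get("headings", []))
        topics.update(normalize(k) for k in rec.get("keywords", []))
        counts.update(topics)
    return counts


# Made-up records for illustration.
records = [
    {"headings": ["Neoplasms"], "keywords": ["Tumour", "cancer"]},
    {"headings": [], "keywords": ["HTN"]},  # heading missed by the indexer
    {"headings": ["Hypertension"], "keywords": ["high  blood pressure"]},
]
topic_counts = count_topics(records)
```

In this sketch the second record has no heading at all, yet it is still counted under Hypertension because its keyword is mapped to the preferred term. The first record is counted only once under Neoplasms even though two of its keywords are variants of the same concept.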
 
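1. Information platforms that analyze by subject headings (MeSH)

Medical literature data analysis (PubMed/MEDLINE data): http://www.gopubmed.org/web/gopubmed/1?WEB0fyuvjqgch8orIeIpI0

2. Information platforms that analyze by keywords

Microsoft Academic Search data analysis (http://academic.research.microsoft.com/PublicationList?query=targeted proteomics&desType=8&desID=11039&SearchDomain=4)

3. Information platforms that combine subject headings and keywords

Examples of analysis by subject headings and by keywords:
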
The distribution of controlled-language terms (MeSH subject headings) across the 22,454,842 records in the US PubMed medical database is as follows:

Subject heading [number of records]
Humans [12.401 M]
Patients [4.219 M]
Adult [5.495 M]
Middle Aged [3.019 M]
Aged [2.161 M]
Evaluation Studies as Topic [2.647 M]
Animals [16.478 M]
Child [1.704 M]
Adolescent [1.514 M]
Rats [1.374 M]
Genes [1.647 M]
Mice [1.213 M]
Proteins [6.053 M]
Methods [0.936 M]
Time Factors [0.935 M]
Pharmaceutical Preparations [1.634 M]
DNA [1.080 M]
Hospitals [0.814 M]
United States [1.245 M]
Hospitalization [0.795 M]
Pregnancy [0.816 M]
Women [0.729 M]
Neoplasms [2.855 M]
antigen binding [0.717 M]
Child, Preschool [0.715 M]
Serum [0.789 M]
Infant [1.050 M]
Surgery [0.704 M]
Syndrome [0.678 M]
Tissues [4.034 M]
Risk Factors [0.661 M]
membrane [0.929 M]
Arteries [0.823 M]
Membranes [1.004 M]
Antibodies [1.023 M]
Incidence [0.606 M]
Nature [0.606 M]
Infant, Newborn [0.594 M]
Pressure [0.632 M]
Treatment Outcome [0.609 M]
Kinetics [0.573 M]
Mutation [0.795 M]
Cell Line [0.781 M]
Viruses [0.923 M]
Muscles [0.823 M]
Lung [0.535 M]
Immunization [0.622 M]
Diagnosis [7.440 M]
Recurrence [477,165]
Electrons [472,898]
Sensitivity and Specificity [0.571 M]
Follow-Up Studies [460,997]
Prevalence [455,159]
Temperature [482,554]
Carcinoma [0.717 M]
RNA, Messenger [0.502 M]
Cells, Cultured [1.340 M]
Electronics [0.621 M]
Personal Autonomy [421,477]
Kidney [466,162]
Neurons [0.653 M]
Oxides [1.001 M]
metabolic process [2.554 M]
Enzymes [2.993 M]
Wounds and Injuries [1.083 M]
Oxygen [392,286]
Calcium [378,588]
Hepatitis [384,202]
pathogenesis [364,891]
Antigens [1.192 M]
Hypertension [359,726]
Rabbits [310,550]
signal transduction [485,473]
Bisphenol A-Glycidyl Methacrylate [308,080]
gene expression [466,836]
Chronic Disease [304,487]
Escherichia coli [301,568]
Dogs [285,401]
Postoperative Complications [418,333]
Cattle [282,257]
cytolysis [280,910]
cell death [0.719 M]
Hemorrhage [384,651]
Nonoxynol [257,332]
regulation of gene expression [438,318]
catabolic process [0.534 M]
gut development [243,410]
blood circulation [410,923]
Depression [232,632]
Parents [255,311]
Rats, Sprague-Dawley [229,263]
plasma membrane [398,797]
extracellular region [0.578 M]
receptor binding [0.988 M]
Great Britain [330,086]
site-specific telomere resolvase activity [183,231]
intracellular [1.657 M]
chromosome [366,463]
Vitamins [156,285]
Collagen [160,552]
 
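The list above mixes two notations, counts in millions ("12.401 M") and exact counts ("477,165"). To compare entries, both forms first need to be turned into plain integers. The following Python sketch shows one way to do that and to express each count as a share of the 22,454,842 records quoted above. The names `parse_entry` and `TOTAL_RECORDS` are hypothetical, and only three entries from the list are used as sample input.

```python
import re

TOTAL_RECORDS = 22_454_842  # PubMed total quoted in this post

# Matches "Term [12.401 M]" as well as "Term [477,165]".
ENTRY = re.compile(r"^(?P<term>.+?)\s*\[\s*(?P<num>[\d.,]+)\s*(?P<unit>M)?\s*\]$")


def parse_entry(line: str):
    """Return (term, count) for a line such as 'Humans [12.401 M]' or 'Recurrence [477,165]'."""
    # Drop zero-width joiners that can survive copy-paste from web pages.
    cleaned = line.replace("\u200d", "").strip()
    m = ENTRY.match(cleaned)
    if m is None:
        raise ValueError(f"unrecognized entry: {line!r}")
    num = m.group("num").replace(",", "")
    if m.group("unit"):
        # Values in millions are already rounded to the nearest thousand.
        count = round(float(num) * 1_000_000)
    else:
        count = int(num)
    return m.group("term"), count


sample = ["Humans [12.401 M]", "Neoplasms [2.855 M]", "Recurrence [477,165]"]
parsed = dict(parse_entry(s) for s in sample)
shares = {term: count / TOTAL_RECORDS for term, count in parsed.items()}

for term, share in shares.items():
    print(f"{term}: {share:.1%}")
```

Because one record normally carries several headings, these shares are not expected to add up to 100%.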
 
The distribution of keywords (natural language) for medical literature in Microsoft Academic Search is as follows:

37,730 keywords in total
Top keywords in Medicine

Subject categories
Anatomy
Andrology
Cardiology
Dentistry
Dermatology
Diabetes
Diseases
Emergency & Critical Care
Endocrinology
Family Medicine
Gynecology & Obstetrics
Immunology
Medical Education & Training
Neuroscience
Nursing
Nutrition
Oncology
Ophthalmology
Pathology
Pharmacology
Physiology
Psychiatry & Psychology
Urology
Overall for Medicine
 
Sorted by number of citations:

Rank  Keyword                       Publications  Citations
1     Genetics                      110748        1591978
2     Risk Factors                  113600        1489525
3     Enzyme                        87346         1138880
4     Cell Line                     74944         1033424
5     Gene Expression               59419         960749
6     Breast Cancer                 77534         891335
7     Confidence Interval           45087         821640
8     Central Nervous System        46158         753974
9     Indexation                    69426         701420
10    Dopamine                      40403         699249
11    Immune Response               42449         675040
12    Clinical Trial                56987         658594
13    Control Group                 66147         655947
14    United States                 46445         652237
15    Polymorphism                  53968         652002
16    Statistical Significance      62130         647581
17    Prospective Study             43765         635380
18    Endothelial Cell              37291         629286
19    Monoclonal Antibody           46069         628603
20    B Cell                        38263         609460
21    Amino Acid                    41029         607522
22    Animal Model                  42739         604426
23    Blood Pressure                58079         594620
24    Growth Factor                 31719         590405
25    High Risk                     51012         588852
26    Magnetic Resonance Image      52139         582106
27    Myocardial Infarct            41631         570068
28    Nitric Oxide                  35724         564410
29    Tumor Cells                   39037         560334
30    Cell Death                    27183         542359
31    Quality of Life               53459         540540
32    Transcription Factor          25560         540463
33    meta analysis                 24477         528970
34    Cardiovascular Disease        36029         516003
35    Wild Type                     29918         514508
36    Oxidant Stress                34727         509287
37    Spinal Cord                   37125         506676
38    Prostate Cancer               44722         503617
39    Spectrum                      42017         502671
40    Kinetics                      44325         486844
41    Immunohistochemistry          40128         485905
42    Side Effect                   44706         471689
43    Rat Brain                     27702         468513
44    Glutamate                     22845         459523
45    Odd Ratio                     27465         448244
46    Bone Marrow                   34457         444413
47    Cell Proliferation            29089         441908
48    Dendritic Cell                20085         441733
49    Serotonin                     29600         434088
50    Body Weight                   37072         432794
51    Ultrasound                    55015         424048
52    Heart Failure                 38410         423566
53    Degeneration                  28746         423533
54    Tumor Necrosis Factor         22696         421151
55    Epithelial Cell               29270         415833
56    Astrocyte                     19258         390919
57    T Lymphocyte                  22776         388370
58    Cell Surface                  20329         387219
59    Signaling Pathway             22376         380243
60    Stem Cell                     24257         379959
61    Rheumatoid Arthritis          33154         372778
62    Molecular Mechanics           19733         367851
63    Nervous System                19688         357072
64    Control Subjects              20409         353928
65    Randomized Controlled Trial   23276         352136
66    Human Immunodeficiency Virus  22456         351993
67    Low Dose                      32190         350044
68    Cohort Study                  25596         349510
69    High Dose                     31146         349150
70    Randomized Trial              20062         346909
71    Human Brain                   13910         345454
72    Coronary Artery Disease       28685         345011
73    Blood Flow                    33746         342178
74    diabetes mellitus             32812         339977
75    Insulin Resistant             21472         339827
76    Lymph Node                    28570         339654
77    Computed Tomography           43005         339390
78    Prostaglandin                 32526         335549
79    Glycoprotein                  23439         335210
80    Immune System                 18508         333687
81    Cross Section                 27502         333429
82    Immunoglobulin                24876         333321
83    Clinical Outcome              27734         332116
84    Western Blot                  30583         330110
85    Systematic Review             22532         327609
86    Polymerase Chain Reaction     23876         327383
87    Mrna Expression               24670         322877
88    Body Mass Index               22067         321171
89    Multiple Sclerosis            25562         320685
90    Skeletal Muscle               26748         319039
91    Binding Site                  21867         318089
92    Prefrontal Cortex             10579         316790
93    Heart Rate                    33745         315596
94    type 2 diabetes               24877         315320
95    Cerebral Cortex               16292         314797
96    Case Report                   101551        313769
97    Colorectal Cancer             25838         313768
98    Positron Emission Tomography  16745         313655
99    Transgenic Mice               15058         312614
100   Time Course                   20333         312142
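The keyword table gives both a publication count and a citation count, so a simple derived measure is citations per publication. The Python sketch below computes it for four rows copied from the table. This ratio is an illustration added here, not a measure used in the original post, and the names `rows` and `ranked` are hypothetical.

```python
# (keyword, publications, citations), copied from four rows of the keyword table.
rows = [
    ("Genetics", 110748, 1591978),
    ("Risk Factors", 113600, 1489525),
    ("Prefrontal Cortex", 10579, 316790),
    ("Case Report", 101551, 313769),
]

# Citations per publication for each keyword.
ratios = {kw: cites / pubs for kw, pubs, cites in rows}

# Re-rank by the ratio instead of by total citations.
ranked = sorted(ratios.items(), key=lambda item: item[1], reverse=True)

for kw, ratio in ranked:
    print(f"{kw}: {ratio:.2f} citations per publication")
```

Ranking by the ratio changes the order. "Prefrontal Cortex" has far fewer publications than "Genetics" but more citations per publication, while "Case Report" has many publications and comparatively few citations for each.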
 
 
Data source:


https://wap.sciencenet.cn/blog-280034-650971.html
