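Blog post

Extracting Abbreviations from Text or Web Pages with Python

Read 3305 times | 2016-11-13 16:56 | Personal category: Python | System category: Research notes | Tags: abbreviations, extraction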

import re
from urllib.request import urlopen  # Python 3; the original used Python 2's urllib.urlopen

from bs4 import BeautifulSoup

url = "http://journals.plos.org/plosone/article?id=info%3Adoi/10.1371/journal.pone.0162069"

# download the page and parse it (the "lxml" parser requires the lxml package)
response = urlopen(url)
page = response.read()
soup = BeautifulSoup(page, "lxml")

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()  # rip it out

# get text
text = soup.get_text()

# optional cleanup, left disabled:
# break into lines and remove leading and trailing space on each
#lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
#chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
#text = '\n'.join(chunk for chunk in chunks if chunk)
#print(text)
#print(type(soup))
#print(soup.prettify())

# You can change the regex if it doesn't work properly.
# The groups are non-capturing so that findall() returns the whole match
# instead of a tuple of group contents.
pattern = re.compile(r"(?<=\s).{0,2}\w*(?:[A-Z]{2}|[A-Z]\w[A-Z])\w*.{0,2}(?=\s)")
result_list1 = pattern.findall(text)

# Delete repeated elements.
result_set = set(result_list1)
result_list2 = list(result_set)

# The results are not ideal yet. I will revise this once I am more fluent with bs4.
print(result_list2)
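
The matching and de-duplication steps above can be tried offline on a plain string, without fetching a page. The following is only a minimal sketch under stated assumptions: it swaps in a simplified word-boundary pattern instead of the script's lookaround regex, and the names ABBREV_RE, find_abbreviations, and sample are hypothetical, not part of the original post. It treats a word as a candidate abbreviation if it contains two consecutive capital letters, or two capitals separated by exactly one word character, and it keeps the order of first appearance instead of going through a set.

```python
import re

# Assumed, simplified pattern (not the regex from the script above):
# a whole word containing two consecutive capitals, or two capitals
# separated by exactly one word character.
ABBREV_RE = re.compile(r"\b\w*(?:[A-Z]{2}|[A-Z]\w[A-Z])\w*\b")


def find_abbreviations(text):
    """Return candidate abbreviations in order of first appearance."""
    seen = set()
    found = []
    for match in ABBREV_RE.finditer(text):
        word = match.group(0)
        if word not in seen:  # skip repeats but preserve ordering
            seen.add(word)
            found.append(word)
    return found


# Made-up sample sentence, used only for illustration.
sample = "We used PCR and mRNA assays (PCR was run twice) on CpG sites in Berlin."
print(find_abbreviations(sample))  # ['PCR', 'mRNA', 'CpG']
```

Because the pattern anchors on word boundaries, surrounding punctuation such as parentheses is left out of the match. It can still pick up all-caps headings or identifiers, so the output is best treated as a list of candidates to review.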

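To repost this article, please contact the original author for permission, and state that it comes from the ScienceNet blog of 吕波 (Lü Bo).
Link: https://wap.sciencenet.cn/blog-645111-1014531.html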
