Chapter 02 First Steps in Text Mining

키성열 2022. 3. 9. 22:08

2.1 Collecting Basic Data with Web Crawling

Examining the Structure of the Target Page

import requests
from bs4 import BeautifulSoup
import re

# Namuwiki "Recent Changes" page used as the crawl target
source_url = 'https://namu.wiki/RecentChanges'

# Download the page and parse it
req = requests.get(source_url)
html = req.content
soup = BeautifulSoup(html, 'lxml')

# Locate the rows of the recent-changes table
contents_table = soup.find(name='table', attrs={'class': 'table-hover'})
table_body = contents_table.find(name='tbody')
table_rows = table_body.find_all(name='tr')

# Build an absolute URL from the first link in each row's first cell
page_url_base = "https://namu.wiki"
page_urls = []
for row in table_rows:
    first_td = row.find_all('td')[0]
    td_url = first_td.find_all('a')
    if len(td_url) > 0:
        page_url = page_url_base + td_url[0].get('href')
        page_urls.append(page_url)

# Remove duplicates and print the first five URLs
page_urls = list(set(page_urls))
for page in page_urls[:5]:
    print(page)

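This section crawls the page using libraries that are new to me, requests and BeautifulSoup. However, the code does not run for me. The book's example seems to be wrong.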
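One way to see why a crawl like this stops working is to separate the parsing from the download and guard every lookup, since `soup.find` returns `None` when the expected tag is missing. The sketch below is only an illustration under assumptions: `extract_page_urls` and `SAMPLE_HTML` are hypothetical names, the sample markup is made up to stand in for a downloaded page, and the built-in `html.parser` is swapped in for `lxml` so that nothing beyond BeautifulSoup is needed.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

PAGE_URL_BASE = "https://namu.wiki"


def extract_page_urls(html, base=PAGE_URL_BASE):
    """Return unique absolute URLs from the first cell of each table row."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", attrs={"class": "table-hover"})
    if table is None:
        # The expected table is missing, so return nothing instead of crashing.
        return []
    urls = []
    for row in table.find_all("tr"):
        first_td = row.find("td")
        link = first_td.find("a") if first_td is not None else None
        if link is not None and link.get("href"):
            url = urljoin(base, link["href"])
            if url not in urls:  # keep first-seen order while dropping duplicates
                urls.append(url)
    return urls


# Hypothetical markup standing in for a downloaded page (not real site data).
SAMPLE_HTML = """
<table class="table-hover">
  <tbody>
    <tr><td><a href="/w/Python">Python</a></td><td>12:00</td></tr>
    <tr><td><a href="/w/Python">Python</a></td><td>11:59</td></tr>
    <tr><td>no link here</td><td>11:58</td></tr>
  </tbody>
</table>
"""

print(extract_page_urls(SAMPLE_HTML))
print(extract_page_urls("<html><body>no table</body></html>"))
```

If the second call's situation is what happens with the real page, an empty list points to a change in the page markup or a blocked request rather than a mistake in the loop itself.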

Collecting Text Information

# Download and parse the first URL in page_urls
req = requests.get(page_urls[0])
html = req.content
soup = BeautifulSoup(html, 'lxml')

# Extract the title, category, and body text
title = soup.find(name='h1', attrs={"class": "title"})
category = soup.find(name="div", attrs={"class": "wiki-category"})
content_clearfix = soup.find(name='div', attrs={"class": 'wiki-content clearfix'})
print(title.text)
print(category.text)
print(content_clearfix.text)

The code does not work.
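A failure at this step often comes from calling `.text` on a `None` result, which raises `AttributeError` whenever one of the three tags is absent. The following is a minimal sketch of a more forgiving extraction, not the book's method: `text_or_empty`, `extract_page_text`, and `SAMPLE_PAGE` are hypothetical names, the sample markup is invented, and the content lookup matches on the single class `wiki-content` as an assumption.

```python
from bs4 import BeautifulSoup


def text_or_empty(soup, name, css_class):
    """Return the stripped text of the first matching tag, or '' if none matches."""
    tag = soup.find(name, attrs={"class": css_class})
    return tag.get_text(" ", strip=True) if tag is not None else ""


def extract_page_text(html):
    """Collect title, category, and body text without failing on missing tags."""
    soup = BeautifulSoup(html, "html.parser")
    return {
        "title": text_or_empty(soup, "h1", "title"),
        "category": text_or_empty(soup, "div", "wiki-category"),
        "content": text_or_empty(soup, "div", "wiki-content"),
    }


# Hypothetical page with no category block (not real site data).
SAMPLE_PAGE = """
<h1 class="title">Python</h1>
<div class="wiki-content clearfix"><p>Python is a language.</p></div>
"""

page = extract_page_text(SAMPLE_PAGE)
print(page)
```

An empty string in the result shows which selector found nothing, which makes it easier to tell whether a class name is wrong or the page layout is different from what the book describes.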

2.2 Analyzing Keywords on the Namuwiki Recent Changes Page

The code does not work, so I am giving up on it.
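Even without a working crawl, the keyword-counting idea can be tried on text that is already in memory. The sketch below is an assumption-laden stand-in, not the book's code: it splits text into runs of Hangul syllables with a regular expression and counts them with `collections.Counter`. `top_keywords` and `SAMPLE_TEXTS` are hypothetical names and the sample sentences are made up.

```python
import re
from collections import Counter


def top_keywords(texts, n=3, min_length=2):
    """Count runs of Hangul syllables of at least min_length across all texts."""
    counter = Counter()
    for text in texts:
        tokens = re.findall("[가-힣]+", text)
        counter.update(t for t in tokens if len(t) >= min_length)
    return counter.most_common(n)


# Hypothetical sample sentences standing in for crawled page text.
SAMPLE_TEXTS = [
    "파이썬 크롤링 연습",
    "파이썬 텍스트 마이닝",
    "텍스트 마이닝과 파이썬",
]

print(top_keywords(SAMPLE_TEXTS, n=1))  # [('파이썬', 3)]
```

Because this only splits on non-Hangul characters, a word with a particle attached (such as 마이닝과) is counted separately from its bare form. A morphological analyzer would probably be needed for a proper analysis.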
