In the previous post, we got a rough idea of how to use the pydub library. Today's goal is to write a crawler that collects song information.
Python's standard library already includes packages for web crawling. The official Chinese documentation for each Python version is at https://docs.python.org/zh-cn/ (this site is very useful and worth bookmarking if you are learning Python). The official documentation can be dense, so it works best when read alongside a few tutorials.
From what I have learned, a crawler in Python can use the traditional urllib library or the more advanced Requests library. I will use urllib for now. Its urllib.request module is used to open URLs, with the following signature:
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
It looks complicated, but every parameter except url has a default, so we only need to pass the url. Searching Baidu Encyclopedia for 烟花易冷 ("Fireworks are easy to cool") shows that the page url is https://baike.baidu.com/item/烟花易冷/211. Changing the title part and entering https://baike.baidu.com/item/七里香 (Qilixiang) successfully opens the Baidu entry for Qilixiang, and the url is automatically updated to https://baike.baidu.com/item/七里香/2181450. So a url built from the song title alone is usable, nice.
Looking at the page reveals a new problem: Qilixiang has several meanings, and the default entry is Jay Chou's album Qilixiang, not Jay Chou's song Qilixiang. Open the page source of the "Fireworks are easy to cool" and Qilixiang results and compare the selected entry in each (the first line is from the former, the second from the latter):
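The url construction described above can be sketched as follows. This is a minimal sketch, not a definitive implementation: the helper names build_url and fetch, the 10-second timeout, and the assumption that the page is UTF-8 are all mine. The percent-encoded form of 七里香 that it prints is the same one that appears in the page source.

```python
from urllib import parse, request

BASE_URL = "https://baike.baidu.com/item/"


def build_url(title: str) -> str:
    """Percent-encode a (possibly non-ASCII) entry title and append it to the base url."""
    return BASE_URL + parse.quote(title)


def fetch(url: str, timeout: float = 10.0) -> str:
    """Download a page and decode it as UTF-8 (an assumed encoding)."""
    with request.urlopen(url, timeout=timeout) as res:
        return res.read().decode("utf-8")


# Only the url is built here; fetch() is not called, so no network access is needed.
print(build_url("七里香"))
```

Only url is passed positionally to urlopen; timeout is given as a keyword, and every other parameter keeps its default.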
<li class="item">▪<span class="selected">Jay Chou sings songs</span></li>
<li class="item">▪<span class="selected">2004 Jay Chou's music album</span></li>
The selected entries differ. In addition, the following markup appears near the latter line:
<li class="item">▪<span class="selected">2004 Jay Chou's music album</span></li>
<li class="item">▪<a title="Xi Murong's poetry collection" href='/item/%E4%B8%83%E9%87%8C%E9%A6%99/2181435#Viewpagecontent'>Xi Murong's poetry collection</a></li>
<li class="item">▪<a title="2007 Thai TV series" href='/item/%E4%B8%83%E9%87%8C%E9%A6%99/2181466#Viewpagecontent'>2007 Thai TV series</a></li>
<li class="item">▪<a title="Chen Shuhua sings songs" href='/item/%E4%B8%83%E9%87%8C%E9%A6%99/2172939#Viewpagecontent'>Chen Shuhua sings songs</a></li>
<li class="item">▪<a title="traditional Chinese medicine" href='/item/%E4%B8%83%E9%87%8C%E9%A6%99/4494994#Viewpagecontent'>traditional Chinese Medicine</a></li>
<li class="item">▪<a title="scenic spot" href='/item/%E4%B8%83%E9%87%8C%E9%A6%99/3518031#Viewpagecontent'>tourist attractions</a></li>
<li class="item">▪<a title="2005 Books published by the Central Compilation publishing house in" href='/item/%E4%B8%83%E9%87%8C%E9%A6%99/20490760#Viewpagecontent'>books published by the Central Compilation and Translation Press in 2005</a></li>
<li class="item">▪<a title="Novel Qi Li Xiang" href='/item/%E4%B8%83%E9%87%8C%E9%A6%99/3922533#Viewpagecontent'>novel Qi Li Xiang</a></li>
<li class="item">▪<a title="Thymus of Rutaceae" href='/item/%E4%B8%83%E9%87%8C%E9%A6%99/4499679#Viewpagecontent'>thyme of Rutaceae</a></li>
<li class="item">▪<a title="Jay Chou sang songs in Taiwan in 2004" href='/item/%E4%B8%83%E9%87%8C%E9%A6%99/12009481#Viewpagecontent'>Jay Chou sang songs in Taiwan in 2004</a></li>
<li class="item">▪<a title="Snacks in Taiwan, China" href='/item/%E4%B8%83%E9%87%8C%E9%A6%99/2181417#ViewPageContent'>Taiwan China snacks</a></li>
<li class="item">▪<a title="Xi Murong creates new poetry" href='/item/%E4%B8%83%E9%87%8C%E9%A6%99/22593324#Viewpagecontent'>Xi Murong creates new poems</a></li>
<li class="item">▪<a title="Dark night literature network novel" href='/item/%E4%B8%83%E9%87%8C%E9%A6%99/22781892#Viewpagecontent'>dark night literature network novel</a></li>
<a href="javascript:;" class="fold-on">Expand all<em class="cmn-icon cmn-icons cmn-icons_arrow-b"></em></a>
<a href="javascript:;" class="fold-off">Put away<em class="cmn-icon cmn-icons cmn-icons_arrow-t"></em></a>
The list contains an option "Jay Chou sang songs in Taiwan in 2004". What it shares with the selected entry of the first page is the keyword "Jay Chou sings songs", so that keyword can be used to find the right link. Next, look for the information we actually need, which is in this part of each page:
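Picking the right link out of the disambiguation list can be sketched like this. It is a minimal sketch under assumptions: the regular expression relies on the exact quoting seen in the markup above (title in double quotes, href in single quotes), the helper name find_entry is hypothetical, and the keywords are the translated ones used in this post.

```python
import re
from typing import List, Optional

# One disambiguation link: group 1 is the title, group 2 is the href.
LINK_RE = re.compile(r"""<a title="([^"]*)" href='([^']*)'""")


def find_entry(html: str, keywords: List[str]) -> Optional[str]:
    """Return the href (without its #fragment) of the first link whose title has all keywords."""
    for title, href in LINK_RE.findall(html):
        if all(k in title for k in keywords):
            return href.split("#", 1)[0].strip()
    return None


sample = """
<li class="item">▪<a title="Chen Shuhua sings songs" href='/item/%E4%B8%83%E9%87%8C%E9%A6%99/2172939#ViewPageContent'>Chen Shuhua sings songs</a></li>
<li class="item">▪<a title="Jay Chou sang songs in Taiwan in 2004" href='/item/%E4%B8%83%E9%87%8C%E9%A6%99/12009481#ViewPageContent'>Jay Chou sang songs in Taiwan in 2004</a></li>
"""
print(find_entry(sample, ["Jay Chou", "song"]))
```

Requiring every keyword in the title is what separates Jay Chou's song from Chen Shuhua's song of the same name.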
<meta name="description" content="《Fireworks are easy to cool》 is a song with lyrics by Fang Wenshan, arranged by Huang Yuxun, and composed and sung by Jay Chou. It is included in Jay Chou's album Cross Era, released on May 18, 2010. In 2011, the song won the "Golden Melody of the Year" award at the 2010 Beijing pop music ceremony.">
<meta name="description" content="《Qilixiang》 is a song sung by Jay Chou, with lyrics by Fang Wenshan, music by Jay Chou, and arrangement by Zhong Xingmin. It is included in Jay Chou's album of the same name, Qilixiang, released on August 3, 2004. In 2004, the song won three awards (best composition, best producer, and best arrangement) in the Hong Kong TVB8 top ten golden songs. In 2005, the song won many awards, such as the 27th top ten Chinese Golden Song Award, the excellent popular Chinese song award, and the best song of the year in the 11th global Chinese music list.">
So what is the difference between composing, arranging, and scoring a song? A Baidu search gives:
1. Conceptual difference: composing generally means writing the melody for the lyrics; arranging generally means writing the accompaniment for the song; scoring means writing existing music down as numbered notation, staff notation, and so on.
2. Order: composing comes first, then arranging and scoring.
Well, that was a long detour. At this point the preparatory work is more or less done.
The following table compares the names of song tag fields on different platforms with the corresponding ffmpeg keys:
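Pulling the description out of a page can also be sketched with the standard library's html.parser instead of a regular expression. This is only an illustration: the class name DescriptionParser and the helper get_description are hypothetical, and the sample page is a shortened stand-in for the real markup.

```python
from html.parser import HTMLParser
from typing import Optional


class DescriptionParser(HTMLParser):
    """Remember the content of the first <meta name="description"> tag."""

    def __init__(self):
        super().__init__()
        self.description: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag == "meta" and self.description is None:
            attr_map = dict(attrs)
            if attr_map.get("name") == "description":
                self.description = attr_map.get("content")


def get_description(html: str) -> Optional[str]:
    """Return the description text, or None if the page has no such tag."""
    parser = DescriptionParser()
    parser.feed(html)
    return parser.description


page = '<html><head><meta name="description" content="Qilixiang is a song sung by Jay Chou."></head></html>'
print(get_description(page))
```

Compared with a greedy regular expression, the parser returns only the attribute value, so there is no surrounding tag text to strip afterwards.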
Windows | iTunes(Info tab) | id3v2.3 | ffmpeg key | ffmpeg example |
---|---|---|---|---|
Title | Title | TIT2 | title | -metadata title="vast sea and sky" |
Subtitle | Description (Video tab) | TIT3 | TIT3 | -metadata TIT3="Beyond 20th Anniversary Edition" |
Rating | n/a | n/a | n/a | n/a |
Comments | Comments | COMM | n/a | n/a |
Contributing artists | Artist | TPE1 | artist | -metadata artist="Huang Jiaju" |
Album artist | Album artist | TPE2 | album_artist | -metadata album_artist="Josh Groban" |
Album | Album | TALB | album | -metadata album="Closer" |
Year | Year | TYER | date | -metadata date="2009" |
# | Track Number | TRCK | track | -metadata track="3/12" |
Genre | Genre | TCON | genre | -metadata genre="Vocal" |
Publisher | n/a | TPUB | publisher | -metadata publisher="Heaven Church" |
Encoded by | n/a | TENC | encoded_by | -metadata encoded_by="Joshua" |
Author URL | n/a | WOAR | n/a | n/a |
Copyright (non editable) | n/a | TCOP | copyright | -metadata copyright="℗ lqsoft" |
Composers | n/a | TCOM | composer | -metadata composer="Joshua" |
Conductors | n/a | TPE3 | performer | -metadata performer="Joshua" |
Group description | Grouping | TIT1 | TIT1 | -metadata TIT1="The Classics" |
Mood | n/a | n/a | n/a | n/a |
Part of set | Disc Number | TPOS | disc | -metadata disc="1/2" |
Initial key | n/a | TKEY | TKEY | -metadata TKEY="G" |
Beats-per-minute | BPM | TBPM | TBPM | -metadata TBPM="120" |
Part of a compilation | Part of a compilation | TCMP | n/a | n/a |
n/a | n/a | TLAN | language | -metadata language="eng" |
n/a | n/a | TSSE | encoder | -metadata encoder="iTunes v10" |
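The "ffmpeg example" column can be turned into a full command line with a small helper. This is a minimal sketch: the function name ffmpeg_tag_command and the file names are hypothetical, and it only builds the argument list without running ffmpeg.

```python
from typing import Dict, List


def ffmpeg_tag_command(src: str, dst: str, tags: Dict[str, str]) -> List[str]:
    """Build an ffmpeg argument list that converts src to dst and sets metadata tags."""
    cmd = ["ffmpeg", "-i", src]
    for key, value in tags.items():
        # Each tag becomes one "-metadata key=value" pair, as in the table above.
        cmd += ["-metadata", f"{key}={value}"]
    cmd.append(dst)
    return cmd


cmd = ffmpeg_tag_command(
    "Qilixiang.wav",
    "Qilixiang.flac",
    {"title": "Qilixiang", "artist": "Jay Chou", "date": "2004"},
)
print(" ".join(cmd))
```

Keeping the command as a list means it could be handed to subprocess.run without a shell, so titles containing spaces or quotes need no extra escaping.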
Because the downloaded file names carry a numeric track prefix:
01.Cowboys are busy.wav
01.Said goodbye.wav
So first write a script that renames the files and exports the song list:
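    import os
    import re

    # Patterns for the leading track number ("01.") and the ".wav" extension
    pattern = [r"^[0-9]+\.", r"\.wav$"]
    music_dir = 'E:\\BaiduNetdiskDownload\\Jay Chou'
    os.chdir(music_dir)

    song_names = []
    for file in os.listdir(music_dir):
        if not file.endswith(".wav"):
            continue  # skip anything that is not a song file
        new_name = re.sub(pattern[0], "", file)  # strip the track number
        os.rename(file, new_name)
        # Keep the bare title (no extension) for the song list
        song_names.append(re.sub(pattern[1], "", new_name))

    with open("song_list.txt", "w") as p:
        for name in song_names:
            p.write(name + "\n")

The resulting list looks like this (the titles are listed without the ".wav" suffix):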
Qilixiang
doomsday
Dongfeng break
Uncle Joker
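The renaming rule can be checked on plain strings before any file is touched. The sketch below is an illustration only: clean_name is a hypothetical helper, and unlike the script above it also tolerates whitespace after the dot and an upper-case extension, which are my assumptions about what messy file names might look like.

```python
import re

TRACK_PREFIX = re.compile(r"^[0-9]+\.\s*")       # "01." plus optional spaces
WAV_SUFFIX = re.compile(r"\.wav$", re.IGNORECASE)  # ".wav" or ".WAV" at the end


def clean_name(filename):
    """Return (new file name, bare title) for one downloaded file name."""
    new_filename = TRACK_PREFIX.sub("", filename)
    title = WAV_SUFFIX.sub("", new_filename)
    return new_filename, title


for name in ["01.Qilixiang.wav", "12. Dongfeng break.WAV", "Uncle Joker.wav"]:
    print(clean_name(name))
```

Running the rule as a pure function first makes it easy to spot names that would collide or be left unchanged before os.rename is called.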
Next comes the crawler script:
    from urllib import request
    from urllib import parse
    import re
    import os

    def getlist(file):
        """Read a text file and return its non-empty lines."""
        with open(file, "r") as p:
            return [line for line in p.read().split("\n") if line != '']

    def crawtext(url):
        """Download a page and decode it as UTF-8."""
        res = request.urlopen(url)
        return res.read().decode(encoding='utf-8', errors='strict')

    def isurl(patternlist, text):
        """Classify a page: 0 = the selected entry is the Jay Chou song,
        1 = no disambiguation list, 2 = another meaning is selected."""
        if re.search(patternlist[0], text):
            return 0 if re.search(patternlist[1], text) else 2
        return 1

    def gettext(pattern, raw_text):
        """Return the first match of pattern, or False if there is none."""
        a = re.search(pattern, raw_text)
        return a.group(0) if a else False

    def geturl(pattern, patternlist, raw_text):
        """Find the link to Jay Chou's song and return its path below /item/."""
        a = re.search(pattern, raw_text)
        if not a:
            return False
        tmp = re.sub(patternlist[0], "", a.group(0))
        return re.sub(patternlist[1], "", tmp)

    baseurl = r"https://baike.baidu.com/item/"
    pattern1 = ['<li class="item">▪<span class="selected">',
                '<li class="item">▪<span class="selected">.*Jay Chou.*song.*</span></li>']
    pattern2 = '<meta name="description" content=".*">'
    pattern3 = '<li class="item">▪<a title=".*Jay Chou.*song.*>'
    pattern4 = [".*href='/item/", "'>.*"]

    music_dir = "E:\\BaiduNetdiskDownload\\Jay Chou"
    os.chdir(music_dir)
    song_list = getlist("song_list.txt")
    text_list = []
    for file in song_list:
        name = re.sub(r"\.wav$", "", file)
        text = crawtext(baseurl + parse.quote(name))
        if isurl(pattern1, text) == 2:
            # Another meaning is selected: follow the link to Jay Chou's song
            key = geturl(pattern3, pattern4, text)
            if not key:
                text_list.append(name + " error 2 ")
                continue
            text = crawtext(baseurl + key)
        # gettext returns False when the page has no description
        description = gettext(pattern2, text)
        text_list.append(description if description else name + " error 1 ")

    with open("text.txt", "w") as p:
        for line in text_list:
            p.write(line + "\n")

There are still some problems, such as three "error 2" results:
Chrysanthemum terrace error 2
Agreed happiness error 2
Track error 2
A search in the browser shows that the second failure is simply a wrong song name: the title in my file list differs slightly from the actual title of Jay Chou's song. For "Chrysanthemum terrace" and "Track", however, the entry subtitles are:
Jay Chou sings the ending song of the film "all over the city with golden armour"
Jay Chou sings the theme song of the film "looking for Jay Chou"
Frustratingly, these subtitles say "ending song" and "theme song", and in the original Chinese neither contains the keyword "song" that the pattern searches for. In addition, a few results are wrong because the entry did not redirect automatically and the singer is not Jay Chou (Dedication is a song Jay Chou wrote for Chen Xiaochun).
The script could clearly be improved to handle these cases, but that is more trouble than it is worth for just a few songs, so I added them manually and corrected the wrong song name. With that, the raw data is downloaded; a sample looks like this:
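One possible way to loosen the match is sketched below. It is an assumption of mine, not what the script in this post does: prefer a subtitle that mentions the artist and one of several keywords, and otherwise fall back to the first subtitle that mentions the artist at all. The helper name pick_subtitle and the sample subtitles are hypothetical.

```python
from typing import Optional, Sequence


def pick_subtitle(subtitles: Sequence[str], artist: str,
                  keywords: Sequence[str]) -> Optional[str]:
    """Pick the subtitle most likely to be the artist's song."""
    by_artist = [s for s in subtitles if artist in s]
    for s in by_artist:
        if any(k in s for k in keywords):
            return s  # best case: artist and a keyword both present
    # Fallback: any entry by the artist (may be an album, so check it manually)
    return by_artist[0] if by_artist else None


subtitles = [
    "traditional Chinese medicine",
    'Jay Chou sings the ending song of the film "all over the city with golden armour"',
]
print(pick_subtitle(subtitles, "Jay Chou", ["sings songs"]))
```

The fallback trades precision for coverage, so its results still deserve a manual check.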
<meta name="description" content="《Qilixiang》 is a song sung by Jay Chou, with lyrics by Fang Wenshan, music by Jay Chou, and arrangement by Zhong Xingmin. It is included in Jay Chou's album of the same name, Qilixiang, released on August 3, 2004. In 2004, the song won three awards for the best composition, producer and arrangement of the top ten Golden Songs of TVB8 in Hong Kong. In 2005, the song won many awards, such as the 27th top ten Chinese Golden Song Award, the excellent popular Chinese song award, and the best song of the year in the 11th global Chinese music list.">
Next, clean up the data:
    import os
    import re

    def getlist(file):
        """Read a text file and return its non-empty lines."""
        with open(file, "r") as p:
            return [line for line in p.read().split("\n") if line != '']

    class SONG:
        """Tag fields extracted for one song."""
        def __init__(self, title):
            self.title = title
            self.artist = ""
            self.album = ""
            self.date = ""
            self.composer = ""

    def cuthead(pattern, text):
        """Drop everything up to and including the last match of pattern."""
        a = re.search(pattern, text)
        while a:
            text = text[a.span()[1]:]
            a = re.search(pattern, text)
        return text

    def search1(pattern, text):
        """Return the text between pattern[0] and the next pattern[1], or False."""
        a = re.search(pattern[0] + ".*?" + pattern[1], text)
        if not a:
            return False
        inner = re.sub(pattern[1], "", a.group(0))
        return cuthead(pattern[0], inner)

    def search2(pattern, text):
        """Return the first match of pattern, or False."""
        a = re.search(pattern, text)
        return a.group(0) if a else False

    def search3(pattern, text):
        pass  # placeholder, not used

    music_dir = "E:\\BaiduNetdiskDownload\\Jay Chou"
    os.chdir(music_dir)

    # NOTE: the keyword patterns are shown here in translated form; they have
    # to match the wording of the crawled description text.
    pattern1 = ["<", ">"]
    pattern2 = ["yes", "Singing"]
    pattern3 = ["song,.", ",Included"]
    pattern4 = ["Included.*?[0-9]+year[0-9]+month[0-9]+day", "[0-9]+year[0-9]+month[0-9]+day"]
    pattern5 = ["Album<", ">"]

    li = []
    for text in getlist("text.txt"):
        song = SONG(search1(pattern1, text))
        song.artist = search1(pattern2, text)
        song.album = search1(pattern5, text)
        # Narrow down to the "Included ... date" phrase first, then take the date
        song.date = search2(pattern4[1], str(search2(pattern4[0], text)))
        song.composer = search1(pattern3, text)
        li.append(song)

    with open("list.txt", "w") as p:
        for song in li:
            fields = [song.title, song.artist, song.album, song.date, song.composer]
            p.write("\t".join(str(f) for f in fields) + "\n")

Non-standardized data makes cleaning a tearful job. Even after running the program, I still had to check the output by hand and fix several irregular entries. The cleaned result looks like this (fields are separated by tabs):
Qilixiang	Jay Chou	Qilixiang	August 3, 2004	Fang Wenshan wrote lyrics, Jay Chou composed music and Zhong Xingmin arranged music
doomsday	Jay Chou	Fantexi PLUS	December 28, 2001	Jay Chou wrote lyrics and music
Dongfeng break	Jay Chou	Ye Huimei	July 31, 2003	Jay Chou composes music, Fang Wenshan fills in lyrics, and Lin Michael arranges music
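The "text between a start marker and an end marker" extraction used during cleaning can also be sketched with a capturing group. This is a sketch under assumptions: the helper name between is hypothetical, the markers are treated as literal strings rather than regular expressions, and the sample sentence is made up for illustration.

```python
import re
from typing import Optional


def between(start: str, end: str, text: str) -> Optional[str]:
    """Return the text between start and the next end, or None if not found.

    If start occurs again inside the match, only the part after its last
    occurrence is kept, mirroring what repeatedly cutting the head does.
    """
    m = re.search(re.escape(start) + "(.*?)" + re.escape(end), text)
    if m is None:
        return None
    return m.group(1).rsplit(start, 1)[-1]


sentence = 'It is included in the album "Qilixiang" released in 2004.'
print(between('album "', '"', sentence))
```

Returning None instead of False keeps a missing field distinct from an empty string when the result is later written to the list file.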
Next comes the last step, format conversion and adding the tags:
    import os
    import pydub

    def getlist(file):
        """Read a text file and return its non-empty lines."""
        with open(file, "r") as p:
            return [line for line in p.read().split("\n") if line != '']

    music_dir = "E:\\BaiduNetdiskDownload\\Jay Chou"
    os.chdir(music_dir)
    os.makedirs("test", exist_ok=True)  # do not fail if the folder already exists

    for line in getlist("list.txt"):
        # Fields: title, artist, album, date, composer (tab-separated)
        tmp = line.split("\t")
        song = pydub.AudioSegment.from_wav(tmp[0] + ".wav")
        dic = {"title": tmp[0], "artist": tmp[1], "album": tmp[2],
               "date": tmp[3], "composer": tmp[4]}
        # Convert to FLAC and write the tags in one step
        song.export(os.path.join("test", tmp[0] + ".flac"), format="flac", tags=dic)

Looking back over this whole post, format conversion turned out to be the simplest part.
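Since several lines of the cleaned list were edited by hand, it may help to validate each line before it is used for tagging. The sketch below is an illustration only: line_to_tags and FIELDS are hypothetical names, and the five-field layout is the one written out by the cleaning step.

```python
FIELDS = ("title", "artist", "album", "date", "composer")


def line_to_tags(line):
    """Turn one tab-separated line of the cleaned list into a tag dictionary."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != len(FIELDS):
        raise ValueError(
            f"expected {len(FIELDS)} tab-separated fields, got {len(parts)}: {line!r}"
        )
    return dict(zip(FIELDS, parts))


tags = line_to_tags("Qilixiang\tJay Chou\tQilixiang\t2004\tJay Chou")
print(tags)
```

A dictionary built this way has the same keys as the one passed to export through tags, and a malformed line fails with a clear message instead of an IndexError.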