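Hi, I have a document (linked here) that contains line breaks. These line breaks do not seem to be created by '\n', because I can't get rid of them with strip() or even line[:-2]. I would like to know how to remove some of the line breaks, mainly in entries that run over onto a second line:
Wimer John, gauger & cooper, 232 N Broad, h
1511 Callowhill
If it helps, this is pytesseract OCR text.
Thanks,
Cameron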
Answer 0 (score: 1)
Maybe the lines are separated by \r\n?
>>> file = open("1128.txt")
>>> required_stuff = file.read().split('\r\n')
>>> print required_stuff[:10]
['VVIL', '', '1076', '', 'VVIN', '', ' ', '', 'WILSTACH WILLIAM P. & CO. (Wgflliam P.', "TVz'Zsz\xe2\x80\x98ac7z Q\xc2\xbb C/mrles Scott), saddlery hardware,"]
>>> file.close()
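To confirm which line-ending characters the file actually uses before splitting on them, one option is to count them in the raw bytes. The following is a minimal sketch rather than part of the original answer: it assumes Python 3, the helper name count_line_endings is hypothetical, and the sample bytes stand in for the real file contents.

```python
def count_line_endings(data: bytes) -> dict:
    """Count CRLF, bare LF, and bare CR occurrences in raw bytes."""
    crlf = data.count(b"\r\n")
    # Subtract the CRLF pairs so each '\n' or '\r' is counted only once.
    lf = data.count(b"\n") - crlf
    cr = data.count(b"\r") - crlf
    return {"crlf": crlf, "lf": lf, "cr": cr}


# Small in-memory sample; for the real file, read it with open("1128.txt", "rb").
sample = b"Wimer John, gauger & cooper, 232 N Broad, h\r\n1511 Callowhill\r\n\r\n"
print(count_line_endings(sample))  # {'crlf': 3, 'lf': 0, 'cr': 0}
```

If the "crlf" count dominates, splitting on '\r\n' as shown above is a reasonable starting point.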
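Answer 1 (score: 1)
It looks to me as though each "record" in your file is separated by two line breaks, where "line break" means a DOS-style line ending: a carriage return ('\r') followed by a line feed ('\n'). We should therefore first split the text on '\r\n\r\n' to get one element per record.
We can then deal with the unwanted line breaks embedded within a record (necessarily unpaired) by replacing them with replace(). Some of the embedded line breaks appear to follow a hyphen, so it may be worth replacing '-\r\n' with the empty string to rejoin hyphen-wrapped text fragments. After that, any remaining unpaired line break should be replaced with a single space.
So we have: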
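import re
# Python 3: newline='' keeps '\r\n' intact instead of translating it to '\n'.
with open('1128.txt', newline='', encoding='utf-8') as f:
    text = f.read()
# Records are separated by a blank line, i.e. two consecutive CRLF pairs.
records = text.split('\r\n\r\n')
# Rejoin hyphen-wrapped words (a pattern needs re.sub, not str.replace), then turn remaining CRLFs into spaces.
records = [re.sub(r'-\s*\r\n', '', rec).replace('\r\n', ' ') for rec in records]
for rec in records:
    print(rec)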
## VVIL
## 1076
## VVIN
##
## WILSTACH WILLIAM P. & CO. (Wgflliam P. TVz'Zsz‘ac7z Q» C/mrles Scott), saddlery hardware, 38 N 3d
## Wilston John (c), nightwork, 917 S 9th
## Wilt Abraham, carter, 915 Coates
## Wilt Abraham, gentleman, 416 N 3d
## Wilt Alpheus, sash 3: doors, 425 N Front, h 1114 Columbia av
## Wilt Charles, blacksmith, N 40th 11 Lancaster av
## Wilt Charles, flour & feed store, 1306 South
## Wilt Conrad, butcher, stall 33 Kater Market, h N W Wharton & Church
## Wilt George, carter, 1135 Brown
## Wilt George A., despatcher, Reading av & Richmond, h 1114 E Columbia av
## Wilt Henry, tinsmith, 888 N 2d, h 1007 Olive
## Wilt Jacob, cloak manuf., 230 Crown
## Wilt Jacob, shoemaker, 819 St John
## Wilt Jacob J., shipjoiner, 1037 Sarah
## Wilt James A., dealer in fancy goods, 230 Crown
## Wilt James G., machinist, Innes ab Allen
## Wilt John F., clerk, 528 N 2d, h 1114 Columbia av
## Wilt Joseph, chandler, 2327 Coates
## Wilt Joseph L., sheetiron worker, Lancaster av,
## Wilt Paul, heaters, 425 York av
## Wilt William, laborer, r 2325 Coates
## Wilt William, livery stable, 914 Brown, h 10th 13 Brown
## Wilt William, contractor, 719 N 10th
## Wilt William, laborer, Gordon n Cedar
## Wiltbank Daffy, washerw., 3 Price’s ct
## Wiltbank Elizabeth, widow John, 1105 Arch
## Wiltbank Elizabeth M., widow, 1521 Locust
## Wiltbank Samuel P. , broker, 1807 Delancey pl
## Wiltbank W. White, 1521 Locust
## Wiltberger A., druggist, 233 N 2d, h 329 N 5th
## Wiltberger D. S., com. mer., 220 Chestnut, h 329 N 5th _
## Wiltberger Harry A., accountant, Market n 40th
## Wiltberger I. P.. clerk, 309 Branch
## Wiltberger Jacob H., hardware, 225 N 2d, h 711 Wallace
## Wiltberger Richard, tavern, 119 Callowhill
## Wiltberger Theodore M., Market n 40th
## Wiltberger Theodore P., clerk, Market n 40th
## \Vilter George, weaver, S E Dauphin & Amber
## Wilthew Charlton, puddler, 1368 Beach
## VVimer Albert, clerk, 1224 S 6th
## Wimer Annie M., dressmaker, 34 N 8th
## Wimer Augustus, beamer, 13 Cresson, Myk
## Wimer Daniel C., carver, 1402 Mervine
## Wimer Elizabeth B., dry goods, 1511 Callowhill
## Wimer Hannah, wid. Thomas, 1041 Buttonwood
## Wimer John, gauger & cooper, 232 N Broad, h 1511 Callowhill
## Wimer John A., sexton, 210 Bache
## \Vimer John C., cooper, 34 N 8th
## W'imer Joseph, collector, 1224 S 6th
## Wimer Margaret, widow Andrew, 720 S 3d
## \Vimer Wesley P., cooper, 1511 Callowhill
## Wimer William W., bookkeeper P R R 13th & Market, h 1805 Callowhill
## Wimley George H., ship chandler, 512 & 514 S Del av, h 244 Crown
## Wimley John, shoemaker, r 303 Brown
## Wimley William, baker, 244 Crown
## \Vimpfheimer Augustus, salesman, 400 Callowhill
## Wimpfheimer Caroline, widow Abraham, hair dresses & silk nets, 402 N 2d
## Wimpfheimer David, manuf. vinegar,_431 N 3d
## W'impfheimer Jacob, leather, 318 New
## Wimpfheimer Jacob & Co. (Jacob lVi-mpflzeimcr), importer, 400 Callowhill
## Wimpfheimer Joseph, jeweller, 310 N 3d
## Wimpfheimer Maxwell, bookkeeper, 431 N 3d, h 469 N 4th
## Wims Mary S., widow George, Dauphin E Carroll
## Winans Elihu M., tinsmith, 2044 Ridge av
## Winans George, painter, 2044 Ridge av
## Winans Randolph, printer, 2044 Ridge av
## Winberg William H., gentleman, 1428 Marshall
## Winberger Charles, fringes, 120 Coates
## WINCH ALDEN, newspaper ag’t, 320 Chestnut, h Arch ab 13th
## Winch C., spike ma.nuf., Beach ab Warren
## Winchell William E., sailmaker, 7 Grover
## Winchester Augustus, gents’ furnishing goods, 706 Chestnut, h 734 S 9th
## Winchester & C0. (Augustus Wizzcizestcr .5, Wm. S. Marti72.), gents’ fur-’g store, 706 Chestnut
## Winchester James, weaver, Hope bel Putnam
## Winchester John, carpenter, Ridge av, Rox
## Winchester John, weaver, 1612 Philip
## Winchester John, weaver, 135 Thompson
## Winchester John, grocer, 301 Thompson
## Winchester Margaret, wid Robert, 324 Dean
## Winchester Robert, machinist, 135 Thompson
## W'inchester Samuel, merchant, 236 Market, h 258 S 10th
## Winchester William, weaver, 135 Thompson
## Winchester William W., bookkeeper, 307 Branch, h 2101 Oxford
## Windel Hannah, teacher, N 41st 11 Market
## Vllinder Ernest, carpenter, 1124 Sophia
## Winder Frederick, tailor, 1157 Passyunk rd
## Winder Harman, hotel, 926 N Front
## Winder John, driver, Daniel pl
## Winder John B., gentleman, Herman, Gtn
## Winder Joseph, hotel, 76 Frankford
## Winder Robert, carman, 906 N 12th
## Winder Sebastian, shoemaker, Ne1son’s ct
## Winder W. H., mer. 314-}; Walnut, h 415 S 15th
## Winderly Charles, shoemaker, York n Trenton av
## Winderoth Wyant, shoemaker, Champion pl
## Winderstein Frederick, shoemaker, r 1213 Apple
## Windevender David, shipjoiner, 1021 Ross
## Windish Frederick, tailor, 1129 Charlotte
## Vllindle Benjamin, file manuf. r 70 N 2d
## Windle George, salesm. 633 Market, h 1210 S 10th
## Windle William, superintendent, 1210 S 10th
## Windlerwin Julius, bootfitter, 1225 N 2d
## Windles Richard, carpenter, Oxford n Hedge
## Windner John, brickmaker, 138 Diamond
## Windorf Christian, dealer, r 832 Carpenter
## Windorf Frank, dealer, r 832 Carpenter
## Windrim James H., architect, 1518 Sansom
## Winebaker Wilhelmina, wid Charles, 320 Willow
## Wineberg Samuel, beef butcher, stalls 10 cl: 30 Girard av Market, h 944 St John
## Winebrener (K: Co. (Harry C. Wiiwbrevzer @Freclerick L. Pleis), coal dealers 3d & Thompson
## Winebrener David S., hardware, 49 N 3d, h 1627 Vine
## Winebrener David, merchant, 241 S 18th
## Winebrener Harry C., coal dealer, 3d 6: Thompson, h 241 S 18th
## Wineburg John H., tanner, 535 N Front
## Winegar Francis, cabinetmaker, 117 W'alnut, h 235 Shippen
## Winegardener John, barkeeper, 9th cl: Arch, h 5th & Master
## Winegar-dner Adam, laborer, r Hope 11 Canal
## Winegar-dner Andreas, tailor, 1723 N 3d
## Winegartner Anton, gentleman, 1409 Randolph
## Winehold Benjamin, driver, 1214 S 4th
## Winemore John IL, salesman, 16 S 2d, h 1110 S 2d
## Winfiller Andreas, butcher, 1410 Franklin
## Winfield Charles, shipjoiner, 120 China
## Window Shades and Curtain Goods, \‘Vholcsalc and Retail;
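The same idea can be wrapped in a function that works on a string, which makes it easy to try on a small sample without the file. This is a sketch under a few assumptions, not the answer's own code: it targets Python 3, the name join_wrapped_records is hypothetical, and it first normalises every line-ending style to '\n' so that files with mixed endings are handled too.

```python
import re


def join_wrapped_records(text: str) -> list:
    """Split text into blank-line-separated records and unwrap each one."""
    # Normalise CRLF and bare CR to LF so one pattern covers every style.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    records = []
    for chunk in re.split(r"\n{2,}", text):
        if not chunk.strip():
            continue  # skip empty or whitespace-only chunks
        # A hyphen right before a line break is treated as a wrapped word.
        chunk = re.sub(r"-\n", "", chunk)
        # Any other embedded line break becomes a single space.
        records.append(chunk.replace("\n", " ").strip())
    return records


sample = (
    "Wimer John, gauger & cooper, 232 N Broad, h\r\n1511 Callowhill\r\n\r\n"
    "Wimer John A., sexton, 210 Bache\r\n\r\n"
)
for record in join_wrapped_records(sample):
    print(record)
```

Unlike the code above, this version drops whitespace-only records. Joining on every hyphen that precedes a line break is a guess about the OCR output, so a genuinely hyphenated name that happens to wrap would lose its hyphen.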
Answer 2 (score: 0)
Use the strip() method of str. With no arguments, it removes all kinds of leading and trailing whitespace.
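Note that strip() only touches the two ends of a string, so a line break embedded in the middle survives. A short illustration, assuming Python 3 and using a made-up sample string based on the entry in the question:

```python
line = "  Wimer John, gauger & cooper, 232 N Broad, h\r\n1511 Callowhill\r\n"
stripped = line.strip()

# Leading spaces and the trailing '\r\n' are gone...
print(repr(stripped))
# ...but the '\r\n' in the middle of the string is still there.
print("\r\n" in stripped)  # True
```

So strip() is a good fit when each line is processed separately, while joining a wrapped entry still needs the embedded line break to be replaced.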